Closed corneliusroemer closed 10 months ago
it would be good to know which parts of Augur call this function
augur translate
parses GFF and GenBank files using utils.load_features, which relies on BCBio.GFF and Bio.SeqIO, respectively. No other code in augur uses load_features
. Some comments in that code indicate it was implemented just for TB.
I'd propose making two functions io.read_gff
and io.read_genbank
and adding a whole bunch of tests!
just here to wave a caution flag to ensure that this works how we expect across VCF and various GFF files for, ex, bacteria
I had this thought too! We have 2 TB tests (tests/builds/tb
and tests/builds/tb_drm
) however they're (a) not run by CI and (b) the outputs not diff'd, so we'd only know about a breaking change if it caused a command to fail. I just ran the snakefiles in those directories and they still work, so that's something!
BioPython doesn't have a parser but recommends gffutils. gffutils has a small set of dependencies but uses a database layer which seemed a bit overkill (although doesn't seem slow in my testing):
# using the TB GFF which is in this repo & the gffutils incantation in their tutorial
db = gffutils.create_db("Mtb_H37Rv_NCBI_Annot.gff", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
# are all features read? GFF file has 8407 non-comment lines
len(list(db.all_features())) # 8407
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start # 1
region.stop # 4411532
region.strand # '+'
# The same, but for MPX genome MT903344 which Cornelius references above
db = gffutils.create_db("MT903344.gff3", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
len(list(db.all_features())) # 381, which matches my expectation
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start #1
region.end # 197233
region.strand # '+'
Whichever library we use we're going to have to make some decisions (in code or via arguments) about what name to use for features -- id vs name vs locus_tag -- see https://github.com/nextstrain/augurlinos/issues/4 for some discussion about this... 5 years ago! Looking at genes (featuretype=='gene'
), which is what utils.load_features
does, we have the following situations from the two above GFFs (using an example gene from each):
GFF | .id |
.attributes['Name'] |
.attributes['locus_tag'] |
.attributes['gene'] |
---|---|---|---|---|
TB | gene10 | ppiA_Rv0009 | Rv0009 | ppiA |
MPX | gene-MPXV-UK_P2-013 | MPXV-UK_P2-013 | MPXV-UK_P2-013 | KeyError |
gene
(annotation name, not feature type) and fallback to locus_tag
if that doesn't exist. For Genbank we do the opposite. I just ran into this issue, too, trying to use the Nextclade gene map GFF as an input to augur translate. After thinking about this more and reading the comments above, I had a couple of additional thoughts:
seqname
field. The start
and end
coordinates we provide in a GFF only make sense in relation to a specific sequence which we need to define in this field. See the monkeypox gene map as an example of where this is done properly. (Edit: Most Nextclade GFFs do include sequence id!)source
value of the type
column plus gene=nuc
approach that we use in the monkeypox gene map, we should use the sequence-region
pragma to define the complete start and end coordinates for each sequence id referenced in the file (this line of the monkeypox gene map). We'd need to confirm that gffutils parses this as expected, but this appears to be the standard way to represent this information.ID
and Name
. Since INSDC databases require a locus_tag
qualifier for all registered genomes, we should still accept (and perhaps prefer) this qualifier even when ID
and Name
are provided. We should also consider allowing the user to specific the qualifier used for gene names in the GFF file they pass to any augur command that calls the forthcoming io.read_gff
function, for backwards compatibility with our own data like the Nextclade gene maps that use gene_name
as the qualifier key.As an example of how we might properly format our GFF3 files for parsing in Augur (and elsewhere), here is the gene map for SARS-CoV-2 from nextclade dataset get --name sars-cov-2
:
# Gene map (genome annotation) of SARS-CoV-2 in GFF format.
# For gene map purpses we only need some of the columns. We substitute unused values with "." as per GFF spec.
# See GFF format reference at https://www.ensembl.org/info/website/upload/gff.html
# seqname source feature start end score strand frame attribute
. . gene 26245 26472 . + . gene_name=E
. . gene 26523 27191 . + . gene_name=M
. . gene 28274 29533 . + . gene_name=N
. . gene 266 13468 . + . gene_name=ORF1a
. . gene 13468 21555 . + . gene_name=ORF1b
. . gene 25393 26220 . + . gene_name=ORF3a
. . gene 27202 27387 . + . gene_name=ORF6
. . gene 27394 27759 . + . gene_name=ORF7a
. . gene 27756 27887 . + . gene_name=ORF7b
. . gene 27894 28259 . + . gene_name=ORF8
. . gene 28284 28577 . + . gene_name=ORF9b
. . gene 21563 25384 . + . gene_name=S
And here is what I would propose it should look like instead:
##gff-version 3.1.26
##sequence-region MN908947 1 29903
# seqname source feature start end score strand frame attribute
MN908947 GenBank gene 26245 26472 . + . ID=E
MN908947 GenBank gene 26523 27191 . + . ID=M
MN908947 GenBank gene 28274 29533 . + . ID=N
MN908947 GenBank gene 266 13468 . + . ID=ORF1a
MN908947 GenBank gene 13468 21555 . + . ID=ORF1b
MN908947 GenBank gene 25393 26220 . + . ID=ORF3a
MN908947 GenBank gene 27202 27387 . + . ID=ORF6
MN908947 GenBank gene 27394 27759 . + . ID=ORF7a
MN908947 GenBank gene 27756 27887 . + . ID=ORF7b
MN908947 GenBank gene 27894 28259 . + . ID=ORF8
MN908947 GenBank gene 28284 28577 . + . ID=ORF9b
MN908947 GenBank gene 21563 25384 . + . ID=S
My general feeling now is that we might:
io.read_gff
to follow the expected GFF specification but with backwards compatibility for existing formatsI forgot to include context of what I was doing to run into this same issue: I've been developing an updated Nextstrain introductory tutorial to expand the current "Zika tutorial" and use SARS-CoV-2 instead. I was hoping to use nextclade dataset get
as a way to provision the reference data and gene map users would need to run all of the subsequent augur commands without needing a separate GenBank file. The nextclade dataset
interface seems generally like a great way to expose users to curated references for different pathogens. So, my hope is that the changes we're talking about in this issue would eventually allow us to use these data in an updated tutorial/workshop setting.
The immediate blocking issue in the new tutorial case is that augur translate
doesn't know to look for a gene_name
qualifier, but it's clear from the conversation above that this is a smaller part of a bigger rewrite that needs to happen.
Looking at other Nextclade datasets today, I noticed that most others follow the general GFF3 pattern we'd expect. For example, the H3N2 HA gene map looks like this:
##gff-version 3
##sequence-region CY163680.1 1 1737
CY163680.1 feature gene 18 65 . + . gene_name=SigPep
CY163680.1 feature gene 66 1052 . + . gene_name=HA1
CY163680.1 feature gene 1053 1715 . + . gene_name=HA2
This file validates as expected with the GFF3 validator. Also, this file works well with gffutils. For example, I can run the following code to extract the HA1 sequence from the corresponding reference FASTA file:
import gffutils
# Try to use any valid qualifier key as an id in the given order.
db = gffutils.create_db("genemap.gff", ":memory:", id_spec=["locus_tag", "gene", "gene_name"]
# Get coordinates for HA1.
db["HA1"]
# Get sequence for HA1 from a FASTA file.
db["HA1"].sequence("reference.fasta")
I like that gffutils lets us specify a list of valid qualifier keys to use as the gene id. This feature would eliminate the need to manually check for each separate key we support in utils.load_features
.
On the other hand, I don't see any functionality in gffutils that parses the sequence-region
pragma in a way that would properly give us the full extent of a sequence for the nuc
annotation. Interestingly, the BCBio parser does parse this pragma. For example, here is how it parses the TB GFF in augur's tests:
>>> from BCBio import GFF
>>> record = list(GFF.parse("tests//builds/tb/data/Mtb_H37Rv_NCBI_Annot.gff"))[0]
>>> record.annotations
{'gff-version': ['3'],
'sequence-region': [('MTB_anc', 0, 4411532)],
'species': ['https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=83332']}
I haven't found any examples of GFF files where people set the type
column value to source
as a way of defining the whole range of coordinates for the sequence (as we do in #976). Instead, the GFF spec suggests that the region
value might be the closest to a standard for this functionality. (Edit: I do see now that this is the standard name for this coordinate range in GenBank files like the Zika reference we use.)
+1 for "spend[ing] some time with the GFF3 specification"
gffutils doesn't parse any directives/pragmas, but it does extract them. Parsing is a simple matter after that:
>>> { fields[0]: tuple(fields[1:]) for fields in (line.split() for line in db.directives) }
{'gff-version': ('3',), 'sequence-region': ('CY163680.1', '1', '1737')}
Related to #881
Currently,
augur translate
does not implementgff3
standard properly. Confusingly, if you use.gff
files generated automatically by Genbank from a given.gb
file, augur translate will behave differently.Importantly,
nuc
s won't be annotated if you use a genemap fromgff
rather thangb
which is clearly a bug.The solution is to: a) make augur translate accept standard gff3 files, outputting nuc annotations etc b) make augur translate behave identical whether it reads in a
.gb
or.gff
annotation (that are identical in information).I had opened #950 for a stopgap fix for the "no nuc annotation with .gff" bug, but the actual solution is to read
.gff
properly and refactoraugur translate
to be standards conforming.Below are comments made on the issue and PR respectively.
I got extremely confused by this bug. I encountered it as the following error in a workflow that uses augur translate and just has a .gff as input, not a .gb with nuc annotation.
From @huddlej:
From @emmahodcroft