BUG: `augur translate` not faithful to `gff3` standard, can't annotate nucs, which are required.

Related to #881

Currently, augur translate does not implement gff3 standard properly. Confusingly, if you use .gff files generated automatically by Genbank from a given .gb file, augur translate will behave differently.

Importantly, nucs won't be annotated if you use a genemap from gff rather than gb which is clearly a bug.

The solution is to: a) make augur translate accept standard gff3 files, outputting nuc annotations etc b) make augur translate behave identical whether it reads in a .gb or .gff annotation (that are identical in information).

I had opened #950 for a stopgap fix for the "no nuc annotation with .gff" bug, but the actual solution is to read .gff properly and refactor augur translate to be standards conforming.

Below are comments made on the issue and PR respectively.

I got extremely confused by this bug. I encountered it as the following error in a workflow that uses augur translate and just has a .gff as input, not a .gb with nuc annotation.

The error is:
Validating schema of 'auspice/monkeypox_global.json'...
        ERROR: 'nuc' is a required property. Trace: properties - meta - properties - genome_annotations - required
Validation of 'auspice/monkeypox_global.json' failed.

------------------------
Validation of auspice/monkeypox_global.json failed. Please check this in a local instance of `auspice`, as it is not expected to display correctly. 
------------------------
Thrown by augur export. I couldn't figure out how to get it to just read nuc from the .gff, which I assumed was possible. Couldn't believe it isn't. But that's the way it seems to be 😬

This is a very serious bug, it really makes using the same genemap for Nextclade and Nextclade reference build of monkeypox impossible which is not good, discrepancies arise like this.

Currently, augur translate cannot output nuc annotation, if .gff is used as input. This is due to the features being loaded being overly restrictive to genes.

Allowing source to be read in as features allows the .gff to contain nuc annotations in the following format:

MT903344.1 Genbank source 1 197233 . + . locus_tag=nuc That way augur export doesn't crash if a .gff` is used as annotation input for translate.

Related issue(s) Related to https://github.com/nextstrain/augur/issues/881

This fix allows me to use the same genemap.gff for both Nextclade and the workflow that creates the Nextclade ref tree. This is very good for consistency and reduced maintenance effort when Monkeypox genemap is changing a lot.

Testing I don't know if this touches any other functionality. It's a util function so there's a risk. But then at the same time, it shouldn't be wrong to output a genemap with nuc annotated.

CI passes so looks good. One may want to systematically check every function where this gff load utility is used. But first would be good to see if this PR is usable or it has some systematic flaw.

From @huddlej:

We discussed during triage meeting today that it would be good to know which parts of Augur call this function [read gff], to get a sense of the side-effects this change might have. Along those same lines, it would be awesome to see a functional test for the use case you describe here to confirm that this simple change has the effect you want and that future changes to this code won't break that functionality.

From @emmahodcroft

It's been sooooo long since I looked at this, and I would have to go dig to go find old TB files (and I'm travelling without external HDD at the moment), but I'd be a little cautious here, as it flickers some vague memory inside of me from when I used to TB stuff.

Unfortunately I really don't remember much of this so it would take some digging for me to remember.... so mostly just here to wave a caution flag to ensure that this works how we expect across VCF and various GFF files for, ex, bacteria.

it would be good to know which parts of Augur call this function

augur translate parses GFF and GenBank files using utils.load_features, which relies on BCBio.GFF and Bio.SeqIO, respectively. No other code in augur uses load_features. Some comments in that code indicate it was implemented just for TB.

I'd propose making two functions io.read_gff and io.read_genbank and adding a whole bunch of tests!

just here to wave a caution flag to ensure that this works how we expect across VCF and various GFF files for, ex, bacteria

I had this thought too! We have 2 TB tests (tests/builds/tb and tests/builds/tb_drm) however they're (a) not run by CI and (b) the outputs not diff'd, so we'd only know about a breaking change if it caused a command to fail. I just ran the snakefiles in those directories and they still work, so that's something!

Parsing GFF

BioPython doesn't have a parser but recommends gffutils. gffutils has a small set of dependencies but uses a database layer which seemed a bit overkill (although doesn't seem slow in my testing):

# using the TB GFF which is in this repo & the gffutils incantation in their tutorial
db = gffutils.create_db("Mtb_H37Rv_NCBI_Annot.gff", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge')
# are all features read? GFF file has 8407 non-comment lines
len(list(db.all_features())) # 8407
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start # 1
region.stop # 4411532
region.strand # '+'

# The same, but for MPX genome MT903344 which Cornelius references above
db = gffutils.create_db("MT903344.gff3", dbfn="test.db", force=True, keep_order=True, sort_attribute_values=True, merge_strategy='merge') 
len(list(db.all_features())) # 381, which matches my expectation
region = [feat for feat in db.all_features() if feat.featuretype=='region'][0]
region.start #1
region.end # 197233
region.strand # '+'

name vs locus_tag vs gene

Whichever library we use we're going to have to make some decisions (in code or via arguments) about what name to use for features -- id vs name vs locus_tag -- see https://github.com/nextstrain/augurlinos/issues/4 for some discussion about this... 5 years ago! Looking at genes (featuretype=='gene'), which is what utils.load_features does, we have the following situations from the two above GFFs (using an example gene from each):

GFF	`.id`	`.attributes['Name']`	`.attributes['locus_tag']`	`.attributes['gene']`
TB	gene10	ppiA_Rv0009	Rv0009	ppiA
MPX	gene-MPXV-UK_P2-013	MPXV-UK_P2-013	MPXV-UK_P2-013	`KeyError`

Differences in how we treat GFF vs GenBank

For GFF we use 'gene' types ("featuretype") not CDS. For Genbank we do the opposite.
For the name of each feature, for GFFs we try to use gene (annotation name, not feature type) and fallback to locus_tag if that doesn't exist. For Genbank we do the opposite.

I just ran into this issue, too, trying to use the Nextclade gene map GFF as an input to augur translate. After thinking about this more and reading the comments above, I had a couple of additional thoughts:

+1 for James’s suggestions to create separate read functions for GFF and GenBank files and lots of functional tests for both!
+1 for replacing BCBio.GFF with gffutils, since the latter is maintained (and recommended by BCBio’s author).
We do run the TB and TB_DRM tests in CI through the test runner script.
~None of our example or production GFF files include “a comment that identifies the file format and version”, as required by GFF3 standard. I suspect none of our files will pass a GFF validator, for this simple (if silly) reason.~ (Edit: This is not true; the test GFFs in Augur and most of the Nextclade GFFs define versions and would validate. See subsequent comments below.)
~Few of~ our example or production GFF files (should) define the seqname field. The start and end coordinates we provide in a GFF only make sense in relation to a specific sequence which we need to define in this field. See the monkeypox gene map as an example of where this is done properly. (Edit: Most Nextclade GFFs do include sequence id!)
Instead of using the source value of the type column plus gene=nuc approach that we use in the monkeypox gene map, we should use the sequence-region pragma to define the complete start and end coordinates for each sequence id referenced in the file (this line of the monkeypox gene map). We'd need to confirm that gffutils parses this as expected, but this appears to be the standard way to represent this information.
Building on the discussion 5 years ago about the lack of standard naming conventions for the qualifier keys, the GFF3 specification does define expected key names for genes which include ID and Name. Since INSDC databases require a locus_tag qualifier for all registered genomes, we should still accept (and perhaps prefer) this qualifier even when ID and Name are provided. We should also consider allowing the user to specific the qualifier used for gene names in the GFF file they pass to any augur command that calls the forthcoming io.read_gff function, for backwards compatibility with our own data like the Nextclade gene maps that use gene_name as the qualifier key.

As an example of how we might properly format our GFF3 files for parsing in Augur (and elsewhere), here is the gene map for SARS-CoV-2 from nextclade dataset get --name sars-cov-2:

# Gene map (genome annotation) of SARS-CoV-2 in GFF format.
# For gene map purpses we only need some of the columns. We substitute unused values with "." as per GFF spec.
# See GFF format reference at https://www.ensembl.org/info/website/upload/gff.html
# seqname       source  feature start   end     score   strand  frame   attribute
.       .       gene    26245   26472   .       +       .        gene_name=E
.       .       gene    26523   27191   .       +       .        gene_name=M
.       .       gene    28274   29533   .       +       .        gene_name=N
.       .       gene    266     13468   .       +       .        gene_name=ORF1a
.       .       gene    13468   21555   .       +       .        gene_name=ORF1b
.       .       gene    25393   26220   .       +       .        gene_name=ORF3a
.       .       gene    27202   27387   .       +       .        gene_name=ORF6
.       .       gene    27394   27759   .       +       .        gene_name=ORF7a
.       .       gene    27756   27887   .       +       .        gene_name=ORF7b
.       .       gene    27894   28259   .       +       .        gene_name=ORF8
.       .       gene    28284   28577   .       +       .        gene_name=ORF9b
.       .       gene    21563   25384   .       +       .        gene_name=S

And here is what I would propose it should look like instead:

##gff-version 3.1.26
##sequence-region MN908947 1 29903
# seqname       source  feature start   end     score   strand  frame   attribute
MN908947       GenBank       gene    26245   26472   .       +       .        ID=E
MN908947       GenBank       gene    26523   27191   .       +       .        ID=M
MN908947       GenBank       gene    28274   29533   .       +       .        ID=N
MN908947       GenBank       gene    266     13468   .       +       .        ID=ORF1a
MN908947       GenBank       gene    13468   21555   .       +       .        ID=ORF1b
MN908947       GenBank       gene    25393   26220   .       +       .        ID=ORF3a
MN908947       GenBank       gene    27202   27387   .       +       .        ID=ORF6
MN908947       GenBank       gene    27394   27759   .       +       .        ID=ORF7a
MN908947       GenBank       gene    27756   27887   .       +       .        ID=ORF7b
MN908947       GenBank       gene    27894   28259   .       +       .        ID=ORF8
MN908947       GenBank       gene    28284   28577   .       +       .        ID=ORF9b
MN908947       GenBank       gene    21563   25384   .       +       .        ID=S

My general feeling now is that we might:

spend some time with the GFF3 specification, making sure we actually understand what we want our files to look like
write tests for what files we should be able to parse
implement io.read_gff to follow the expected GFF specification but with backwards compatibility for existing formats
implement proper GFF3 formats for all of our own files

I forgot to include context of what I was doing to run into this same issue: I've been developing an updated Nextstrain introductory tutorial to expand the current "Zika tutorial" and use SARS-CoV-2 instead. I was hoping to use nextclade dataset get as a way to provision the reference data and gene map users would need to run all of the subsequent augur commands without needing a separate GenBank file. The nextclade dataset interface seems generally like a great way to expose users to curated references for different pathogens. So, my hope is that the changes we're talking about in this issue would eventually allow us to use these data in an updated tutorial/workshop setting.

The immediate blocking issue in the new tutorial case is that augur translate doesn't know to look for a gene_name qualifier, but it's clear from the conversation above that this is a smaller part of a bigger rewrite that needs to happen.

Looking at other Nextclade datasets today, I noticed that most others follow the general GFF3 pattern we'd expect. For example, the H3N2 HA gene map looks like this:

##gff-version 3
##sequence-region CY163680.1 1 1737
CY163680.1  feature gene    18  65  .   +   .   gene_name=SigPep
CY163680.1  feature gene    66  1052    .   +   .   gene_name=HA1
CY163680.1  feature gene    1053    1715    .   +   .   gene_name=HA2

This file validates as expected with the GFF3 validator. Also, this file works well with gffutils. For example, I can run the following code to extract the HA1 sequence from the corresponding reference FASTA file:

import gffutils

# Try to use any valid qualifier key as an id in the given order.
db = gffutils.create_db("genemap.gff", ":memory:", id_spec=["locus_tag", "gene", "gene_name"]

# Get coordinates for HA1.
db["HA1"]

# Get sequence for HA1 from a FASTA file.
db["HA1"].sequence("reference.fasta")

I like that gffutils lets us specify a list of valid qualifier keys to use as the gene id. This feature would eliminate the need to manually check for each separate key we support in utils.load_features.

On the other hand, I don't see any functionality in gffutils that parses the sequence-region pragma in a way that would properly give us the full extent of a sequence for the nuc annotation. Interestingly, the BCBio parser does parse this pragma. For example, here is how it parses the TB GFF in augur's tests:

>>> from BCBio import GFF
>>> record = list(GFF.parse("tests//builds/tb/data/Mtb_H37Rv_NCBI_Annot.gff"))[0]
>>> record.annotations
{'gff-version': ['3'],
 'sequence-region': [('MTB_anc', 0, 4411532)],
 'species': ['https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=83332']}

I haven't found any examples of GFF files where people set the type column value to source as a way of defining the whole range of coordinates for the sequence (as we do in #976). Instead, the GFF spec suggests that the region value might be the closest to a standard for this functionality. (Edit: I do see now that this is the standard name for this coordinate range in GenBank files like the Zika reference we use.)

+1 for "spend[ing] some time with the GFF3 specification"

gffutils doesn't parse any directives/pragmas, but it does extract them. Parsing is a simple matter after that:

>>> { fields[0]: tuple(fields[1:]) for fields in (line.split() for line in db.directives) }
{'gff-version': ('3',), 'sequence-region': ('CY163680.1', '1', '1737')}

nextstrain / augur