nextstrain / augurlinos

A collection of modules for molecular epidemiology

'Locus_tag' or 'gene' when reading in GB/GFF files? #4

Open emmahodcroft opened 6 years ago

emmahodcroft commented 6 years ago

Currently, load_features in util.py looks for feature type "CDS" and feature qualifier "locus_tag" to get the gene name. For Zika this works as expected. However, for TB, the "CDS" features do not contain "locus_tag" (or anything else useful). Further, in the "gene" features (as opposed to "CDS"), the "locus_tag" is not the identifier commonly used for genes (e.g. gene='dnaA', locus_tag='Rv0001').

For GFF files, I have modified load_features (in the 'vcf' branch) so that it looks for feature type 'gene' and qualifier 'gene' instead of 'CDS' and 'locus_tag'. However, for avian influenza GenBank files, the combination should be feature type 'CDS' and qualifier 'gene' (this returns the expected PB2, HA, NA, etc.).

We should probably either look to see if there is a general rule (or two) we can put in place to ensure we're always getting the common gene names, or we should consider turning this into an option of some kind for users to specify.
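To make the "option" idea concrete, here is a rough sketch of what a configurable lookup could look like. It is not the current load_features code: the Feature class is a hypothetical stand-in for a parsed annotation record, and select_features, feature_type and name_qualifiers are names made up for this illustration. The idea is that the caller picks the feature type and an ordered list of qualifiers to try, and the first qualifier present supplies the name.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for one annotation feature (not a real library class)."""
    type: str
    qualifiers: dict = field(default_factory=dict)


def select_features(features, feature_type="CDS",
                    name_qualifiers=("gene", "locus_tag")):
    """Return {name: feature} for features of the requested type.

    The name comes from the first qualifier in `name_qualifiers` that is
    present and non-empty; features with none of them are skipped.
    """
    selected = {}
    for feat in features:
        if feat.type != feature_type:
            continue
        for key in name_qualifiers:
            value = feat.qualifiers.get(key)
            if value:
                # Qualifier values are often lists; accept plain strings too.
                name = value[0] if isinstance(value, (list, tuple)) else value
                selected[name] = feat
                break
    return selected


# TB-like annotation: the name lives on the "gene" feature, not on "CDS".
tb = [
    Feature("gene", {"gene": ["dnaA"], "locus_tag": ["Rv0001"]}),
    Feature("CDS", {}),
]
tb_genes = select_features(tb, feature_type="gene")

# Influenza-like annotation: "CDS" features carry a "gene" qualifier.
flu = [Feature("CDS", {"gene": ["PB2"]}), Feature("CDS", {"gene": ["HA"]})]
flu_genes = select_features(flu, feature_type="CDS")

# Restricting the lookup to "locus_tag" only.
tb_tags = select_features(tb, feature_type="gene",
                          name_qualifiers=("locus_tag",))
```

With defaults like these, the three cases above differ only in the arguments passed. Whether "gene" should take priority over "locus_tag" by default is a judgment call.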

rneher commented 6 years ago

Unfortunately, there doesn't seem to be a general rule for this. Most bacterial genome annotation software adds a locus tag to the annotation, and NCBI requires one as far as I know. That's why I thought the locus_tag would be a good field to use.