jorvis / biocode

Bioinformatics code libraries and scripts
MIT License

convert_gff3_to_gbk.py, add full support for non-protein-coding genes #24

Closed jonathancrabtree closed 4 years ago

jonathancrabtree commented 10 years ago

If convert_gff3_to_gbk.py finds a tRNA, rRNA, or other non-protein-coding gene in the input GFF3, it writes the parent "gene" feature to the output GenBank file, but nothing else. Only protein-coding genes with an mRNA feature below the parent gene appear to be converted fully. It looks like biocodegenbank.print_biogene needs to be generalized to handle all gene types, or at least all those that currently have a corresponding representation in the biothings module.
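As a rough illustration of what "generalized" could mean here, the sketch below prints a gene and whatever RNA children it has, without special-casing mRNA. The `Feature` class and the `location` and `feature_lines` helpers are hypothetical names made up for this example; they are not part of biocode and do not reflect how print_biogene is actually written.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a gene or RNA feature (not a biocode class)."""
    key: str                      # GenBank feature key, e.g. "gene", "tRNA", "rRNA"
    start: int                    # 1-based inclusive start
    end: int                      # 1-based inclusive end
    strand: str = "+"
    qualifiers: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def location(feat):
    """Format a simple (unspliced) GenBank location string."""
    span = f"{feat.start}..{feat.end}"
    return span if feat.strand == "+" else f"complement({span})"


def feature_lines(feat):
    """Return feature-table lines for a feature and, recursively, its children."""
    # Feature key starts in column 6; the location starts in column 22.
    lines = [f"     {feat.key:<16}{location(feat)}"]
    for name, value in feat.qualifiers.items():
        lines.append(f'{" " * 21}/{name}="{value}"')
    # Every child is emitted, whatever its type (mRNA, tRNA, rRNA, ...).
    for child in feat.children:
        lines.extend(feature_lines(child))
    return lines


# Illustrative values only: a gene with a single tRNA child.
gene = Feature(
    "gene", 100, 172, strand="-",
    qualifiers={"locus_tag": "EXAMPLE_0001"},
    children=[Feature("tRNA", 100, 172, strand="-",
                      qualifiers={"locus_tag": "EXAMPLE_0001",
                                  "product": "tRNA-Ala"})],
)
lines = feature_lines(gene)
print("\n".join(lines))
```

The point of the sketch is only that the child loop does not filter on feature type, so a tRNA or rRNA below a gene is written out the same way an mRNA would be.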

mikemc commented 4 years ago

I came to post an issue, but I think I'm having the same problem noted above, so I will just add a concrete example of why this is a problem. I am trying to extract 16S sequences that are annotated in a GenBank file (example). In the GenBank file, a gene is identified as 16S by its product name:

     gene            517900..517988
                     /locus_tag="SAMN05444282_102329"
     rRNA            517900..517988
                     /locus_tag="SAMN05444282_102329"
                     /product="16S ribosomal RNA . Bacterial SSU"

However, the product name doesn't make it into the GFF3 file, so it is impossible to select the 16S sequences downstream separately from other rRNAs:

FNQD01000002    GenBank gene    517900  517988  .   +   .   ID=SAMN05444282_102329;locus_tag=SAMN05444282_102329
FNQD01000002    GenBank rRNA    517900  517988  .   +   .   ID=SAMN05444282_102329.rRNA.1;Parent=SAMN05444282_102329
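
To make the downstream step concrete, here is a minimal sketch of the selection I would like to run. It assumes the rRNA row carried a `product` attribute in column 9, which the output above does not have; the attribute name, the helper names, and the 23S row are all assumptions for illustration.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, decoding %-escapes."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def select_rrnas_by_product(gff3_lines, needle="16S"):
    """Return (seqid, start, end, strand, ID) for rRNA rows whose product contains needle."""
    hits = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "rRNA":
            continue
        attrs = parse_attributes(cols[8])
        # Rows without a product attribute (as in the output above) never match.
        if needle in attrs.get("product", ""):
            hits.append((cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("ID")))
    return hits


# The product attributes and the second rRNA row are hypothetical.
gff3 = [
    "##gff-version 3",
    "FNQD01000002\tGenBank\tgene\t517900\t517988\t.\t+\t.\t"
    "ID=SAMN05444282_102329;locus_tag=SAMN05444282_102329",
    "FNQD01000002\tGenBank\trRNA\t517900\t517988\t.\t+\t.\t"
    "ID=SAMN05444282_102329.rRNA.1;Parent=SAMN05444282_102329;"
    "product=16S ribosomal RNA . Bacterial SSU",
    "FNQD01000002\tGenBank\trRNA\t600\t700\t.\t-\t.\t"
    "ID=x.rRNA.1;Parent=x;product=23S ribosomal RNA",
]
hits = select_rrnas_by_product(gff3)
print(hits)
```

With the rRNA row exactly as converted above (no product attribute), the same call returns an empty list, which is the problem.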

jorvis commented 4 years ago

I'll see if I can get this added tonight.

jorvis commented 4 years ago

Last night has shifted into today.

jorvis commented 4 years ago

@mikemc Is it possible to attach your GBK file so I can test with it, or is it private?

mikemc commented 4 years ago

@jorvis The example I gave is from this GenBank file

jorvis commented 4 years ago

@mikemc - The current version of the code should fix your issue. The tRNAs now export with anticodon reported and rRNAs with product. I'm not closing this ticket yet, as what @jonathancrabtree reported is actually the reverse conversion, going from GFF3 -> GBK.

jorvis commented 4 years ago

Closing. I've now confirmed that tRNA and rRNA annotation is retained when a source GenBank flat file is converted to GFF3 and then converted back to GenBank.

mikemc commented 4 years ago

Great, thanks @jorvis! I haven't had a chance to test yet but sounds like this covers my issue.