jorvis / biocode

Bioinformatics code libraries and scripts
MIT License
504 stars 247 forks source link

write_fasta_from_gff.py silently ignores a (potentially large) portion of the input gff #32

Closed jonathancrabtree closed 9 years ago

jonathancrabtree commented 9 years ago

Running write_fasta_from_gff.py on the output of convert_metagenemark_gff_to_gff3.py, I observed some large discrepancies between the number of CDS features in the GFF3 file and the number of CDS features written by write_fasta_from_gff.py In one case the GFF3 file contained 16042 CDS features, but the FASTA output contained only 10299 sequences, a loss of 5743 CDS sequences, ~36% of the total.

jonathancrabtree commented 9 years ago

Debugging revealed that the problem lies with get_gff3_features in biocodegff.py. There are two problems with this function:

  1. It interprets any GFF line starting with "##FASTA" as the beginning of the GFF3 FASTA section. I don't think this is correct, at least based on my reading of the GFF3 spec. I'd say that "##FASTA" (possibly with trailing whitespace) is the only thing that should be allowed on the line.
  2. It fails to raise an error when it sees (what it thinks is) FASTA sequence data that is not preceded by a valid FASTA defline. This is actually the more serious of the two errors, and is the one that allows thousands of input genes to be silently dropped.

The problem in this particular case is that MetaGeneMark prints the protein sequences in comments in the GFF, like this:

##Protein 11140
##MAQFRVDSEQIQQAAAAVGTSVSAIRDAVNGMYTNLQQLQSVWTGSAATQFASTAQQWRA 
##AQQQMEQSLEAIQQAMQHASGVYLDAEAQATSLFGMG
##end-Protein

convert_metagenemark_gff_to_gff3.py echoes these comments to the GFF3 file unchanged, which it should not do, because..."FASTA" is a valid amino acid sequence. So all it takes is for "FASTA" to appear at the beginning of any protein line and biocodegff.py will ignore the entire rest of the file, without printing any errors or warnings:

##Protein 10298
##MRMQKVQKKLSETSFQDRLDFAATHSKTSVLRMCNSQCTGLCARDVLRARARFGSNALER
##KKQNSLASRLVQAFINPFSCILFVLALISCINDMVLPSLSLLGQSPDDFDCTTFTIITTM 
##ITVSGILRFVQESKSANAAQKLMDMVRTTVSCLRDGDADEDAVSPSTSATASPSASASLA
##NFSFEDKAKLTEIQLDSLVVGDIVYLSTGDIVPADVRILSACDLFVNEASLTGESELVEK
##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR
<everything from this point on, except FASTA sequence, is ignored by the parser>
##DINSLSWVLIRFMIVMVPVVLAINGFTKGDW
##end-Protein

I have a partial proposed solution, which consists of:

  1. Requiring that the "##FASTA" be the only thing on the line (although note that this doesn't solve the MetaGeneMark bug, just makes it less likely to be triggered.)
  2. Throwing an exception when what the parser thinks is the FASTA section is not well-formed. In particular, all FASTA sequence must be preceded by a FASTA defline.

Finally, I'll file this as a separate issue, but I think convert_metagenemark_gff_to_gff3.pl should be modified so that it doesn't produce GFF3 that's ambiguous/malformed i.e., no "##FASTA" lines unless it's at the start of a valid FASTA section. One possibility might be to tack an extra "#" on all the echoed comment lines.

jonathancrabtree commented 9 years ago

Closed this prematurely!