Closed jonathancrabtree closed 9 years ago
Debugging revealed that the problem lies with get_gff3_features in biocodegff.py. There are two problems with this function:
The problem in this particular case is that MetaGeneMark prints the protein sequences in comments in the GFF, like this:
##Protein 11140
##MAQFRVDSEQIQQAAAAVGTSVSAIRDAVNGMYTNLQQLQSVWTGSAATQFASTAQQWRA
##AQQQMEQSLEAIQQAMQHASGVYLDAEAQATSLFGMG
##end-Protein
convert_metagenemark_gff_to_gff3.py echoes these comments to the GFF3 file unchanged, which it should not do, because... "FASTA" is a valid amino acid sequence. So all it takes is for "FASTA" to appear at the beginning of any protein line, which yields a line starting with "##FASTA", and biocodegff.py will ignore the entire rest of the file without printing any errors or warnings:
##Protein 10298
##MRMQKVQKKLSETSFQDRLDFAATHSKTSVLRMCNSQCTGLCARDVLRARARFGSNALER
##KKQNSLASRLVQAFINPFSCILFVLALISCINDMVLPSLSLLGQSPDDFDCTTFTIITTM
##ITVSGILRFVQESKSANAAQKLMDMVRTTVSCLRDGDADEDAVSPSTSATASPSASASLA
##NFSFEDKAKLTEIQLDSLVVGDIVYLSTGDIVPADVRILSACDLFVNEASLTGESELVEK
##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR
<everything from this point on, except FASTA sequence, is ignored by the parser>
##DINSLSWVLIRFMIVMVPVVLAINGFTKGDW
##end-Protein
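The failure mode above can be sketched in a few lines. This is a minimal illustration, not the actual code in biocodegff.py: it assumes the parser detects the FASTA section with a prefix test, and both function names are hypothetical. A stricter test that accepts only a bare "##FASTA" line would not be fooled by the echoed protein comment.

```python
def loose_is_fasta_directive(line):
    # Hypothetical prefix test (an assumption about the parser's behaviour):
    # any line beginning with "##FASTA" is treated as the FASTA directive.
    return line.startswith("##FASTA")


def strict_is_fasta_directive(line):
    # Stricter alternative: only "##FASTA" on a line by itself counts.
    return line.rstrip() == "##FASTA"


# Echoed protein comment line taken from the example above.
protein_comment = "##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR\n"
real_directive = "##FASTA\n"

for label, line in (("protein comment", protein_comment),
                    ("real directive", real_directive)):
    print(label, loose_is_fasta_directive(line), strict_is_fasta_directive(line))
```

With the prefix test, the protein comment is indistinguishable from the real directive; with the exact-match test, only the real directive qualifies.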
I have a partial proposed solution, which consists of:
Finally, I'll file this as a separate issue, but I think convert_metagenemark_gff_to_gff3.py should be modified so that it doesn't produce GFF3 that's ambiguous or malformed, i.e., no "##FASTA" lines unless they mark the start of a valid FASTA section. One possibility might be to tack an extra "#" onto all the echoed comment lines.
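A rough sketch of the extra-"#" idea, assuming the converter passes each input line through a small helper before writing it out. The helper name `escape_echoed_comment` is hypothetical and is not part of the existing script.

```python
def escape_echoed_comment(line):
    """Prefix one extra '#' to a '##' comment line echoed from MetaGeneMark.

    Hypothetical helper: after escaping, an echoed protein line can no
    longer begin with the exact text "##FASTA".
    """
    if line.startswith("##"):
        return "#" + line
    return line


echoed = [
    "##Protein 10298\n",
    "##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR\n",
    "##end-Protein\n",
]
escaped = [escape_echoed_comment(line) for line in echoed]
for line in escaped:
    print(line, end="")
```

Lines that do not start with "##" pass through unchanged, so feature rows are unaffected.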
Closed this prematurely!
Running write_fasta_from_gff.py on the output of convert_metagenemark_gff_to_gff3.py, I observed some large discrepancies between the number of CDS features in the GFF3 file and the number of CDS sequences written by write_fasta_from_gff.py. In one case the GFF3 file contained 16042 CDS features, but the FASTA output contained only 10299 sequences, a loss of 5743 CDS sequences, ~36% of the total.
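One way to reproduce this kind of comparison is to count CDS rows in the GFF3 and header lines in the FASTA output. The sketch below is an assumption about how such a check could be done, not the method actually used here; both function names are hypothetical and the sample data is made up for illustration.

```python
def count_cds_features(gff3_lines):
    """Count rows whose type column is 'CDS', stopping at a bare ##FASTA line."""
    count = 0
    for line in gff3_lines:
        if line.rstrip() == "##FASTA":
            break  # the sequence section follows; no more feature rows
        if line.startswith("#"):
            continue  # comments and directives, including echoed protein lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            count += 1
    return count


def count_fasta_records(fasta_lines):
    """Count FASTA records by their '>' header lines."""
    return sum(1 for line in fasta_lines if line.startswith(">"))


# Tiny made-up inputs; real use would iterate over open file handles instead.
gff3 = [
    "##gff-version 3\n",
    "seq1\tGeneMark\tgene\t1\t90\t.\t+\t.\tID=g1\n",
    "seq1\tGeneMark\tCDS\t1\t90\t.\t+\t0\tID=c1;Parent=g1\n",
    "##FASTATKAANICDYENLAFMGTTVISGSAWAVVVSVGAHTMFGTLARALSEKDGETSFSR\n",
    "seq1\tGeneMark\tCDS\t100\t190\t.\t+\t0\tID=c2\n",
]
fasta = [">c1\n", "ATGAAATAA\n"]

missing = count_cds_features(gff3) - count_fasta_records(fasta)
print("CDS features missing from FASTA output:", missing)
```

For the case reported above, the same subtraction gives 16042 - 10299 = 5743.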