hyattpd / Prodigal

Prodigal Gene Prediction Software
GNU General Public License v3.0
432 stars 85 forks source link

Genes are predicted across N stretches even when providing the -m flag #80

Open fpusan opened 4 years ago

fpusan commented 4 years ago

Hi guys, hope the week is going well!

As per the title, prodigal v2.6.3 is predicting genes across N stretches even when supplied with the -m flag. According to the documentation -m should make prodigal treat Ns as masked segments of the sequence.

A reproducible example would be:

echo -e ">test\nCCAGACCAAGCGCCGCCAGCATCAGGCTGGACGGCTCCGGGATGACCTGGATGTCCATCGCGAATTGGCCCGTCGTCCCGTCAAGGAAGAAGAAACCCGAGATGTCGCCGTTGTAGAAATCAACCGGCTGGTTGGTATACTCCTGGCCAATGATATTGCCAAGATCGTTCGTACCCGTGAGGGAGATAAACCCGGTCGAGACCTCGACGTCCGTGTCCGTATTTAAGAGGTCAAACCCGGGATCGATGTTGGCATCGTTGTCGAACGAGACCAGAATCGGCGCTGCGGACGCGCTGATCGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" | prodigal -p meta -m -f gff`

and its output is:

-------------------------------------
PRODIGAL v2.6.3 [February, 2016]         
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.     
-------------------------------------
Request:  Metagenomic, Phase:  Training
Initializing training files...done!
-------------------------------------
Request:  Metagenomic, Phase:  Gene Finding
Finding genes in sequence #1 (371 bp)...done!
##gff-version  3
# Sequence Data: seqnum=1;seqlen=371;seqhdr="test"
# Model Data: version=Prodigal.v2.6.3;run_type=Metagenomic;model="41|Shigella_dysenteriae_Sd197|B|51.2|11|1";gc_cont=51.20;transl_table=11;uses_sd=1
test    Prodigal_v2.6.3 CDS 2   370 37.3    -   0   ID=1_1;partial=11;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.656;conf=99.98;score=37.28;cscore=35.68;sscore=1.61;rscore=0.00;uscore=0.00;tscore=1.61;

Note the predicted CDS going from 2 to 370, even if Ns start at position 301 of the input sequence.

Is this the expected behaviour?

Best regards,

Fernando

GabeAl commented 3 years ago

Same issue, and this was also reported a lot here and various places on the web.

Since I don't think there will ever be a v3.0 (a coming v3.0 is usually cited as the reason this rather severe bug never gets fixed), perhaps one of us should make a pull request to fix this. :+1:

If I have time maybe I'll try diving in...

GabeAl commented 3 years ago

See this other comment as well (shouldn't have been closed IMO): https://github.com/hyattpd/Prodigal/issues/30

So I'm doing some digging and at first glance I see -m triggers the global "do_mask" which just tallies the integer for "number masked" (nm) via the start and end positions.

I guess it all gets lumped into one variable controlling a mask penalty. If it's short enough, it won't actually do what the behavior promised on the main page. Something reminds me there was a threshold of 50 (like, there have to be 50 masked or such before it actually stops looking).

But all this is just coming together for me, not sure yet. If it's a simple penalty threshold I'll just bump the increment (i.e. from +1 to +5 or some such until it finally stops looking across N's).

GabeAl commented 3 years ago

Okay so after printing some debug statements it looks like it never populates the mask variable at all. Literally nothing gets added at any point to the mask variable despite the code starting to make the mask (it never completes a mask).

This is indeed because MASK_SIZE is set to 50 (which is a bit too high IMO), which is the threshold for adding a mask. I've changed it to 16.

In highly fragmented but linearized genomes this may bump the number of masks to a very high value (>5000 easily if there were as many contigs). So I've bumped this up as well. Unfortunately the algorithm for mask detection is brute force search (ideally something smarter like a hash or binary search would be used here) so if you get up to a high number of masks, you're out of luck.

Simplest was to qsort the ranges using the start-of-range, then fuzzy binary search to find nearest start range of query (<20 checks for a million ranges!), and continue checking forward linearly until the start goes out of range (only 2-3 checks typically). When you have thousands of contig-join-sites in very large genomes, this can speed things up quite a bit. Also, replacing the multiple calls to this function throughout multiple if-else statements to just one is trivial and reduces the # of calls to the function.