lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/
Other
44 stars 11 forks source link

"Picking N random sequences" forever #42

Closed nick-youngblut closed 4 years ago

nick-youngblut commented 4 years ago

I'm running Nonpareil v3.303, and for one particular sample in a publicly available metagenome (accession SRS015133), nonpareil runs Picking N random sequences forever (I've let it run for many hours). The other samples in that bioproject work just fine.

My command:

nonpareil -T kmer   -R 3000    -t 8   -f fastq   -s $INPUT_FASTQ  -b $BASENAME

Even if I use -X 100, the command runs forever. If I append reads from another sample, nonpareil runs successfully. It appears that nonpareil is running a while loop to randomly select sequences, and breaks the loop once enough "good" sequences are found. However, in this case, nonpareil is never able to find enough "good" sequences in sample SRS015133. Reducing the kmer size (-k) to 14 allows the nonpareil to finish successfully, but raising the length any higher will cause an infinite loop. Why is nonpareil having a problem with a kmer length of 24? I don't see any cutoffs for sequence quality.

Here's the first couple of reads from the fastq (No. of reads = 1 mil):

@SRR061182.9 HWUSI-EAS677_102343534:3:1:1031:5135 length=100
CATGAATTGAGAGCNCTGGTCAGTGAATCGCTTCAAGATTTGGAAATGCACTTGCAGAAAGAGGAAAACGTATTGTTCCCTTATTTGTACGATTTGTATG
+
GGGGEGEBGDCCCC!AACA?E=EEEDEDD=EDEBD?EBEEAEDED?ED?EDBEEDAE?DAEEE=EE=E?E@CC,E@DA=CA->C=AA3565==<7:BC-@
@SRR061182.10 HWUSI-EAS677_102343534:3:1:1031:3309 length=100
CTGATTGCTAGTCTNTGCAGCCATTAGGACAGTTGATGAATAATCGTTATTCAACTTATGTTTAGGAGAATATACTGCACCAACCGTCAAAGAACTC
+
DFF:BFFF?FCCCC!6B@@@EAAEBEE:EE?AA:D?=5ADAED;EE?E5EE?A?AEE-AEEEEEE->AA,C=5C::??C@BD-?D57),6<>B4B:@
@SRR061182.11 HWUSI-EAS677_102343534:3:1:1031:3041 length=100
GGTGATTATCAAGGNAGAGAGTGGGAAAGGGAAGAGCAGTCTGCTGAACTTGATATCCCGTTTTTGCGGCACGGATGGCCAGGCGGTCAGCATG
+
EDEDE?E?EB?BAC!;@;C7@,@@;4:?AADBCDCCBB?DB5AB:C?@=B=DABBBC5:DAAB:9B=?,A>A:>55><<:??5::@=???@???
nick-youngblut commented 4 years ago

Any thoughts on this issue? It's still a problem

nick-youngblut commented 4 years ago

I'm guessing the problem is resulting somewhere from within FastqReader::getRandomSeq

gunturus commented 4 years ago

@nick-youngblut I have downloaded SRR061182. It seems the 15th base is always 'N' in all the sequences. So nonpareil runs forever until it finds viable sequence in this case we have no good sequences

nick-youngblut commented 4 years ago

@gunturus thanks for checking that! Maybe it would be helpful to have a break condition in the code in which the loop breaks after a user-specified number of tries?

lmrodriguezr commented 4 years ago

Solved by c155bd134809339fe4ed994975912b6742646d3b