sconlan opened 3 years ago
After looking at my input files and the extract_kraken_reads.py code I figured out this issue. My preprocessing code failed to append the /1 and /2 tokens to the read names so I ended up with the forward and reverse reads sharing a non-unique name in my interleaved file. That's my bad.
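For anyone hitting the same thing, the fix on the preprocessing side is just to re-suffix the headers. A minimal sketch (the function name and the strict 4-line-record assumption are mine, not from extract_kraken_reads.py):

```python
import itertools

def add_mate_suffixes(in_lines, out):
    """Read 4-line FASTQ records from an interleaved file and
    alternate /1 and /2 suffixes on the header lines."""
    suffixes = itertools.cycle(["/1", "/2"])
    record = []
    for line in in_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            # keep only the name token, then tag it with /1 or /2
            out.append(header.split()[0] + next(suffixes))
            out.extend([seq, plus, qual])
            record = []
    return out

# usage: two mates sharing the non-unique name @read1
interleaved = [
    "@read1", "ACGT", "+", "IIII",
    "@read1", "TGCA", "+", "IIII",
]
fixed = add_mate_suffixes(interleaved, [])
# headers become @read1/1 and @read1/2
```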
That still didn't explain why I was getting that specific count (102). I think it comes from the condition used to terminate the loop:
    if len(save_readids) == count_output:
        break
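Here is a toy reproduction of that early exit; the names and counts are illustrative (two unique names standing in for my 102), not taken from the real script:

```python
save_readids = {"readA", "readB"}              # unique names from Kraken
fastq_names = ["readA", "readA", "readB", "readB"]  # interleaved, no /1|/2

count_output = 0
extracted = []
for name in fastq_names:
    if name in save_readids:
        extracted.append(name)
        count_output += 1
    if len(save_readids) == count_output:      # 2 == 2 after two writes
        break

# extracted == ["readA", "readA"]: the loop stops before either readB copy
```

With unique names this condition is harmless, but with duplicates it fires after the number of *unique* names has been written, exactly like getting 102 instead of 135.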
When the script reaches the number of expected unique read names (102), it terminates, regardless of whether there are additional reads in the fastq that match save_readids. I'm not sure if there's an easy way to alert users that they've made this error... You could replace the above with something like:
    if len(save_readids)+1 == count_output:
        sys.stdout.write('\tWARNING: input fastq has non-unique read names, did you forget /1|/2?\n')
That would throw a warning and continue grabbing reads to the end (costing a little extra time on some runs). A better answer might be to figure this out at the beginning and warn the user that they did something boneheaded.
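The "figure this out at the beginning" idea could look roughly like this: scan the headers once before extraction and warn if any name repeats. The function name and warning text are hypothetical, not part of the script:

```python
import sys

def warn_on_duplicate_names(fastq_lines):
    """Scan FASTQ header lines (every 4th line) and report any
    read names that occur more than once."""
    seen, dupes = set(), set()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:                      # header line of each record
            name = line.split()[0]
            if name in seen:
                dupes.add(name)
            seen.add(name)
    if dupes:
        sys.stderr.write(
            "\tWARNING: %d non-unique read names in input fastq; "
            "did you forget /1|/2?\n" % len(dupes))
    return dupes

# usage: two records sharing the header @r1
dupes = warn_on_duplicate_names(
    ["@r1", "ACGT", "+", "IIII", "@r1", "TGCA", "+", "IIII"])
# dupes == {"@r1"}
```

This costs one extra pass over the file, but it catches the mistake before any reads are extracted rather than after a silently short output.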
btw: I get the same issue if I use -s or -U to designate the interleaved input fastq instead of -s1.
I am trying to extract the reads from a single taxid at that level (no parents or children). The hierarchy looks like:
I would expect that there are 135 reads associated with taxid 11128. That's how many are in the read assignment report too:
However, when I try to extract those reads I get:
Any idea why I'm getting 102 instead of 135? I get the same error when extracting reads from inner nodes as well, but I thought this was a simpler example. I don't see an obvious version flag, but I cloned the repo in June of this year.