jenniferlu717 / KrakenTools

KrakenTools provides individual scripts to analyze Kraken/Kraken2/Bracken/KrakenUniq output files
GNU General Public License v3.0
297 stars 85 forks source link

Read count doesn't match expected from extract_kraken_reads.py #39

Open sconlan opened 3 years ago

sconlan commented 3 years ago

I am trying to extract the reads from a single taxid at that level (no parents or children). The hierarchy looks like:

  0.01  1898    0       P       2732408         Pisuviricota
  0.01  1898    0       C       2732506           Pisoniviricetes
  0.01  1898    0       O       76804               Nidovirales
  0.01  1898    0       O1      2499399               Cornidovirineae
  0.01  1898    0       F       11118                   Coronaviridae
  0.01  1898    0       F1      2501931                   Orthocoronavirinae
  0.01  1898    0       G       694002                      Betacoronavirus
  0.01  1898    12      G1      2509481                       Embecovirus
  0.01  1886    212     S       694003                          Betacoronavirus 1
  0.01  1539    1539    S1      31631                             Human coronavirus OC43
  0.00  135     135     S1      11128                             Bovine coronavirus <--- I WANT THIS ONE

I would expect that there are 135 reads associated with taxid 11128. That's how many are in the read assignment report too:

17:29 cn0896 classify$ grep -P '\s+11128\s+' VRNA_015.txt | wc -l
135

However, when I try to extract those reads I get:

17:29 cn0896 classify$ /data/Segrelab/bwbin/KrakenTools/extract_kraken_reads.py -k VRNA_015.txt --fastq-output -s1 ../VRNA_015.qc.nohuman.fastq.gz -o VRNA_015.only_11128.fastq -t 11128
PROGRAM START TIME: 06-28-2021 21:31:38
        1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS VRNA_015.txt
        16.09 million reads processed
        102 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
        102 read IDs found (7.04 mill reads processed)
        102 reads printed to file
        Generated file: VRNA_015.only_11128.fastq
PROGRAM END TIME: 06-28-2021 21:35:42
17:35 cn0896 classify$ grep -c '^' VRNA_015.only_11128.fastq | div4.pl 
102

Any idea why I'm getting 102 instead of 135? I get the same error when extracting reads from inner nodes as well but I thought this was a simpler example. I don't see an obvious version flag but I cloned the repo in June of this year.

sconlan commented 3 years ago

After looking at my input files and the extract_kraken_reads.py code I figured out this issue. My preprocessing code failed to append the /1 and /2 tokens to the read names so I ended up with the forward and reverse reads sharing a non-unique name in my interleaved file. That's my bad.

That still didn't explain why I was getting that specific count (102). I think it is because of the condition that has been set to terminate the loop:

        if len(save_readids) == count_output:
            break

When the script reaches the number of expected unique read names (102) it terminates, regardless of whether there are additional reads in the fastq that match save_readids . I'm not sure if there's an easy way to alert users that they've made this error... You could replace the above with something like:

       if len(save_readids)+1 == count_output:
            sys.stdout.write('\tWARNING: input fastq has non-unique read names, did you forget /1|/2?')

That would throw a warning and continue grabbing reads to the end (costing a little extra time on some runs). A better answer might be to figure this out at the beginning and warn the user that they did something boneheaded.

btw: I get the same issue if I use -s or -U to designate the interleaved input fastq instead of -s1.