Closed: richarddhaigh closed this issue 4 years ago
Hi
Just realised I didn't include my command line in the last comment. For the first error it was:
python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq.gz -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq.gz -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
I have also tried another data set, dropping "--include-children" and picking a single strain tax_id, i.e.
[rxh@spectre12 JAM_EMBER]$ python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB3.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB3_L001_R1_001.fastq.gz -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB3_L001_R2_001.fastq.gz -o EMB3_extract_bacterial_reads_S1.fasta -o2 EMB3_extract_bacterial_reads_S2.fasta -t 512566
This gives a different error suggesting my tax_id isn't valid, but it should be, as I lifted it directly from the report: (0.00 799 799 S1 512566 Streptococcus pneumoniae G54)
PROGRAM START TIME: 02-06-2020 17:45:47
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB3.kraken
0 reads processed
Traceback (most recent call last):
  File "extract_kraken_reads.py", line 395, in <module>
    main()
  File "extract_kraken_reads.py", line 266, in main
    [tax_id, read_id] = process_kraken_output(line)
  File "extract_kraken_reads.py", line 90, in process_kraken_output
    tax_id = int(tax_id)
ValueError: invalid literal for int() with base 10: 'Homo sapiens (taxid 9606)'
Thanks
R
Hi,
I am having the same problems:
PROGRAM START TIME: 02-07-2020 08:05:33
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04601_20200205_MS2_S20_L001.kraken
0 reads processed
Traceback (most recent call last):
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 395, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 266, in main
    [tax_id, read_id] = process_kraken_output(line)
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 90, in process_kraken_output
    tax_id = int(tax_id)
ValueError: invalid literal for int() with base 10: 'Klebsiella pneumoniae (taxid 573)'
I tried different datasets and settings; however, the same error pops up every time.
Thank you in advance for any advice.
For the first error, I think my newest script should fix it.
For the second error, did either of you include --use-names when running kraken?
Thanks for getting back to us so quickly.
Just checked my runs and yes I did put --use-names in the command. Is that a problem?
R
No, I just didn't account for that in the script. I only tested on kraken files generated without that flag.
I just added in a couple of things that should fix the problem. Please try with the script I JUST uploaded. Let me know if it still has an error?
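For anyone following along: with --use-names, kraken writes the classification column as a name plus taxid, e.g. "Homo sapiens (taxid 9606)", instead of a bare number, which is what int() was choking on. A minimal sketch of handling both formats; the helper name here is illustrative, but the parsing mirrors process_kraken_output in the full script pasted later in this thread:

def parse_taxid(column):
    # --use-names output looks like "Homo sapiens (taxid 9606)";
    # without it, the column is just "9606".
    if "taxid" in column:
        column = column.split("taxid ")[-1][:-1]  # keep the digits, drop ")"
    return int(column)

print(parse_taxid("9606"))                       # 9606
print(parse_taxid("Homo sapiens (taxid 9606)"))  # 9606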
Just downloaded the new script and re-ran with the original command:
python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq.gz -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq.gz -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
I still get an error, but it is different this time:
PROGRAM START TIME: 02-07-2020 17:32:48
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
Traceback (most recent call last):
  File "extract_kraken_reads.py", line 397, in <module>
    main()
  File "extract_kraken_reads.py", line 211, in main
    if level_id == '-' or len(level_id) > 1:
UnboundLocalError: local variable 'level_id' referenced before assignment
agh, very very very sorry. Try the newest script now?
I'm not sure why all these bugs are in the code now....I must have changed it recently without testing properly.
No problem, thanks for the help.
Just ran it again, and yes, it is now past the block and running step 1.
Thank you so much. I'm going to close this issue, but please open a new one if you have any more problems.
Spoke too soon - it has now stopped with:
PROGRAM START TIME: 02-07-2020 17:42:42
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB3.kraken
36.60 million reads processed
799 read IDs saved
ERROR: sequence file must be FASTA or FASTQ
Doesn't seem to be the right number of IDs - kraken gave me 17790 for this sample.
The failure to find the fastq could be my fault - I will check that my paths actually lead to the right read files - does gz matter for fastq?
What's the report line for the taxid that you're trying to extract?
0.04 17780 21 D 2 Bacteria
Ah, just realised I'm looking at the wrong report, and it is worse than that, as it should be 1.7 million:
4.80 1757432 8581 D 2 Bacteria
Oh, are you still running with the taxid for Strep pneumo? (0.00 799 799 S1 512566 Streptococcus pneumoniae G54)
Ummm yes. Ooops that was dumb.
OK, re-run with -t 2 for bacteria and --include-children on. The kreport line for bacteria in this sample is:
0.04 17780 21 D 2 Bacteria
Doesn't seem to have pulled out the expected number of read IDs - looks like it didn't do the --include-children part - will check my command line again.
And it still crashed on accessing my fastq files, even though I've unzipped them - very odd, as the path is definitely right.
PROGRAM START TIME: 02-07-2020 18:11:55
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
21 read IDs saved
ERROR: sequence file must be FASTA or FASTQ
Thanks for the help. I will have another try tomorrow, as it's after 6 now here.
R
What's the first line of your FASTQ file? That error comes up when it doesn't get a ">" or "@" as the first character.
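For context, the check behind that message peeks at the first character of the sequence file. A small self-contained sketch of the same logic (the function wrapper is mine; the detection matches the script version pasted later in this thread):

import sys

def detect_filetype(path):
    # Peek at the first character: ">" means FASTA, "@" means FASTQ.
    # A gzipped file read as plain text here yields binary garbage instead,
    # which is one way to land on the "must be FASTA or FASTQ" error.
    with open(path, 'rt') as handle:
        first = handle.readline()
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        return "fastq"
    sys.stderr.write("ERROR: sequence file must be FASTA or FASTQ\n")
    sys.exit(1)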
Also, feel free to email me your report file at jennifer.lu717@gmail.com and I can take a look at why the --include-children argument isn't working.
Hi,
Thanks for the correction. The script starts to run now, but shows another error:
gwdu103:5 08:33:41 /home/NGS/NGS_Run26_20200205_CampyColi > python /home/BatchProcessFiles/extract_kraken_reads.py -k kraken_test/KH-04612_20200205_MS2_S21_L001.kraken -s KH-04612_20200205_MS2_S21_L001_R1_001.fastq -s2 KH-04612_20200205_MS2_S21_L001_R2_001.fastq -o TEST.fasta -t 195 --include-children -r kraken_test/KH-04612_20200205_MS2_S21_L001
PROGRAM START TIME: 02-10-2020 07:33:55
>> STEP 0: PARSING REPORT FILE kraken_test/KH-04612_20200205_MS2_S21_L001
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04612_20200205_MS2_S21_L001.kraken
0.61 million reads processed
124019 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
0 read IDs found (610431 reads processed)
0 read IDs found (610431 reads processed)
Traceback (most recent call last):
  File "/home/uni08/BatchProcessFiles/extract_kraken_reads.py", line 402, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 382, in main
    SeqIO.write(save_readids[i], o_file, "fasta")
  File "/usr/users/.conda/envs/biopython/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 556, in write
    for record in sequences:
TypeError: 'int' object is not iterable
Hi
I've sent the kreport to your gmail account.
I now have the script running (though without --include-children working), but like HumbiFred, my set-up is still having problems parsing the read IDs from the fastq into the fasta files.
$ python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
PROGRAM START TIME: 02-10-2020 09:43:08
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
21 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
0 read IDs found (42647627 reads processed)
0 read IDs found (42647627 reads processed)
Traceback (most recent call last):
  File "extract_kraken_reads.py", line 402, in <module>
    main()
  File "extract_kraken_reads.py", line 382, in main
    SeqIO.write(save_readids[i], o_file, "fasta")
  File "/home/r/rxh/miniconda3/envs/reads_extract/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 556, in write
    for record in sequences:
TypeError: 'int' object is not iterable
I notice that both HumbiFred and I are using the latest Biopython under Python 3.8 (miniconda doesn't seem to give you options for anything else), so could that be the problem?
Thanks
I didn't get the report. Can you resend it? (jennifer.lu717@gmail.com)
Also, I don't think so....It has to do with the readIDs. Can you send me the first few lines (10 or so) of your FASTA file and your kraken file? Sometimes the readIDs don't map exactly so the program has a hard time finding them.
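For anyone debugging the same mismatch: kraken prints the bare read ID, while FASTQ headers may carry a trailing /1 or /2 pair suffix, so the lookup has to strip that suffix first (the script version pasted later in this thread does exactly this). A tiny sketch of that normalization, with a hypothetical helper name:

def normalize_read_id(record_id):
    # Strip an old-style /1 or /2 pair suffix so the FASTQ record ID
    # matches the bare read ID kraken reports.
    if record_id.endswith("/1") or record_id.endswith("/2"):
        return record_id[:-2]
    return record_id

# normalize_read_id("A00892:9:HM2WWDMXX:1:1101:2519:1031/1")
#   -> "A00892:9:HM2WWDMXX:1:1101:2519:1031"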
Hi
Have sent kreport again
head EMB2.kraken
C A00892:9:HM2WWDMXX:1:1101:2519:1031 Homo sapiens (taxid 9606) 151|151 0:6 9606:2 0:3 9606:4 0:17 9606:1 0:5 9606:1 0:1 9606:5 0:22 9606:5 0:17 9606:5 0:10 9606:1 0:12 |:| 9606:4 0:25 9606:2 0:18 9606:4 0:8 9606:3 0:21 9606:1 0:10 9606:5 0:16
C A00892:9:HM2WWDMXX:1:1101:4453:1031 Homo sapiens (taxid 9606) 151|151 0:23 9606:1 0:5 9606:5 0:13 9606:4 0:5 9606:2 0:31 9606:1 0:18 9606:2 0:7 |:| 0:10 9606:4 0:6 9606:1 0:5 9606:7 0:15 9606:4 0:33 9606:4 0:5 9606:5 0:10 9606:1 0:7
C A00892:9:HM2WWDMXX:1:1101:7129:1031 Homo sapiens (taxid 9606) 151|151 0:57 9606:2 0:58 |:| 0:105 9606:3 0:9
C A00892:9:HM2WWDMXX:1:1101:7636:1031 Homo sapiens (taxid 9606) 145|145 9606:4 0:7 9606:3 0:5 9606:5 0:9 9606:5 0:1 9606:1 0:5 9606:1 0:11 9606:1 0:5 9606:1 0:1 9606:3 0:31 9606:3 0:9 |:| 0:9 9606:3 0:31 9606:3 0:1 9606:1 0:5 9606:1 0:11 9606:1 0:5 9606:1 0:1 9606:5 0:9 9606:5 0:5 9606:3 0:7 9606:4
C A00892:9:HM2WWDMXX:1:1101:8196:1031 Homo sapiens (taxid 9606) 151|151 0:7 9606:1 0:59 9606:9 0:4 9606:2 0:14 9606:6 0:8 9606:2 0:5 |:| 9606:5 0:7 9606:1 0:2 9606:3 0:28 9606:1 0:8 9606:1 0:1 9606:8 0:6 9606:1 0:1 9606:8 0:2 9606:8 0:1 9606:4 0:2 9606:2 0:8 9606:5 0:4
C A00892:9:HM2WWDMXX:1:1101:8449:1031 Homo sapiens (taxid 9606) 151|151 0:12 9606:2 0:5 9606:7 0:3 9606:2 0:7 9606:5 0:11 9606:5 0:26 9606:7 0:6 9606:2 0:16 9606:1 |:| 9606:1 0:15 9606:4 0:36 9606:3 0:14 9606:1 0:3 9606:5 0:6 9606:4 0:10 9606:9 0:1 9606:5
C A00892:9:HM2WWDMXX:1:1101:10239:1031 Homo sapiens (taxid 9606) 151|151 0:5 9606:5 0:5 9606:3 0:3 9606:6 0:3 9606:1 0:16 9606:1 0:58 9606:5 0:1 9606:5 |:| 0:31 9606:4 0:5 9606:11 0:1 9606:5 0:58 9606:1 0:1
C A00892:9:HM2WWDMXX:1:1101:11360:1031 Homo sapiens (taxid 9606) 151|148 0:1 9606:1 0:37 9606:1 0:19 9606:1 0:3 9606:5 0:2 9606:4 0:3 9606:8 0:9 9606:5 0:18 |:| 0:5 9606:1 0:7 9606:2 0:31 9606:1 0:5 9606:4 0:24 9606:9 0:25
C A00892:9:HM2WWDMXX:1:1101:13946:1031 Homo sapiens (taxid 9606) 148|151 9606:2 0:3 9606:2 0:17 9606:5 0:2 9606:1 0:17 9606:5 0:6 9606:2 0:5 9606:9 0:1 9606:11 0:3 9606:5 0:5 9606:8 0:5 |:| 0:2 9606:1 0:5 9606:1 0:8 9606:5 0:7 9606:1 0:1 9606:5 0:2 9606:5 0:38 9606:2 0:24 9606:3 0:6 9606:1
C A00892:9:HM2WWDMXX:1:1101:14253:1031 Homo sapiens (taxid 9606) 151|151 0:1 9606:2 0:11 9606:5 0:5 9606:4 0:9 9606:5 0:17 9606:3 0:11 9606:2 0:10 9606:2 0:14 9606:8 0:8 |:| 0:13 9606:5 0:23 131567:3 0:14 9606:2 0:13 9606:5 0:39
head EMB2_R1_fastq
@A00892:9:HM2WWDMXX:1:1101:2519:1031 1:N:0:TTACCGAC+CGAATACG
TGCAGGCCTGCAGGTGCTGTTCTTAGAAATCGTAGCTGTTCCCTTCTTTCTCTTTGTTTTTCTTTGGCCACTGTCCTTGGTCCCACACCTGCAGTCACAGTCAAAGGCTATGAGAACAAAGTGCCCTGCCTGTGCTGAGTGAATATGCAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFF
@A00892:9:HM2WWDMXX:1:1101:4453:1031 1:N:0:TTACCGAC+CGAATACG
AGTAAGTGCAAGACACAGGCATGAACATTTTCAGCTCTAGGGAAAGGCACATTCATGTTGGGGAGAGGTGGAGAATGCAGTTTGGCCAGTCTCAGAAGAGTGAGTTTTAAGGCCTAGGAAAGTGATGGAGATGATTGCTTGTTGGAGAGAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF::F,FF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFF
@A00892:9:HM2WWDMXX:1:1101:7129:1031 1:N:0:TTACCGAC+CGAATACG
CGATTCCATTAGATTTTATTTGATTTGATTCGAATCGATTCGATTTGTTTCCATGCAATTCCATGCGATTCCCTTCCATTCCATTCCATTCTGCTTTATTCCATTCCAATCCACTTCACTGCACTTCACCGCATTCCATTCCATTCCATTC
Thanks
Hi,
here the lines of my files:
head KH-04612_20200205_MS2_S21_L001.kraken
C M04091:337:000000000-CPY2G:1:1101:13819:1829 Campylobacter coli (taxid 195) 81|81 194:10 195:25 194:5 195:7 |:| 195:8 194:5 195:25 194:9
U M04091:337:000000000-CPY2G:1:1101:15170:1834 unclassified (taxid 0) 131|131 0:97 |:| 0:97
C M04091:337:000000000-CPY2G:1:1101:16792:1849 Akkermansia muciniphila (taxid 239935) 250|248 0:30 239935:16 0:107 239935:7 0:56 |:| 0:55 641491:5 0:8 239935:5 0:141
C M04091:337:000000000-CPY2G:1:1101:14685:1862 Akkermansia (taxid 239934) 251|250 0:68 239935:5 0:138 1679444:6 |:| 0:190 239935:1 0:25
C M04091:337:000000000-CPY2G:1:1101:13886:1863 Akkermansia muciniphila (taxid 239935) 251|251 0:162 239935:5 0:50 |:| 0:100 239935:5 0:112
C M04091:337:000000000-CPY2G:1:1101:18301:1864 Akkermansia muciniphila (taxid 239935) 251|251 0:186 239935:1 0:30 |:| 0:217
U M04091:337:000000000-CPY2G:1:1101:16851:1883 unclassified (taxid 0) 251|251 0:217 |:| 0:217
U M04091:337:000000000-CPY2G:1:1101:17240:1887 unclassified (taxid 0) 85|85 0:51 |:| 0:51
U M04091:337:000000000-CPY2G:1:1101:18007:1889 unclassified (taxid 0) 250|250 0:216 |:| 0:216
C M04091:337:000000000-CPY2G:1:1101:18769:1896 Campylobacter coli (taxid 195) 251|251 0:105 195:112 |:| 195:6 194:7 195:114 0:90
FASTQ_R1 file:
@M04091:337:000000000-CPY2G:1:1101:13819:1829 1:N:0:21
GATGTACAAGGATCTTTAGAAGCCTTAAAAGCAAGCTTAGAAAAGCTTAGAAATGATGAAATTAAAGTCAATATCATACAC
+
11>>AB@3DF??GGGGGFD13FEGGHBB1DFHHHHHHHHHHHHHHHHHHHGHHHHHHHFHGHHHGGHHHHGHHHHHHGHHH
@M04091:337:000000000-CPY2G:1:1101:15170:1834 1:N:0:21
GGCTGTGGAAGAATTGGGCAACACCGGAATTCTCCCGGAAGAGTTGAAGGGAAAGAGCGCGAAGGAGGTTGAAGCGGCCGTGAAGGAGAAGGCGGAGGAACGGGCACGGCTGCAGAAGAAAATTGCCGATG
+
33>AAFFBDFCFFGGGCFFFCGHDEE?22FGGHHHHGGEGGHHHHHHHHHGHHHHHHHGGGGFGGGHGGHHHHGHGGGGGGGGGHHHHHHHHHGGGGGGHHGGGGGGGGGGGGHHHHHGHHHHHHHHGGGG
@M04091:337:000000000-CPY2G:1:1101:16792:1849 1:N:0:21
TCTCCAGAATGCGGGAGGAAGCGCCGGGAAGCACGCAGCCGATGAGATAGCCCGTCTGGGCGCAGCATTCCAGCAGGTGGCGCAGCACGCAGCCCAGCCGCGGGGCGCTGGCCTCGTCCTTGGCAAGCTGCCAGGGCTGCTGCTGTTCAATGTAGGCGTTGCAGGCGGAAACCTGGGCGTTCAGGGCCGCCAGCGCGTCGGAGACCATGTTCTCGTCCATGAATCTGGAAAACTGGACCACCGCCTGGTC
FASTQ_R2 file:
@M04091:337:000000000-CPY2G:1:1101:13819:1829 2:N:0:21
GTGTATGATATTGACTTTAATTTCATCATTTCTAAGCTTTTCTAAGCTTGCTTTTAAGGCTTCTAAAGATCCTTGTACATC
+
CCCCCFFFFFFFGGGGGGGGGGGHHHHGHHHHHHGHGHHHHHHHHHHHHHHHHHHHHHHFHHHHHHHIHHGHHHHHHHHHH
@M04091:337:000000000-CPY2G:1:1101:15170:1834 2:N:0:21
CATCGGCAATTTTCTTCTGCAGCCGTGCCCGTTCCTCCGCCTTCTCCTTCACGGCCGCTTCAACCTCCTTCGCGCTCTTTCCCTTCAACTCTTCCGGGAGAATTCCGGTGTTGCCCAATTCTTCCACAGCC
+
BBCCCABBCFFFGGGGGGGGGGHHGGGGHHGGHHGHHGGGGGGHHHHHHHHHHEFGGGGGGHHGHHHHHHHHGGGGGHHHHHHHHHGHHHHHHHHGGGGGGHHHHHHCGFFGHHHHGHHHHHHHHHGHGHH
@M04091:337:000000000-CPY2G:1:1101:16792:1849 2:N:0:21
CCGGAAAGGACGCCAATTTTGACGCGGACCGCCTGGTCATGCTGTATAATACGGAACTGGCCAACGATCTGGGCAACCTGTGCAACCGCTCCATCAACATGACCCGCCGTTATTGCGAGTCCGTGCTGCCCGCCCCCGGCGAATACGACGATGACGCTTCCAAGGCCCTGCGCGCAACCATGGACCAGGCGGTGGTCCAGTTTTCCAGATTCATGGACGAGAACATGGTCTCCGACGCGCTGGCGGCC
Thank you for the troubleshooting!
Okay, both --include-children and the FASTQ reading should work now. Please test?
Hi
Sorry, it didn't quite work - the --include-children bit seems right, as it now identified 888 taxids, but it didn't save any read IDs to search.
python extract_kraken_reads_infinity.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
PROGRAM START TIME: 02-10-2020 17:06:29
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
888 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
0 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
0 read IDs found (1 reads processed)
0 read IDs found (1 reads processed)
0 reads printed to file
Generated file: EMB2_extract_bacterial_reads_S1.fasta
Generated file: EMB2_extract_bacterial_reads_S2.fasta
PROGRAM END TIME: 02-10-2020 17:11:44
Okay, now this might work....I had to fix the type of the taxid variables to make them consistent.
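The symptom above - 888 taxids parsed but 0 read IDs saved - is consistent with a type mismatch: if the taxids collected from the report are stored as strings while the kraken lines parse to ints, every dictionary lookup silently misses. A toy illustration of the failure mode (not the script's literal code):

save_taxids = {"1301": 0, "1313": 0}  # keys read from the report as strings
tax_id = 1313                         # parsed from the kraken line as an int
print(tax_id in save_taxids)          # False - the int never matches the str keys

save_taxids = {int(t): 0 for t in ("1301", "1313")}  # one consistent type
print(tax_id in save_taxids)          # True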
Hi
Sorry, bad news. It started well and managed to identify the reads, but something seems to have gone wrong writing to fasta.
python extract_kraken_reads_infinity2.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
PROGRAM START TIME: 02-10-2020 20:19:48
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
888 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
17780 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
17780 read IDs found (42643368 reads processed)
9 read IDs found (42647627 reads processed)
Traceback (most recent call last):
  File "extract_kraken_reads_infinity2.py", line 411, in <module>
    main()
  File "extract_kraken_reads_infinity2.py", line 393, in main
    SeqIO.write(save_seqs2[i], o_file2, "fasta")
KeyError: 'A00892:9:HM2WWDMXX:1:1101:12183:3740'
Thanks
Hi
Just had a look at the fasta output files - the R2 was empty, but the R1 had written something:
>A00892:9:HM2WWDMXX:1:1101:12183:3740
GTCTGAAACTTGAAAGCTGATAATCACGCCAAAGGCGTGTGGTTATTGGCTTTTTTAGCA
AAAAAATAAACGCAAGCTCTCCTGCGTTGTATACTTTAGTCGACCAAAACCAAAATACAA
AGGAGACTTACGTCATGAATATTATAAGACA
Hi,
same in my case:
python /home/BatchProcessFiles/extract_kraken_reads.py -k kraken_test/KH-04612_20200205_MS2_S21_L001.kraken -s KH-04612_20200205_MS2_S21_L001_R1_001.fastq -s2 KH-04612_20200205_MS2_S21_L001_R2_001.fastq -o TEST_r1.fasta -o2 TEST_r2.fasta -t 195 --include-children -r kraken_test/KH-04612_20200205_MS2_S21_L001
PROGRAM START TIME: 02-11-2020 07:10:12
>> STEP 0: PARSING REPORT FILE kraken_test/KH-04612_20200205_MS2_S21_L001
6 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04612_20200205_MS2_S21_L001.kraken
0.61 million reads processed
142676 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
142676 read IDs found (610424 reads processed)
1 read IDs found (610431 reads processed)
Traceback (most recent call last):
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 411, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 393, in main
    SeqIO.write(save_seqs2[i], o_file2, "fasta")
KeyError: 'M04091:337:000000000-CPY2G:1:1101:13819:1829'
I also have something inside the R1 file; however, the R2 is empty, as in richarddhaigh's case:
>M04091:337:000000000-CPY2G:1:1101:13819:1829 <unknown description>
GATGTACAAGGATCTTTAGAAGCCTTAAAAGCAAGCTTAGAAAAGCTTAGAAATGATGAA
ATTAAAGTCAATATCATACAC
Sorry this update took me so long, but I think I found the bug. Please try it again and let me know if you continue to get any other errors?
Hey,
thank you very much! The basic commands are working now. However, the --include-children option is still not working:
(biopython) gwdu103:6 09:21:40 /home/NGS/NGS_Run26_20200205_CampyColi > python /home/BatchProcessFiles/extract_kraken_reads.py -k kraken_test/KH-04612_20200205_MS2_S21_L001.kraken -s KH-04612_20200205_MS2_S21_L001_R1_001.fastq -s2 KH-04612_20200205_MS2_S21_L001_R2_001.fastq -o TEST1exclude.fasta -o2 TEST2exclude.fasta -t 1783257 --exclude --include-children --report kraken_test/KH-04612_20200205_MS2_S21_L001
PROGRAM START TIME: 02-18-2020 08:57:03
>> STEP 0: PARSING REPORT FILE kraken_test/KH-04612_20200205_MS2_S21_L001
101 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04612_20200205_MS2_S21_L001.kraken
0 reads processed
Traceback (most recent call last):
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 411, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 283, in main
    save_taxids[tax_id] += 1
KeyError: 195
Hi
Thanks for the update. I'm currently locked out of the NoMachine interface for our system as IT are "fixing" something, so I can't test the new script just yet, but will do asap.
R
I believe the exclude option was what caused your error this time @HumbiFred. I fixed the script again....and I'm hoping there are no more bugs, but we will see.
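A toy reproduction of the --exclude counting fix (not the script's literal code): in exclude mode the counter dictionary starts empty, so a count has to be created on first sight of a taxid rather than incremented blindly, which is what raised KeyError: 195:

exclude_taxids = {1783257: 0}   # taxids the user asked to exclude
save_taxids = {}                # counts start empty in exclude mode
for tax_id, read_id in [(195, "readA"), (239935, "readB"), (1783257, "readC")]:
    if tax_id in exclude_taxids:
        continue                # skip excluded taxids
    # the fix: create the counter on first sight instead of assuming the key exists
    save_taxids[tax_id] = save_taxids.get(tax_id, 0) + 1
print(save_taxids)              # {195: 1, 239935: 1}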
Hey Jennifer,
now it's working perfectly! Thank you so much for your effort. Very nice and handy tool!
Have a nice day
Hi Jennifer,
the script is running fine with the --exclude flag. However, I am getting the following error with the latest version of your script when --exclude is not set.
/scratch1/lennard/bin/KrakenTools/extract_kraken_reads.py -k ERR1463892_reads.out --include-children -t 2 -o test1.fq -o2 test2.fq -s /scratch1/lennard/Data/Anna/ERR1463892_1.fastq.gz -s2 /scratch1/lennard/Data/Anna/ERR1463892_2.fastq.gz -r ERR1463892.report
PROGRAM START TIME: 02-19-2020 12:19:45
>> STEP 0: PARSING REPORT FILE ERR1463892.report
198 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS ERR1463892_reads.out
0 reads processed
Traceback (most recent call last):
  File "/scratch1/lennard/bin/KrakenTools/extract_kraken_reads.py", line 417, in <module>
    main()
  File "/scratch1/lennard/bin/KrakenTools/extract_kraken_reads.py", line 285, in main
    elif (tax_id not in exclude_taxids) and args.exclude:
UnboundLocalError: local variable 'exclude_taxids' referenced before assignment
Oh I hate those errors. I just added another line that should fix it @eppinglen
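Judging from the script version pasted later in this thread, the added line initialises exclude_taxids unconditionally, so the non-exclude branch can safely reference it. A hypothetical helper showing the idea:

def init_taxid_sets(requested_taxids, exclude):
    # exclude_taxids is always defined, so later references to it
    # never raise UnboundLocalError, even when --exclude is off.
    save_taxids = {int(t): 0 for t in requested_taxids}
    exclude_taxids = {}              # the added line
    if exclude:
        exclude_taxids, save_taxids = save_taxids, {}
    return save_taxids, exclude_taxids

# init_taxid_sets(["2"], exclude=False) -> ({2: 0}, {})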
Hi
Latest version is running fine and --include-children is working correctly for me now. I haven't tried the --exclude function, but will have a go later. It still doesn't like the fastq files being in gz, but that is a minor niggle easily bypassed.
Thanks very much for sorting this out - much appreciated.
R
Thanks for all of the testing....
@richarddhaigh, I'm not sure why it's not happy with the gzipped files. When I test on mine, as long as they have the ".gz" extension, they open fine.
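Worth noting: the script chooses gzip versus plain text purely by filename extension (this matches the version pasted later in this thread), so a file whose name and actual compression disagree would also trip the FASTA/FASTQ check. A sketch of that choice, as a hypothetical wrapper:

import gzip

def open_seq_file(path):
    # Extension-based choice: ".gz" names go through gzip in text mode,
    # everything else is read as plain text.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")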
I have a similar problem to the original post here. I am running the v1.0.1 release with the --include-children option following a kraken2 run with --use-names and --report-minimizer-data options.
I receive this error when running with the --include-children or --include-parents options, but it works fine without these options.
Run with error:
extract_kraken_reads.py --fastq-output --include-children -k $KrakenOut -r $KrakenReport -s $Fasta -o $OutDir/${Strain}_${Target}_${TaxaID}_pacbio.fastq -t $TaxaID
PROGRAM START TIME: 04-14-2021 14:47:57
>> STEP 0: PARSING REPORT FILE analysis/kraken2/pacbio/B.tabaci/MEAM1_BC3/MEAM1_BC3_kraken2_report.txt
Traceback (most recent call last):
File "/mnt/beegfs/home/aa0377h/prog/kraken2/KrakenTools-1.0.1/extract_kraken_reads.py", line 433, in <module>
main()
File "/mnt/beegfs/home/aa0377h/prog/kraken2/KrakenTools-1.0.1/extract_kraken_reads.py", line 223, in main
while level_num != (prev_node.level_num + 1):
AttributeError: 'int' object has no attribute 'level_num'
Successful run without --include-children or --include-parents options:
extract_kraken_reads.py --fastq-output -k $KrakenOut -r $KrakenReport -s $Fasta -o $OutDir/${Strain}_${Target}_${TaxaID}_pacbio.fastq -t $TaxaID
PROGRAM START TIME: 04-14-2021 15:03:43
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS analysis/kraken2/pacbio/B.tabaci/MEAM1_BC3/MEAM1_BC3_kraken2_out.txt
2.31 million reads processed
34183 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
34183 read IDs found (2.31 mill reads processed)
34183 reads printed to file
Generated file: qc_dna/pacbio_filtered/B.tabaci/MEAM1_BC3/R.sp./MEAM1_BC3_R.sp._1182263_pacbio.fastq
PROGRAM END TIME: 04-14-2021 15:21:45
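A guess at the failure mode above, unconfirmed: prev_node starts as the integer sentinel -1 and is only replaced once the report's root line (taxid 1) has been parsed, so if the extra columns written by --report-minimizer-data keep that line from being recognised, the tree walk dereferences the sentinel and dies exactly as shown. A defensive sketch of a guard (hypothetical, not in the script):

class Node:
    def __init__(self, taxid, level_num, parent=None):
        self.taxid = taxid
        self.level_num = level_num
        self.parent = parent

prev_node = -1   # integer sentinel used before the root line is parsed
if not isinstance(prev_node, Node):
    # guard instead of touching prev_node.level_num on the sentinel
    raise SystemExit("ERROR: report root (taxid 1) was never parsed - "
                     "check the report's column format")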
Hi @jenniferlu717 and others,
I am new to bioinformatics, and I got an error when running python extract_kraken_reads.py.
The message was:
  File "extract_kraken_reads.py", line 7
    <!DOCTYPE html>
    ^
SyntaxError: invalid syntax
Can you please advise? Thanks in advance.
Hi vdnadung
Just another kraken user here, but I will see if I can help.
Hard to diagnose just from the error - could you give a bit more detail of what you were trying to do and exactly what command you used?
R
Hi Richard,
That would be great and much appreciated.
I use Ubuntu. From the working directory containing the extract_kraken_reads.py file, the myfile.kraken file, and the read1 and read2 files, I executed: python extract_kraken_reads.py -k myfile.kraken -s1 reads1.fq.gz -s2 reads2.fq.gz -o myfile.reads.fasta -t 11269
The error message popped up as above. I get the same error message even if I just execute python extract_kraken_reads.py on its own.
If you need any other information, please let me know.
Thank you. Dung
Hi Dung
Just a few quick observations from looking at the commands I've used successfully.
I used biopython for running this script rather than "ordinary" python - I can't see anything in my notes, but I guess there was a reason for this. You have two input fastq files but only one output fasta; I would try adding "-o2 myfile.reads.reverse.fasta". Also, I never got the script to work with fastq.gz and had to unzip the files first - I don't know why, but it's obviously something in my set-up, as others haven't had problems - perhaps also worth a try for you.
Hope that helps
R
Hi Richard,
Thank you for your help. I have tried all your advice (loading the biopython module instead of ordinary python, adding an -o2 file, using .fq instead of .fq.gz files), but the problem remains. Again, even if I executed only "python extract_kraken_reads.py", the error message was the same. Can you please send me your extract_kraken_reads.py file to try?
Thank you and best wishes Dung
Hi Dung
This is a version Jennifer re-edited for us back in Sept 2022, and it is the one I was last using in October 2023 - I can see she has updated it again more recently, but I haven't tried that newer version.
######################################################################
#
######################################################################
import os, sys, argparse
import gzip
from time import gmtime
from time import strftime
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
#################################################################################
class Tree(object):
    'Tree node.'
    def __init__(self, taxid, level_num, level_id, children=None, parent=None):
        self.taxid = taxid
        self.level_num = level_num
        self.level_id = level_id
        self.children = []
        self.parent = parent
        if children is not None:
            for child in children:
                self.add_child(child)
    def add_child(self, node):
        assert isinstance(node, Tree)
        self.children.append(node)
#################################################################################
def process_kraken_output(kraken_line):
    l_vals = kraken_line.split('\t')
    if len(l_vals) < 5:
        return [-1, '']
    if "taxid" in l_vals[2]:
        temp = l_vals[2].split("taxid ")[-1]
        tax_id = temp[:-1]
    else:
        tax_id = l_vals[2]
    read_id = l_vals[1]
    if (tax_id == 'A'):
        tax_id = 81077
    else:
        tax_id = int(tax_id)
    return [tax_id, read_id]
def process_kraken_report(report_line):
    l_vals = report_line.strip().split('\t')
    if len(l_vals) < 5:
        return []
    try:
        int(l_vals[1])
    except ValueError:
        return []
    try:
        taxid = int(l_vals[-3])
        level_type = l_vals[-2]
        map_kuniq = {'species':'S', 'genus':'G', 'family':'F',
            'order':'O', 'class':'C', 'phylum':'P', 'superkingdom':'D',
            'kingdom':'K'}
        if level_type not in map_kuniq:
            level_type = '-'
        else:
            level_type = map_kuniq[level_type]
    except ValueError:
        taxid = int(l_vals[-2])
        level_type = l_vals[-3]
    #Get spaces to determine level num
    spaces = 0
    for char in l_vals[-1]:
        if char == ' ':
            spaces += 1
        else:
            break
    level_num = int(spaces/2)
    return [taxid, level_num, level_type]
################################################################################
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-k', dest='kraken_file', required=True,
        help='Kraken output file to parse')
    parser.add_argument('-s', '-s1', '-1', '-U', dest='seq_file1', required=True,
        help='FASTA/FASTQ File containing the raw sequence letters.')
    parser.add_argument('-s2', '-2', dest='seq_file2', default="",
        help='2nd FASTA/FASTQ File containing the raw sequence letters (paired).')
    parser.add_argument('-t', "--taxid", dest='taxid', required=True,
        nargs='+',
        help='Taxonomy ID[s] of reads to extract (space-delimited)')
    parser.add_argument('-o', "--output", dest='output_file', required=True,
        help='Output FASTA/Q file containing the reads and sample IDs')
    parser.add_argument('-o2', "--output2", dest='output_file2', required=False, default='',
        help='Output FASTA/Q file containing the second pair of reads [required for paired input]')
    parser.add_argument('--append', dest='append', action='store_true',
        help='Append the sequences to the end of the output FASTA file specified.')
    parser.add_argument('--noappend', dest='append', action='store_false',
        help='Create a new FASTA file containing sample sequences and IDs \
            (rewrite if existing) [default].')
    parser.add_argument('--max', dest='max_reads', required=False,
        default=100000000, type=int,
        help='Maximum number of reads to save [default: 100,000,000]')
    parser.add_argument('-r', '--report', dest='report_file', required=False,
        default="",
        help='Kraken report file. [required only if --include-parents/children \
            is specified]')
    parser.add_argument('--include-parents', dest="parents", required=False,
        action='store_true', default=False,
        help='Include reads classified at parent levels of the specified taxids')
    parser.add_argument('--include-children', dest='children', required=False,
        action='store_true', default=False,
        help='Include reads classified more specifically than the specified taxids')
    parser.add_argument('--exclude', dest='exclude', required=False,
        action='store_true', default=False,
        help='Instead of finding reads matching specified taxids, finds all reads NOT matching specified taxids')
    parser.add_argument('--fastq-output', dest='fastq_out', required=False,
        action='store_true', default=False,
        help='Print output FASTQ reads [requires input FASTQ, default: output is FASTA]')
    parser.set_defaults(append=False)
    args = parser.parse_args()
    #Start Program
    time = strftime("%m-%d-%Y %H:%M:%S", gmtime())
    sys.stdout.write("PROGRAM START TIME: " + time + '\n')
    #Check input
    if (len(args.output_file2) == 0) and (len(args.seq_file2) > 0):
        sys.stderr.write("Must specify second output file -o2 for paired input\n")
        sys.exit(1)
    #Initialize taxids
    save_taxids = {}
    for tid in args.taxid:
        save_taxids[int(tid)] = 0
    main_lvls = ['R','K','D','P','C','O','F','G','S']
    #STEP 0: READ IN REPORT FILE AND GET ALL TAXIDS
    if args.parents or args.children:
        #check that report file exists
        if args.report_file == "":
            sys.stderr.write(">> ERROR: --report not specified.")
            sys.exit(1)
        sys.stdout.write(">> STEP 0: PARSING REPORT FILE %s\n" % args.report_file)
        #create tree and save nodes with taxids in the list
        base_nodes = {}
        r_file = open(args.report_file, 'r')
        prev_node = -1
        for line in r_file:
            #extract values
            report_vals = process_kraken_report(line)
            if len(report_vals) == 0:
                continue
            [taxid, level_num, level_id] = report_vals
            if taxid == 0:
                continue
            #tree root
            if taxid == 1:
                level_id = 'R'
                root_node = Tree(taxid, level_num, level_id)
                prev_node = root_node
                #save if needed
                if taxid in save_taxids:
                    base_nodes[taxid] = root_node
                continue
            #move to correct parent
            while level_num != (prev_node.level_num + 1):
                prev_node = prev_node.parent
            #determine correct level ID
            if level_id == '-' or len(level_id) > 1:
                if prev_node.level_id in main_lvls:
                    level_id = prev_node.level_id + '1'
                else:
                    num = int(prev_node.level_id[-1]) + 1
                    level_id = prev_node.level_id[:-1] + str(num)
            #make node
            curr_node = Tree(taxid, level_num, level_id, None, prev_node)
            prev_node.add_child(curr_node)
            prev_node = curr_node
            #save if taxid matches
            if taxid in save_taxids:
                base_nodes[taxid] = curr_node
        r_file.close()
        #FOR SAVING PARENTS
        if args.parents:
            #For each node saved, traverse up the tree and save each taxid
            for tid in base_nodes:
                curr_node = base_nodes[tid]
                while curr_node.parent != None:
                    curr_node = curr_node.parent
                    save_taxids[curr_node.taxid] = 0
        #FOR SAVING CHILDREN
        if args.children:
            for tid in base_nodes:
                curr_nodes = base_nodes[tid].children
                while len(curr_nodes) > 0:
                    #For this node
                    curr_n = curr_nodes.pop()
                    if curr_n.taxid not in save_taxids:
                        save_taxids[curr_n.taxid] = 0
                    #Add all children
                    if curr_n.children != None:
                        for child in curr_n.children:
                            curr_nodes.append(child)
    ##############################################################################
    sys.stdout.write("\t%i taxonomy IDs to parse\n" % len(save_taxids))
    sys.stdout.write(">> STEP 1: PARSING KRAKEN FILE FOR READIDS %s\n" % args.kraken_file)
    #Initialize values
    count_kraken = 0
    read_line = -1
    exclude_taxids = {}
    if args.exclude:
        exclude_taxids = save_taxids
        save_taxids = {}
    #PROCESS KRAKEN FILE FOR CLASSIFIED READ IDS
    k_file = open(args.kraken_file, 'r')
    sys.stdout.write('\t0 reads processed')
    sys.stdout.flush()
    #Evaluate each sample in the kraken file
    save_readids = {}
    save_readids2 = {}
    for line in k_file:
        count_kraken += 1
        if (count_kraken % 10000 == 0):
            sys.stdout.write('\r\t%0.2f million reads processed' % float(count_kraken/1000000.))
            sys.stdout.flush()
        #Parse line for results
        [tax_id, read_id] = process_kraken_output(line)
        if tax_id == -1:
            continue
        #Skip if reads are human/artificial/synthetic
        if (tax_id in save_taxids) and not args.exclude:
            save_taxids[tax_id] += 1
            save_readids2[read_id] = 0
            save_readids[read_id] = 0
        elif (tax_id not in exclude_taxids) and args.exclude:
            if tax_id not in save_taxids:
                save_taxids[tax_id] = 1
            else:
                save_taxids[tax_id] += 1
            save_readids2[read_id] = 0
            save_readids[read_id] = 0
        if len(save_readids) >= args.max_reads:
            break
    #Update user
    k_file.close()
    sys.stdout.write('\r\t%0.2f million reads processed\n' % float(count_kraken/1000000.))
    sys.stdout.write('\t%i read IDs saved\n' % len(save_readids))
    ##############################################################################
    #Sequence files
    seq_file1 = args.seq_file1
    seq_file2 = args.seq_file2
    ####TEST IF INPUT IS FASTA OR FASTQ
    if(seq_file1[-3:] == '.gz'):
        s_file1 = gzip.open(seq_file1, 'rt')
    else:
        s_file1 = open(seq_file1, 'rt')
    first = s_file1.readline()
    if len(first) == 0:
        sys.stderr.write("ERROR: sequence file's first line is blank\n")
        sys.exit(1)
    if first[0] == ">":
        filetype = "fasta"
    elif first[0] == "@":
        filetype = "fastq"
    else:
        sys.stderr.write("ERROR: sequence file must be FASTA or FASTQ\n")
        sys.exit(1)
    s_file1.close()
    if filetype != 'fastq' and args.fastq_out:
        sys.stderr.write('ERROR: for FASTQ output, input file must be FASTQ\n')
        sys.exit(1)
    ####ACTUALLY OPEN FILE
    if(seq_file1[-3:] == '.gz'):
        #Zipped Sequence Files
        s_file1 = gzip.open(seq_file1, 'rt')
        if len(seq_file2) > 0:
            s_file2 = gzip.open(seq_file2, 'rt')
    else:
        s_file1 = open(seq_file1, 'r')
        if len(seq_file2) > 0:
            s_file2 = open(seq_file2, 'r')
    #PROCESS INPUT FILE AND SAVE FASTA FILE
    sys.stdout.write(">> STEP 2: READING SEQUENCE FILES AND WRITING READS\n")
    sys.stdout.write('\t0 read IDs found (0 mill reads processed)')
    sys.stdout.flush()
    #Open output file
    if (args.append):
        o_file = open(args.output_file, 'a')
        if args.output_file2 != '':
            o_file2 = open(args.output_file2, 'a')
    else:
        o_file = open(args.output_file, 'w')
        if args.output_file2 != '':
            o_file2 = open(args.output_file2, 'w')
    #Process SEQUENCE 1 file
    count_seqs = 0
    count_output = 0
    for record in SeqIO.parse(s_file1, filetype):
        count_seqs += 1
        #Print update
        if (count_seqs % 1000 == 0):
            sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
            sys.stdout.flush()
        #Check ID
        test_id = str(record.id)
        test_id2 = test_id
        if ("/1" in test_id) or ("/2" in test_id):
            test_id2 = test_id[:-2]
        #Sequence found
        if test_id in save_readids or test_id2 in save_readids:
            count_output += 1
            #Print update
            sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
            sys.stdout.flush()
            #Save to file
            if args.fastq_out:
                SeqIO.write(record, o_file, "fastq")
            else:
                SeqIO.write(record, o_file, "fasta")
        #If no more reads to find
        if len(save_readids) == count_output:
            break
    #Close files
    s_file1.close()
    o_file.close()
    sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)\n' % (count_output, float(count_seqs/1000000.)))
    sys.stdout.flush()
    if len(seq_file2) > 0:
        count_output = 0
        count_seqs = 0
        sys.stdout.write('\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
        sys.stdout.flush()
        for record in SeqIO.parse(s_file2, filetype):
            count_seqs += 1
            #Print update
            if (count_seqs % 1000 == 0):
                sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
                sys.stdout.flush()
            test_id = str(record.id)
            test_id2 = test_id
            if ("/1" in test_id) or ("/2" in test_id):
                test_id2 = test_id[:-2]
            #Sequence found
            if test_id in save_readids or test_id2 in save_readids:
                count_output += 1
                sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
                sys.stdout.flush()
                #Save to file
                if args.fastq_out:
                    SeqIO.write(record, o_file2, "fastq")
                else:
                    SeqIO.write(record, o_file2, "fasta")
            #If no more reads to find
            if len(save_readids) == count_output:
                break
        s_file2.close()
        o_file2.close()
        #End Program
        sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)\n' % (count_output, float(count_seqs/1000000.)))
    #End Program
    sys.stdout.write('\t' + str(count_output) + ' reads printed to file\n')
    sys.stdout.write('\tGenerated file: %s\n' % args.output_file)
    if args.output_file2 != '':
        sys.stdout.write('\tGenerated file: %s\n' % args.output_file2)
    #End of program
    time = strftime("%m-%d-%Y %H:%M:%S", gmtime())
    sys.stdout.write("PROGRAM END TIME: " + time + '\n')
    sys.exit(0)
#################################################################################
if __name__ == "__main__":
    main()
#################################################################################
#################################END OF PROGRAM##################################
#################################################################################
Hi Richard,
Thanks very much. It works now.
To get extract_kraken_reads.py, I followed a document online: right-click on the extract_kraken_reads.py file, then "save as". From the text you sent, I realised it's different from the extract_kraken_reads.py I saved, which looks as below. This explains the error message.
And you are correct that it seems to require biopython.
Many thanks for your help. Best wishes Dung
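For future readers who hit the same <!DOCTYPE html> SyntaxError: right-clicking "save as" on the GitHub page downloads the HTML viewer rather than the script itself. Fetching the raw file avoids that; something like the following should work (URL pattern assumed from the repo layout, adjust the branch as needed):
wget https://raw.githubusercontent.com/jenniferlu717/KrakenTools/master/extract_kraken_reads.py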
Hi
I'm trying to extract all bacterial reads from a paired-end kraken analysis, but I am getting an error when the script tries to parse the kraken.report.txt. I'm running under the most recent version of Biopython and have just updated to your latest script - the error I'm getting is:
PROGRAM START TIME: 02-06-2020 17:27:02
Sorry, my python is pretty poor, so I can't fathom whether it is a script problem or a problem in my report file - any thoughts?
Thanks
Rich