Closed: richarddhaigh closed this issue 4 years ago
Hi
Just realised I didn't include my command line in the last comment. For the first error it was:
python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq.gz -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq.gz -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
I have also tried another data set, dropping "--include-children" and picking a single strain tax_id, i.e.
[rxh@spectre12 JAM_EMBER]$ python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB3.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB3_L001_R1_001.fastq.gz -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB3_L001_R2_001.fastq.gz -o EMB3_extract_bacterial_reads_S1.fasta -o2 EMB3_extract_bacterial_reads_S2.fasta -t 512566
This gives a different error suggesting my tax_id isn't valid, but it should be, as I lifted it directly from the report: (0.00 799 799 S1 512566 Streptococcus pneumoniae G54)
PROGRAM START TIME: 02-06-2020 17:45:47
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB3.kraken
0 reads processed
Traceback (most recent call last):
  File "extract_kraken_reads.py", line 395, in <module>
    main()
  File "extract_kraken_reads.py", line 266, in main
    [tax_id, read_id] = process_kraken_output(line)
  File "extract_kraken_reads.py", line 90, in process_kraken_output
    tax_id = int(tax_id)
ValueError: invalid literal for int() with base 10: 'Homo sapiens (taxid 9606)'
Thanks
R
Hi,
I am having the same problems:
PROGRAM START TIME: 02-07-2020 08:05:33
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04601_20200205_MS2_S20_L001.kraken
0 reads processed
Traceback (most recent call last):
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 395, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 266, in main
    [tax_id, read_id] = process_kraken_output(line)
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 90, in process_kraken_output
    tax_id = int(tax_id)
ValueError: invalid literal for int() with base 10: 'Klebsiella pneumoniae (taxid 573)'
I tried different datasets and settings; however, the same error pops up every time.
Thank you in advance for any advice.
For the first error, I think my newest script should fix it.
For the second error, did either of you include --use-names when running kraken?
Thanks for getting back to us so quickly.
Just checked my runs and yes I did put --use-names in the command. Is that a problem?
R
No, I just didn't account for that in the script. I only tested on kraken files generated without that flag.
I just added in a couple of things that should fix the problem. Please try with the script I JUST uploaded. Let me know if it still has an error?
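For anyone following along: with --use-names, kraken writes the classification column as a name plus taxid, e.g. "Homo sapiens (taxid 9606)", instead of a bare number, which is what int() was choking on. A minimal sketch of handling both formats; the helper name here is illustrative, but the parsing mirrors process_kraken_output in the full script pasted later in this thread:

def parse_taxid(column):
    # --use-names output looks like "Homo sapiens (taxid 9606)";
    # without it, the column is just "9606".
    if "taxid" in column:
        column = column.split("taxid ")[-1][:-1]  # keep the digits, drop ")"
    return int(column)

print(parse_taxid("9606"))                       # 9606
print(parse_taxid("Homo sapiens (taxid 9606)"))  # 9606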
Just downloaded the new script and re-ran with the original command:
python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq.gz -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq.gz -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
I still get an error, but it is different this time:
PROGRAM START TIME: 02-07-2020 17:32:48
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
Traceback (most recent call last):
  File "extract_kraken_reads.py", line 397, in <module>
    main()
  File "extract_kraken_reads.py", line 211, in main
    if level_id == '-' or len(level_id) > 1:
UnboundLocalError: local variable 'level_id' referenced before assignment
agh, very very very sorry. Try the newest script now?
I'm not sure why all these bugs are in the code now....I must have changed it recently without testing properly.
No problem, thanks for the help.
Just ran it again, and yes, it is now past the block and running step 1.
Thank you so much. I'm going to close this issue, but please open a new one if you have any more problems.
Spoke too soon - it has now stopped with:
PROGRAM START TIME: 02-07-2020 17:42:42
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB3.kraken
36.60 million reads processed
799 read IDs saved
ERROR: sequence file must be FASTA or FASTQ
Doesn't seem to be the right number of IDs - kraken gave me 17790 for this sample.
The failure to find the fastq could be my fault - I will check that my paths actually lead to the right read files - does gz matter for fastq?
What's the report line for the taxid that you're trying to extract?
0.04 17780 21 D 2 Bacteria
Ah, just realised I'm looking at the wrong report, and it is worse than that, as it should be 1.7 million:
4.80 1757432 8581 D 2 Bacteria
Oh, are you still running with the taxid for Strep pneumo? (0.00 799 799 S1 512566 Streptococcus pneumoniae G54)
Ummm yes. Ooops that was dumb.
OK, re-run with -t 2 for bacteria and --include-children on. The kreport line for bacteria in this sample is:
0.04 17780 21 D 2 Bacteria
Doesn't seem to have pulled out the expected number of read IDs - looks like it didn't do the --include-children part - will check my command line again.
And it still crashed on accessing my fastq files, even though I've unzipped them - very odd, as the path is definitely right.
PROGRAM START TIME: 02-07-2020 18:11:55
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
21 read IDs saved
ERROR: sequence file must be FASTA or FASTQ
Thanks for the help. I will have another try tomorrow, as it's after 6 now here.
R
What's the first line of your FASTQ file? That error comes up when it doesn't get a ">" or "@" as the first character.
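For context, the check behind that message peeks at the first character of the sequence file. A small self-contained sketch of the same logic (the function wrapper is mine; the detection matches the script version pasted later in this thread):

import sys

def detect_filetype(path):
    # Peek at the first character: ">" means FASTA, "@" means FASTQ.
    # A gzipped file read as plain text here yields binary garbage instead,
    # which is one way to land on the "must be FASTA or FASTQ" error.
    with open(path, 'rt') as handle:
        first = handle.readline()
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        return "fastq"
    sys.stderr.write("ERROR: sequence file must be FASTA or FASTQ\n")
    sys.exit(1)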
Also, feel free to email me your report file at jennifer.lu717@gmail.com and I can take a look at why the --include-children argument isn't working.
Hi,
Thanks for the correction. The script starts to run now, but shows another error:
gwdu103:5 08:33:41 /home/NGS/NGS_Run26_20200205_CampyColi > python /home/BatchProcessFiles/extract_kraken_reads.py -k kraken_test/KH-04612_20200205_MS2_S21_L001.kraken -s KH-04612_20200205_MS2_S21_L001_R1_001.fastq -s2 KH-04612_20200205_MS2_S21_L001_R2_001.fastq -o TEST.fasta -t 195 --include-children -r kraken_test/KH-04612_20200205_MS2_S21_L001
PROGRAM START TIME: 02-10-2020 07:33:55
>> STEP 0: PARSING REPORT FILE kraken_test/KH-04612_20200205_MS2_S21_L001
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04612_20200205_MS2_S21_L001.kraken
0.61 million reads processed
124019 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
0 read IDs found (610431 reads processed)
0 read IDs found (610431 reads processed)
Traceback (most recent call last):
  File "/home/uni08/BatchProcessFiles/extract_kraken_reads.py", line 402, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 382, in main
    SeqIO.write(save_readids[i], o_file, "fasta")
  File "/usr/users/.conda/envs/biopython/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 556, in write
    for record in sequences:
TypeError: 'int' object is not iterable
Hi
I've sent the kreport to your gmail account.
I now have the script running (though without --include-children working), but like HumbiFred, my set-up is still having problems parsing the read IDs from the fastq into the fasta files.
$ python extract_kraken_reads.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
PROGRAM START TIME: 02-10-2020 09:43:08
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
21 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
0 read IDs found (42647627 reads processed)
0 read IDs found (42647627 reads processed)
Traceback (most recent call last):
  File "extract_kraken_reads.py", line 402, in <module>
    main()
  File "extract_kraken_reads.py", line 382, in main
    SeqIO.write(save_readids[i], o_file, "fasta")
  File "/home/r/rxh/miniconda3/envs/reads_extract/lib/python3.8/site-packages/Bio/SeqIO/__init__.py", line 556, in write
    for record in sequences:
TypeError: 'int' object is not iterable
I notice that both HumbiFred and I are using the latest Biopython under Python 3.8 (miniconda doesn't seem to give you options for anything else), so could that be the problem?
Thanks
I didn't get the report. Can you resend it? (jennifer.lu717@gmail.com)
Also, I don't think so....It has to do with the readIDs. Can you send me the first few lines (10 or so) of your FASTA file and your kraken file? Sometimes the readIDs don't map exactly so the program has a hard time finding them.
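For anyone debugging the same mismatch: kraken prints the bare read ID, while FASTQ headers may carry a trailing /1 or /2 pair suffix, so the lookup has to strip that suffix first (the script version pasted later in this thread does exactly this). A tiny sketch of that normalization, with a hypothetical helper name:

def normalize_read_id(record_id):
    # Strip an old-style /1 or /2 pair suffix so the FASTQ record ID
    # matches the bare read ID kraken reports.
    if record_id.endswith("/1") or record_id.endswith("/2"):
        return record_id[:-2]
    return record_id

# normalize_read_id("A00892:9:HM2WWDMXX:1:1101:2519:1031/1")
#   -> "A00892:9:HM2WWDMXX:1:1101:2519:1031"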
Hi
Have sent kreport again
head EMB2.kraken
C A00892:9:HM2WWDMXX:1:1101:2519:1031 Homo sapiens (taxid 9606) 151|151 0:6 9606:2 0:3 9606:4 0:17 9606:1 0:5 9606:1 0:1 9606:5 0:22 9606:5 0:17 9606:5 0:10 9606:1 0:12 |:| 9606:4 0:25 9606:2 0:18 9606:4 0:8 9606:3 0:21 9606:1 0:10 9606:5 0:16
C A00892:9:HM2WWDMXX:1:1101:4453:1031 Homo sapiens (taxid 9606) 151|151 0:23 9606:1 0:5 9606:5 0:13 9606:4 0:5 9606:2 0:31 9606:1 0:18 9606:2 0:7 |:| 0:10 9606:4 0:6 9606:1 0:5 9606:7 0:15 9606:4 0:33 9606:4 0:5 9606:5 0:10 9606:1 0:7
C A00892:9:HM2WWDMXX:1:1101:7129:1031 Homo sapiens (taxid 9606) 151|151 0:57 9606:2 0:58 |:| 0:105 9606:3 0:9
C A00892:9:HM2WWDMXX:1:1101:7636:1031 Homo sapiens (taxid 9606) 145|145 9606:4 0:7 9606:3 0:5 9606:5 0:9 9606:5 0:1 9606:1 0:5 9606:1 0:11 9606:1 0:5 9606:1 0:1 9606:3 0:31 9606:3 0:9 |:| 0:9 9606:3 0:31 9606:3 0:1 9606:1 0:5 9606:1 0:11 9606:1 0:5 9606:1 0:1 9606:5 0:9 9606:5 0:5 9606:3 0:7 9606:4
C A00892:9:HM2WWDMXX:1:1101:8196:1031 Homo sapiens (taxid 9606) 151|151 0:7 9606:1 0:59 9606:9 0:4 9606:2 0:14 9606:6 0:8 9606:2 0:5 |:| 9606:5 0:7 9606:1 0:2 9606:3 0:28 9606:1 0:8 9606:1 0:1 9606:8 0:6 9606:1 0:1 9606:8 0:2 9606:8 0:1 9606:4 0:2 9606:2 0:8 9606:5 0:4
C A00892:9:HM2WWDMXX:1:1101:8449:1031 Homo sapiens (taxid 9606) 151|151 0:12 9606:2 0:5 9606:7 0:3 9606:2 0:7 9606:5 0:11 9606:5 0:26 9606:7 0:6 9606:2 0:16 9606:1 |:| 9606:1 0:15 9606:4 0:36 9606:3 0:14 9606:1 0:3 9606:5 0:6 9606:4 0:10 9606:9 0:1 9606:5
C A00892:9:HM2WWDMXX:1:1101:10239:1031 Homo sapiens (taxid 9606) 151|151 0:5 9606:5 0:5 9606:3 0:3 9606:6 0:3 9606:1 0:16 9606:1 0:58 9606:5 0:1 9606:5 |:| 0:31 9606:4 0:5 9606:11 0:1 9606:5 0:58 9606:1 0:1
C A00892:9:HM2WWDMXX:1:1101:11360:1031 Homo sapiens (taxid 9606) 151|148 0:1 9606:1 0:37 9606:1 0:19 9606:1 0:3 9606:5 0:2 9606:4 0:3 9606:8 0:9 9606:5 0:18 |:| 0:5 9606:1 0:7 9606:2 0:31 9606:1 0:5 9606:4 0:24 9606:9 0:25
C A00892:9:HM2WWDMXX:1:1101:13946:1031 Homo sapiens (taxid 9606) 148|151 9606:2 0:3 9606:2 0:17 9606:5 0:2 9606:1 0:17 9606:5 0:6 9606:2 0:5 9606:9 0:1 9606:11 0:3 9606:5 0:5 9606:8 0:5 |:| 0:2 9606:1 0:5 9606:1 0:8 9606:5 0:7 9606:1 0:1 9606:5 0:2 9606:5 0:38 9606:2 0:24 9606:3 0:6 9606:1
C A00892:9:HM2WWDMXX:1:1101:14253:1031 Homo sapiens (taxid 9606) 151|151 0:1 9606:2 0:11 9606:5 0:5 9606:4 0:9 9606:5 0:17 9606:3 0:11 9606:2 0:10 9606:2 0:14 9606:8 0:8 |:| 0:13 9606:5 0:23 131567:3 0:14 9606:2 0:13 9606:5 0:39
head EMB2_R1_fastq
@A00892:9:HM2WWDMXX:1:1101:2519:1031 1:N:0:TTACCGAC+CGAATACG
TGCAGGCCTGCAGGTGCTGTTCTTAGAAATCGTAGCTGTTCCCTTCTTTCTCTTTGTTTTTCTTTGGCCACTGTCCTTGGTCCCACACCTGCAGTCACAGTCAAAGGCTATGAGAACAAAGTGCCCTGCCTGTGCTGAGTGAATATGCAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFF
@A00892:9:HM2WWDMXX:1:1101:4453:1031 1:N:0:TTACCGAC+CGAATACG
AGTAAGTGCAAGACACAGGCATGAACATTTTCAGCTCTAGGGAAAGGCACATTCATGTTGGGGAGAGGTGGAGAATGCAGTTTGGCCAGTCTCAGAAGAGTGAGTTTTAAGGCCTAGGAAAGTGATGGAGATGATTGCTTGTTGGAGAGAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF::F,FF,FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFF
@A00892:9:HM2WWDMXX:1:1101:7129:1031 1:N:0:TTACCGAC+CGAATACG
CGATTCCATTAGATTTTATTTGATTTGATTCGAATCGATTCGATTTGTTTCCATGCAATTCCATGCGATTCCCTTCCATTCCATTCCATTCTGCTTTATTCCATTCCAATCCACTTCACTGCACTTCACCGCATTCCATTCCATTCCATTC
Thanks
Hi,
here the lines of my files:
head KH-04612_20200205_MS2_S21_L001.kraken
C M04091:337:000000000-CPY2G:1:1101:13819:1829 Campylobacter coli (taxid 195) 81|81 194:10 195:25 194:5 195:7 |:| 195:8 194:5 195:25 194:9
U M04091:337:000000000-CPY2G:1:1101:15170:1834 unclassified (taxid 0) 131|131 0:97 |:| 0:97
C M04091:337:000000000-CPY2G:1:1101:16792:1849 Akkermansia muciniphila (taxid 239935) 250|248 0:30 239935:16 0:107 239935:7 0:56 |:| 0:55 641491:5 0:8 239935:5 0:141
C M04091:337:000000000-CPY2G:1:1101:14685:1862 Akkermansia (taxid 239934) 251|250 0:68 239935:5 0:138 1679444:6 |:| 0:190 239935:1 0:25
C M04091:337:000000000-CPY2G:1:1101:13886:1863 Akkermansia muciniphila (taxid 239935) 251|251 0:162 239935:5 0:50 |:| 0:100 239935:5 0:112
C M04091:337:000000000-CPY2G:1:1101:18301:1864 Akkermansia muciniphila (taxid 239935) 251|251 0:186 239935:1 0:30 |:| 0:217
U M04091:337:000000000-CPY2G:1:1101:16851:1883 unclassified (taxid 0) 251|251 0:217 |:| 0:217
U M04091:337:000000000-CPY2G:1:1101:17240:1887 unclassified (taxid 0) 85|85 0:51 |:| 0:51
U M04091:337:000000000-CPY2G:1:1101:18007:1889 unclassified (taxid 0) 250|250 0:216 |:| 0:216
C M04091:337:000000000-CPY2G:1:1101:18769:1896 Campylobacter coli (taxid 195) 251|251 0:105 195:112 |:| 195:6 194:7 195:114 0:90
FASTQ_R1 file:
@M04091:337:000000000-CPY2G:1:1101:13819:1829 1:N:0:21
GATGTACAAGGATCTTTAGAAGCCTTAAAAGCAAGCTTAGAAAAGCTTAGAAATGATGAAATTAAAGTCAATATCATACAC
+
11>>AB@3DF??GGGGGFD13FEGGHBB1DFHHHHHHHHHHHHHHHHHHHGHHHHHHHFHGHHHGGHHHHGHHHHHHGHHH
@M04091:337:000000000-CPY2G:1:1101:15170:1834 1:N:0:21
GGCTGTGGAAGAATTGGGCAACACCGGAATTCTCCCGGAAGAGTTGAAGGGAAAGAGCGCGAAGGAGGTTGAAGCGGCCGTGAAGGAGAAGGCGGAGGAACGGGCACGGCTGCAGAAGAAAATTGCCGATG
+
33>AAFFBDFCFFGGGCFFFCGHDEE?22FGGHHHHGGEGGHHHHHHHHHGHHHHHHHGGGGFGGGHGGHHHHGHGGGGGGGGGHHHHHHHHHGGGGGGHHGGGGGGGGGGGGHHHHHGHHHHHHHHGGGG
@M04091:337:000000000-CPY2G:1:1101:16792:1849 1:N:0:21
TCTCCAGAATGCGGGAGGAAGCGCCGGGAAGCACGCAGCCGATGAGATAGCCCGTCTGGGCGCAGCATTCCAGCAGGTGGCGCAGCACGCAGCCCAGCCGCGGGGCGCTGGCCTCGTCCTTGGCAAGCTGCCAGGGCTGCTGCTGTTCAATGTAGGCGTTGCAGGCGGAAACCTGGGCGTTCAGGGCCGCCAGCGCGTCGGAGACCATGTTCTCGTCCATGAATCTGGAAAACTGGACCACCGCCTGGTC
FASTQ_R2 file:
@M04091:337:000000000-CPY2G:1:1101:13819:1829 2:N:0:21
GTGTATGATATTGACTTTAATTTCATCATTTCTAAGCTTTTCTAAGCTTGCTTTTAAGGCTTCTAAAGATCCTTGTACATC
+
CCCCCFFFFFFFGGGGGGGGGGGHHHHGHHHHHHGHGHHHHHHHHHHHHHHHHHHHHHHFHHHHHHHIHHGHHHHHHHHHH
@M04091:337:000000000-CPY2G:1:1101:15170:1834 2:N:0:21
CATCGGCAATTTTCTTCTGCAGCCGTGCCCGTTCCTCCGCCTTCTCCTTCACGGCCGCTTCAACCTCCTTCGCGCTCTTTCCCTTCAACTCTTCCGGGAGAATTCCGGTGTTGCCCAATTCTTCCACAGCC
+
BBCCCABBCFFFGGGGGGGGGGHHGGGGHHGGHHGHHGGGGGGHHHHHHHHHHEFGGGGGGHHGHHHHHHHHGGGGGHHHHHHHHHGHHHHHHHHGGGGGGHHHHHHCGFFGHHHHGHHHHHHHHHGHGHH
@M04091:337:000000000-CPY2G:1:1101:16792:1849 2:N:0:21
CCGGAAAGGACGCCAATTTTGACGCGGACCGCCTGGTCATGCTGTATAATACGGAACTGGCCAACGATCTGGGCAACCTGTGCAACCGCTCCATCAACATGACCCGCCGTTATTGCGAGTCCGTGCTGCCCGCCCCCGGCGAATACGACGATGACGCTTCCAAGGCCCTGCGCGCAACCATGGACCAGGCGGTGGTCCAGTTTTCCAGATTCATGGACGAGAACATGGTCTCCGACGCGCTGGCGGCC
Thank you for the troubleshooting!
Okay, both --include-children and the FASTQ reading should work now. Please test?
Hi
Sorry, it didn't quite work - the --include-children bit seems right, as it now identified 888 taxids, but it didn't save any read IDs to search.
python extract_kraken_reads_infinity.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
PROGRAM START TIME: 02-10-2020 17:06:29
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
888 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
0 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
0 read IDs found (1 reads processed)
0 read IDs found (1 reads processed)
0 reads printed to file
Generated file: EMB2_extract_bacterial_reads_S1.fasta
Generated file: EMB2_extract_bacterial_reads_S2.fasta
PROGRAM END TIME: 02-10-2020 17:11:44
Okay, now this might work....I had to fix the type of the taxid variables to make them consistent.
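The symptom above - 888 taxids parsed but 0 read IDs saved - is consistent with a type mismatch: if the taxids collected from the report are stored as strings while the kraken lines parse to ints, every dictionary lookup silently misses. A toy illustration of the failure mode (not the script's literal code):

save_taxids = {"1301": 0, "1313": 0}  # keys read from the report as strings
tax_id = 1313                         # parsed from the kraken line as an int
print(tax_id in save_taxids)          # False - the int never matches the str keys

save_taxids = {int(t): 0 for t in ("1301", "1313")}  # one consistent type
print(tax_id in save_taxids)          # True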
Hi
Sorry, bad news. It started well and managed to identify the reads, but something seems to have gone wrong writing to fasta.
python extract_kraken_reads_infinity2.py -k //data/strepgen/JAM_EMBER_kraken/EMB2.kraken -s1 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R1_001.fastq -s2 //scratch/strepgen/rxh/JAM_EMBER/original_reads/EMB2_L001_R2_001.fastq -o EMB2_extract_bacterial_reads_S1.fasta -o2 EMB2_extract_bacterial_reads_S2.fasta -t 2 --include-children -r //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
PROGRAM START TIME: 02-10-2020 20:19:48
>> STEP 0: PARSING REPORT FILE //data/strepgen/JAM_EMBER_kraken/kraken_bracken_reports/EMB2.report.txt
888 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS //data/strepgen/JAM_EMBER_kraken/EMB2.kraken
42.65 million reads processed
17780 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
17780 read IDs found (42643368 reads processed)
9 read IDs found (42647627 reads processed)
Traceback (most recent call last):
  File "extract_kraken_reads_infinity2.py", line 411, in <module>
    main()
  File "extract_kraken_reads_infinity2.py", line 393, in main
    SeqIO.write(save_seqs2[i], o_file2, "fasta")
KeyError: 'A00892:9:HM2WWDMXX:1:1101:12183:3740'
Thanks
Hi
Just had a look at the fasta output files - the R2 was empty, but the R1 had written something:
>A00892:9:HM2WWDMXX:1:1101:12183:3740
GTCTGAAACTTGAAAGCTGATAATCACGCCAAAGGCGTGTGGTTATTGGCTTTTTTAGCA
AAAAAATAAACGCAAGCTCTCCTGCGTTGTATACTTTAGTCGACCAAAACCAAAATACAA
AGGAGACTTACGTCATGAATATTATAAGACA
Hi,
same in my case:
python /home/BatchProcessFiles/extract_kraken_reads.py -k kraken_test/KH-04612_20200205_MS2_S21_L001.kraken -s KH-04612_20200205_MS2_S21_L001_R1_001.fastq -s2 KH-04612_20200205_MS2_S21_L001_R2_001.fastq -o TEST_r1.fasta -o2 TEST_r2.fasta -t 195 --include-children -r kraken_test/KH-04612_20200205_MS2_S21_L001
PROGRAM START TIME: 02-11-2020 07:10:12
>> STEP 0: PARSING REPORT FILE kraken_test/KH-04612_20200205_MS2_S21_L001
6 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04612_20200205_MS2_S21_L001.kraken
0.61 million reads processed
142676 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
142676 read IDs found (610424 reads processed)
1 read IDs found (610431 reads processed)
Traceback (most recent call last):
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 411, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 393, in main
    SeqIO.write(save_seqs2[i], o_file2, "fasta")
KeyError: 'M04091:337:000000000-CPY2G:1:1101:13819:1829'
I also have something inside the R1 file; however, the R2 is empty, as in richarddhaigh's case:
>M04091:337:000000000-CPY2G:1:1101:13819:1829 <unknown description>
GATGTACAAGGATCTTTAGAAGCCTTAAAAGCAAGCTTAGAAAAGCTTAGAAATGATGAA
ATTAAAGTCAATATCATACAC
Sorry this update took me so long, but I think I found the bug. Please try it again and let me know if you continue to get any other errors?
Hey,
thank you very much! The basic commands are working now. However, the --include-children option is still not working:
(biopython) gwdu103:6 09:21:40 /home/NGS/NGS_Run26_20200205_CampyColi > python /home/BatchProcessFiles/extract_kraken_reads.py -k kraken_test/KH-04612_20200205_MS2_S21_L001.kraken -s KH-04612_20200205_MS2_S21_L001_R1_001.fastq -s2 KH-04612_20200205_MS2_S21_L001_R2_001.fastq -o TEST1exclude.fasta -o2 TEST2exclude.fasta -t 1783257 --exclude --include-children --report kraken_test/KH-04612_20200205_MS2_S21_L001
PROGRAM START TIME: 02-18-2020 08:57:03
>> STEP 0: PARSING REPORT FILE kraken_test/KH-04612_20200205_MS2_S21_L001
101 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS kraken_test/KH-04612_20200205_MS2_S21_L001.kraken
0 reads processed
Traceback (most recent call last):
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 411, in <module>
    main()
  File "/home/BatchProcessFiles/extract_kraken_reads.py", line 283, in main
    save_taxids[tax_id] += 1
KeyError: 195
Hi
Thanks for the update. I'm currently locked out of the NoMachine interface for our system as IT are "fixing" something, so I can't test the new script just yet, but will do asap.
R
I believe the exclude option was what caused your error this time @HumbiFred. I fixed the script again....and I'm hoping there are no more bugs, but we will see.
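A toy reproduction of the --exclude counting fix (not the script's literal code): in exclude mode the counter dictionary starts empty, so a count has to be created on first sight of a taxid rather than incremented blindly, which is what raised KeyError: 195:

exclude_taxids = {1783257: 0}   # taxids the user asked to exclude
save_taxids = {}                # counts start empty in exclude mode
for tax_id, read_id in [(195, "readA"), (239935, "readB"), (1783257, "readC")]:
    if tax_id in exclude_taxids:
        continue                # skip excluded taxids
    # the fix: create the counter on first sight instead of assuming the key exists
    save_taxids[tax_id] = save_taxids.get(tax_id, 0) + 1
print(save_taxids)              # {195: 1, 239935: 1}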
Hey Jennifer,
now it's working perfectly! Thank you so much for your effort. Very nice and handy tool!
Have a nice day
Hi Jennifer,
the script is running fine with the --exclude flag. However, I am getting the following error with the latest version of your script when --exclude is not set.
/scratch1/lennard/bin/KrakenTools/extract_kraken_reads.py -k ERR1463892_reads.out --include-children -t 2 -o test1.fq -o2 test2.fq -s /scratch1/lennard/Data/Anna/ERR1463892_1.fastq.gz -s2 /scratch1/lennard/Data/Anna/ERR1463892_2.fastq.gz -r ERR1463892.report
PROGRAM START TIME: 02-19-2020 12:19:45
>> STEP 0: PARSING REPORT FILE ERR1463892.report
198 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS ERR1463892_reads.out
0 reads processed
Traceback (most recent call last):
  File "/scratch1/lennard/bin/KrakenTools/extract_kraken_reads.py", line 417, in <module>
    main()
  File "/scratch1/lennard/bin/KrakenTools/extract_kraken_reads.py", line 285, in main
    elif (tax_id not in exclude_taxids) and args.exclude:
UnboundLocalError: local variable 'exclude_taxids' referenced before assignment
Oh I hate those errors. I just added another line that should fix it @eppinglen
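Judging from the script version pasted later in this thread, the added line initialises exclude_taxids unconditionally, so the non-exclude branch can safely reference it. A hypothetical helper showing the idea:

def init_taxid_sets(requested_taxids, exclude):
    # exclude_taxids is always defined, so later references to it
    # never raise UnboundLocalError, even when --exclude is off.
    save_taxids = {int(t): 0 for t in requested_taxids}
    exclude_taxids = {}              # the added line
    if exclude:
        exclude_taxids, save_taxids = save_taxids, {}
    return save_taxids, exclude_taxids

# init_taxid_sets(["2"], exclude=False) -> ({2: 0}, {})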
Hi
Latest version is running fine and --include-children is working correctly for me now. I haven't tried the --exclude function, but will have a go later. It still doesn't like the fastq files being in gz, but that is a minor niggle easily bypassed.
Thanks very much for sorting this out - much appreciated.
R
Thanks for all of the testing....
@richarddhaigh, I'm not sure why it's not happy with the gzipped files. When I test on mine, as long as they have the ".gz" extension, they open fine.
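Worth noting: the script chooses gzip versus plain text purely by filename extension (this matches the version pasted later in this thread), so a file whose name and actual compression disagree would also trip the FASTA/FASTQ check. A sketch of that choice, as a hypothetical wrapper:

import gzip

def open_seq_file(path):
    # Extension-based choice: ".gz" names go through gzip in text mode,
    # everything else is read as plain text.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")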
I have a similar problem to the original post here. I am running the v1.0.1 release with the --include-children option following a kraken2 run with --use-names and --report-minimizer-data options.
I receive this error when running with the --include-children or --include-parents options, but it works fine without these options.
Run with error:
extract_kraken_reads.py --fastq-output --include-children -k $KrakenOut -r $KrakenReport -s $Fasta -o $OutDir/${Strain}_${Target}_${TaxaID}_pacbio.fastq -t $TaxaID
PROGRAM START TIME: 04-14-2021 14:47:57
>> STEP 0: PARSING REPORT FILE analysis/kraken2/pacbio/B.tabaci/MEAM1_BC3/MEAM1_BC3_kraken2_report.txt
Traceback (most recent call last):
File "/mnt/beegfs/home/aa0377h/prog/kraken2/KrakenTools-1.0.1/extract_kraken_reads.py", line 433, in <module>
main()
File "/mnt/beegfs/home/aa0377h/prog/kraken2/KrakenTools-1.0.1/extract_kraken_reads.py", line 223, in main
while level_num != (prev_node.level_num + 1):
AttributeError: 'int' object has no attribute 'level_num'
Successful run without --include-children or --include-parents options:
extract_kraken_reads.py --fastq-output -k $KrakenOut -r $KrakenReport -s $Fasta -o $OutDir/${Strain}_${Target}_${TaxaID}_pacbio.fastq -t $TaxaID
PROGRAM START TIME: 04-14-2021 15:03:43
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS analysis/kraken2/pacbio/B.tabaci/MEAM1_BC3/MEAM1_BC3_kraken2_out.txt
2.31 million reads processed
34183 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
34183 read IDs found (2.31 mill reads processed)
34183 reads printed to file
Generated file: qc_dna/pacbio_filtered/B.tabaci/MEAM1_BC3/R.sp./MEAM1_BC3_R.sp._1182263_pacbio.fastq
PROGRAM END TIME: 04-14-2021 15:21:45
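A guess at the failure mode above, unconfirmed: prev_node starts as the integer sentinel -1 and is only replaced once the report's root line (taxid 1) has been parsed, so if the extra columns written by --report-minimizer-data keep that line from being recognised, the tree walk dereferences the sentinel and dies exactly as shown. A defensive sketch of a guard (hypothetical, not in the script):

class Node:
    def __init__(self, taxid, level_num, parent=None):
        self.taxid = taxid
        self.level_num = level_num
        self.parent = parent

prev_node = -1   # integer sentinel used before the root line is parsed
if not isinstance(prev_node, Node):
    # guard instead of touching prev_node.level_num on the sentinel
    raise SystemExit("ERROR: report root (taxid 1) was never parsed - "
                     "check the report's column format")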
Hi @jenniferlu717 and others,
I am new to bioinformatics, and I got an error when running python extract_kraken_reads.py.
The message was:
  File "extract_kraken_reads.py", line 7
    <!DOCTYPE html>
    ^
SyntaxError: invalid syntax
Can you please advise? Thanks in advance.
Hi vdnadung
Just another kraken user here, but I will see if I can help.
Hard to diagnose just from the error - could you give a bit more detail of what you were trying to do and exactly what command you used?
R
Hi Richard,
That would be great and much appreciated.
I use Ubuntu. From the working directory containing the extract_kraken_reads.py file, the myfile.kraken file, and the read1 and read2 files, I executed: python extract_kraken_reads.py -k myfile.kraken -s1 reads1.fq.gz -s2 reads2.fq.gz -o myfile.reads.fasta -t 11269
The error message popped up as above. I get the same error message even if I just execute python extract_kraken_reads.py on its own.
If you need any other information, please let me know.
Thank you. Dung
Hi Dung
Just a few quick observations from looking at the commands I've used successfully.
I used biopython for running this script rather than "ordinary" python - I can't see anything in my notes, but I guess there was a reason for this. You have two input fastq files but only one output fasta; I would try adding "-o2 myfile.reads.reverse.fasta". Also, I never got the script to work with fastq.gz and had to unzip the files first - I don't know why, but it's obviously something in my set-up, as others haven't had problems - perhaps also worth a try for you.
Hope that helps
R
Hi Richard,
Thank you for your help. I have tried all your advice (loading the biopython module instead of ordinary python, adding an -o2 file, using .fq instead of .fq.gz files), but the problem remains. Again, even if I executed only "python extract_kraken_reads.py", the error message was the same. Can you please send me your extract_kraken_reads.py file to try?
Thank you and best wishes Dung
Hi Dung
This is a version Jennifer re-edited for us back in Sept 2022, and it is the one I was last using in October 2023 - I can see she has updated it again more recently, but I haven't tried that newer version.
######################################################################
#
######################################################################
import os, sys, argparse
import gzip
from time import gmtime
from time import strftime
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
#################################################################################
class Tree(object):
    'Tree node.'
    def __init__(self, taxid, level_num, level_id, children=None, parent=None):
        self.taxid = taxid
        self.level_num = level_num
        self.level_id = level_id
        self.children = []
        self.parent = parent
        if children is not None:
            for child in children:
                self.add_child(child)
    def add_child(self, node):
        assert isinstance(node, Tree)
        self.children.append(node)
#################################################################################
def process_kraken_output(kraken_line):
    l_vals = kraken_line.split('\t')
    if len(l_vals) < 5:
        return [-1, '']
    if "taxid" in l_vals[2]:
        temp = l_vals[2].split("taxid ")[-1]
        tax_id = temp[:-1]
    else:
        tax_id = l_vals[2]
    read_id = l_vals[1]
    if (tax_id == 'A'):
        tax_id = 81077
    else:
        tax_id = int(tax_id)
    return [tax_id, read_id]
def process_kraken_report(report_line):
    l_vals = report_line.strip().split('\t')
    if len(l_vals) < 5:
        return []
    try:
        int(l_vals[1])
    except ValueError:
        return []
    try:
        taxid = int(l_vals[-3])
        level_type = l_vals[-2]
        map_kuniq = {'species':'S', 'genus':'G', 'family':'F',
            'order':'O', 'class':'C', 'phylum':'P', 'superkingdom':'D',
            'kingdom':'K'}
        if level_type not in map_kuniq:
            level_type = '-'
        else:
            level_type = map_kuniq[level_type]
    except ValueError:
        taxid = int(l_vals[-2])
        level_type = l_vals[-3]
    #Get spaces to determine level num
    spaces = 0
    for char in l_vals[-1]:
        if char == ' ':
            spaces += 1
        else:
            break
    level_num = int(spaces/2)
    return [taxid, level_num, level_type]
################################################################################
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-k', dest='kraken_file', required=True,
        help='Kraken output file to parse')
    parser.add_argument('-s', '-s1', '-1', '-U', dest='seq_file1', required=True,
        help='FASTA/FASTQ File containing the raw sequence letters.')
    parser.add_argument('-s2', '-2', dest='seq_file2', default="",
        help='2nd FASTA/FASTQ File containing the raw sequence letters (paired).')
    parser.add_argument('-t', "--taxid", dest='taxid', required=True,
        nargs='+',
        help='Taxonomy ID[s] of reads to extract (space-delimited)')
    parser.add_argument('-o', "--output", dest='output_file', required=True,
        help='Output FASTA/Q file containing the reads and sample IDs')
    parser.add_argument('-o2', "--output2", dest='output_file2', required=False, default='',
        help='Output FASTA/Q file containing the second pair of reads [required for paired input]')
    parser.add_argument('--append', dest='append', action='store_true',
        help='Append the sequences to the end of the output FASTA file specified.')
    parser.add_argument('--noappend', dest='append', action='store_false',
        help='Create a new FASTA file containing sample sequences and IDs \
            (rewrite if existing) [default].')
    parser.add_argument('--max', dest='max_reads', required=False,
        default=100000000, type=int,
        help='Maximum number of reads to save [default: 100,000,000]')
    parser.add_argument('-r', '--report', dest='report_file', required=False,
        default="",
        help='Kraken report file. [required only if --include-parents/children \
            is specified]')
    parser.add_argument('--include-parents', dest="parents", required=False,
        action='store_true', default=False,
        help='Include reads classified at parent levels of the specified taxids')
    parser.add_argument('--include-children', dest='children', required=False,
        action='store_true', default=False,
        help='Include reads classified more specifically than the specified taxids')
    parser.add_argument('--exclude', dest='exclude', required=False,
        action='store_true', default=False,
        help='Instead of finding reads matching specified taxids, finds all reads NOT matching specified taxids')
    parser.add_argument('--fastq-output', dest='fastq_out', required=False,
        action='store_true', default=False,
        help='Print output FASTQ reads [requires input FASTQ, default: output is FASTA]')
    parser.set_defaults(append=False)
    args = parser.parse_args()
    #Start Program
    time = strftime("%m-%d-%Y %H:%M:%S", gmtime())
    sys.stdout.write("PROGRAM START TIME: " + time + '\n')
    #Check input
    if (len(args.output_file2) == 0) and (len(args.seq_file2) > 0):
        sys.stderr.write("Must specify second output file -o2 for paired input\n")
        sys.exit(1)
    #Initialize taxids
    save_taxids = {}
    for tid in args.taxid:
        save_taxids[int(tid)] = 0
    main_lvls = ['R','K','D','P','C','O','F','G','S']
    #STEP 0: READ IN REPORT FILE AND GET ALL TAXIDS
    if args.parents or args.children:
        #check that report file exists
        if args.report_file == "":
            sys.stderr.write(">> ERROR: --report not specified.")
            sys.exit(1)
        sys.stdout.write(">> STEP 0: PARSING REPORT FILE %s\n" % args.report_file)
        #create tree and save nodes with taxids in the list
        base_nodes = {}
        r_file = open(args.report_file, 'r')
        prev_node = -1
        for line in r_file:
            #extract values
            report_vals = process_kraken_report(line)
            if len(report_vals) == 0:
                continue
            [taxid, level_num, level_id] = report_vals
            if taxid == 0:
                continue
            #tree root
            if taxid == 1:
                level_id = 'R'
                root_node = Tree(taxid, level_num, level_id)
                prev_node = root_node
                #save if needed
                if taxid in save_taxids:
                    base_nodes[taxid] = root_node
                continue
            #move to correct parent
            while level_num != (prev_node.level_num + 1):
                prev_node = prev_node.parent
            #determine correct level ID
            if level_id == '-' or len(level_id) > 1:
                if prev_node.level_id in main_lvls:
                    level_id = prev_node.level_id + '1'
                else:
                    num = int(prev_node.level_id[-1]) + 1
                    level_id = prev_node.level_id[:-1] + str(num)
            #make node
            curr_node = Tree(taxid, level_num, level_id, None, prev_node)
            prev_node.add_child(curr_node)
            prev_node = curr_node
            #save if taxid matches
            if taxid in save_taxids:
                base_nodes[taxid] = curr_node
        r_file.close()
        #FOR SAVING PARENTS
        if args.parents:
            #For each node saved, traverse up the tree and save each taxid
            for tid in base_nodes:
                curr_node = base_nodes[tid]
                while curr_node.parent != None:
                    curr_node = curr_node.parent
                    save_taxids[curr_node.taxid] = 0
        #FOR SAVING CHILDREN
        if args.children:
            for tid in base_nodes:
                curr_nodes = base_nodes[tid].children
                while len(curr_nodes) > 0:
                    #For this node
                    curr_n = curr_nodes.pop()
                    if curr_n.taxid not in save_taxids:
                        save_taxids[curr_n.taxid] = 0
                    #Add all children
                    if curr_n.children != None:
                        for child in curr_n.children:
                            curr_nodes.append(child)
    ##############################################################################
    sys.stdout.write("\t%i taxonomy IDs to parse\n" % len(save_taxids))
    sys.stdout.write(">> STEP 1: PARSING KRAKEN FILE FOR READIDS %s\n" % args.kraken_file)
    #Initialize values
    count_kraken = 0
    read_line = -1
    exclude_taxids = {}
    if args.exclude:
        exclude_taxids = save_taxids
        save_taxids = {}
    #PROCESS KRAKEN FILE FOR CLASSIFIED READ IDS
    k_file = open(args.kraken_file, 'r')
    sys.stdout.write('\t0 reads processed')
    sys.stdout.flush()
    #Evaluate each sample in the kraken file
    save_readids = {}
    save_readids2 = {}
    for line in k_file:
        count_kraken += 1
        if (count_kraken % 10000 == 0):
            sys.stdout.write('\r\t%0.2f million reads processed' % float(count_kraken/1000000.))
            sys.stdout.flush()
        #Parse line for results
        [tax_id, read_id] = process_kraken_output(line)
        if tax_id == -1:
            continue
        #Skip if reads are human/artificial/synthetic
        if (tax_id in save_taxids) and not args.exclude:
            save_taxids[tax_id] += 1
            save_readids2[read_id] = 0
            save_readids[read_id] = 0
        elif (tax_id not in exclude_taxids) and args.exclude:
            if tax_id not in save_taxids:
                save_taxids[tax_id] = 1
            else:
                save_taxids[tax_id] += 1
            save_readids2[read_id] = 0
            save_readids[read_id] = 0
        if len(save_readids) >= args.max_reads:
            break
    #Update user
    k_file.close()
    sys.stdout.write('\r\t%0.2f million reads processed\n' % float(count_kraken/1000000.))
    sys.stdout.write('\t%i read IDs saved\n' % len(save_readids))
    ##############################################################################
    #Sequence files
    seq_file1 = args.seq_file1
    seq_file2 = args.seq_file2
    ####TEST IF INPUT IS FASTA OR FASTQ
    if(seq_file1[-3:] == '.gz'):
        s_file1 = gzip.open(seq_file1, 'rt')
    else:
        s_file1 = open(seq_file1, 'rt')
    first = s_file1.readline()
    if len(first) == 0:
        sys.stderr.write("ERROR: sequence file's first line is blank\n")
        sys.exit(1)
    if first[0] == ">":
        filetype = "fasta"
    elif first[0] == "@":
        filetype = "fastq"
    else:
        sys.stderr.write("ERROR: sequence file must be FASTA or FASTQ\n")
        sys.exit(1)
    s_file1.close()
    if filetype != 'fastq' and args.fastq_out:
        sys.stderr.write('ERROR: for FASTQ output, input file must be FASTQ\n')
        sys.exit(1)
    ####ACTUALLY OPEN FILE
    if(seq_file1[-3:] == '.gz'):
        #Zipped Sequence Files
        s_file1 = gzip.open(seq_file1, 'rt')
        if len(seq_file2) > 0:
            s_file2 = gzip.open(seq_file2, 'rt')
    else:
        s_file1 = open(seq_file1, 'r')
        if len(seq_file2) > 0:
            s_file2 = open(seq_file2, 'r')
    #PROCESS INPUT FILE AND SAVE FASTA FILE
    sys.stdout.write(">> STEP 2: READING SEQUENCE FILES AND WRITING READS\n")
    sys.stdout.write('\t0 read IDs found (0 mill reads processed)')
    sys.stdout.flush()
    #Open output file
    if (args.append):
        o_file = open(args.output_file, 'a')
        if args.output_file2 != '':
            o_file2 = open(args.output_file2, 'a')
    else:
        o_file = open(args.output_file, 'w')
        if args.output_file2 != '':
            o_file2 = open(args.output_file2, 'w')
    #Process SEQUENCE 1 file
    count_seqs = 0
    count_output = 0
    for record in SeqIO.parse(s_file1, filetype):
        count_seqs += 1
        #Print update
        if (count_seqs % 1000 == 0):
            sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
            sys.stdout.flush()
        #Check ID
        test_id = str(record.id)
        test_id2 = test_id
        if ("/1" in test_id) or ("/2" in test_id):
            test_id2 = test_id[:-2]
        #Sequence found
        if test_id in save_readids or test_id2 in save_readids:
            count_output += 1
            #Print update
            sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
            sys.stdout.flush()
            #Save to file
            if args.fastq_out:
                SeqIO.write(record, o_file, "fastq")
            else:
                SeqIO.write(record, o_file, "fasta")
        #If no more reads to find
        if len(save_readids) == count_output:
            break
    #Close files
    s_file1.close()
    o_file.close()
    sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)\n' % (count_output, float(count_seqs/1000000.)))
    sys.stdout.flush()
    if len(seq_file2) > 0:
        count_output = 0
        count_seqs = 0
        sys.stdout.write('\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
        sys.stdout.flush()
        for record in SeqIO.parse(s_file2, filetype):
            count_seqs += 1
            #Print update
            if (count_seqs % 1000 == 0):
                sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
                sys.stdout.flush()
            test_id = str(record.id)
            test_id2 = test_id
            if ("/1" in test_id) or ("/2" in test_id):
                test_id2 = test_id[:-2]
            #Sequence found
            if test_id in save_readids or test_id2 in save_readids:
                count_output += 1
                sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)' % (count_output, float(count_seqs/1000000.)))
                sys.stdout.flush()
                #Save to file
                if args.fastq_out:
                    SeqIO.write(record, o_file2, "fastq")
                else:
                    SeqIO.write(record, o_file2, "fasta")
            #If no more reads to find
            if len(save_readids) == count_output:
                break
        s_file2.close()
        o_file2.close()
        #End Program
        sys.stdout.write('\r\t%i read IDs found (%0.2f mill reads processed)\n' % (count_output, float(count_seqs/1000000.)))
    #End Program
    sys.stdout.write('\t' + str(count_output) + ' reads printed to file\n')
    sys.stdout.write('\tGenerated file: %s\n' % args.output_file)
    if args.output_file2 != '':
        sys.stdout.write('\tGenerated file: %s\n' % args.output_file2)
    #End of program
    time = strftime("%m-%d-%Y %H:%M:%S", gmtime())
    sys.stdout.write("PROGRAM END TIME: " + time + '\n')
    sys.exit(0)
#################################################################################
if __name__ == "__main__":
    main()
#################################################################################
#################################END OF PROGRAM##################################
#################################################################################
Hi Richard,
Thanks very much. It works now.
To get extract_kraken_reads.py, I followed a document online: right-click on the extract_kraken_reads.py file, then "save as". From the text you sent, I realised it's different from the extract_kraken_reads.py I saved, which looks as below. This explains the error message.
And you are correct that it seems to require biopython.
Many thanks for your help. Best wishes Dung
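For future readers who hit the same <!DOCTYPE html> SyntaxError: right-clicking "save as" on the GitHub page downloads the HTML viewer rather than the script itself. Fetching the raw file avoids that; something like the following should work (URL pattern assumed from the repo layout, adjust the branch as needed):
wget https://raw.githubusercontent.com/jenniferlu717/KrakenTools/master/extract_kraken_reads.py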
Hi
I'm trying to extract all bacterial reads from a paired-end kraken analysis, but I am getting an error when the script tries to parse the kraken.report.txt. I'm running under the most recent version of Biopython and have just updated to your latest script - the error I'm getting is:
PROGRAM START TIME: 02-06-2020 17:27:02
Sorry, my python is pretty poor, so I can't fathom whether it is a script problem or a problem in my report file - any thoughts?
Thanks
Rich