smdabdoub / kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).
MIT License

Compatibility with krakenhll #5

Open fconstancias opened 6 years ago

fconstancias commented 6 years ago

Hi, thanks a lot for your useful script. I would like to use kraken-biom to process krakenhll output (krakenhll adds functionality to kraken that decreases the false-positive detection rate).

The krakenhll report is a bit different from the kraken report, and I guess that is why I got the following error when running kraken-biom on a report generated by krakenhll:

Traceback (most recent call last):
  File "/usr/local/bioinfo/kraken-biom/1.0.1a/venv/bin/kraken-biom", line 11, in <module>
    sys.exit(main())
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/kraken_biom.py", line 377, in main
    biomT = create_biom_table(sample_counts, taxa)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/kraken_biom.py", line 196, in create_biom_table
    generated_by=gen_str, input_is_dense=True)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/biom/table.py", line 397, in __init__
    errcheck(self)
  File "/gs7k1/binaries/kraken-biom/1.0.1a/venv/lib/python3.4/site-packages/biom/err.py", line 472, in errcheck
    raise ret
biom.exception.TableException: Number of sample IDs differs from matrix size!

e.g.:

kraken report:

98.27  98274  98274  U  0        unclassified
1.73   1726   74     -  1        root
1.60   1601   9      -  131567   cellular organisms
1.56   1560   142    D  2        Bacteria
1.06   1056   77     P  1224     Proteobacteria
0.61   615    62     C  28211    Alphaproteobacteria
0.35   351    3      O  204455   Rhodobacterales
0.34   336    66     F  31989    Rhodobacteraceae
0.06   55     0      G  97050    Ruegeria
0.03   34     0      S  89184    Ruegeria pomeroyi
0.03   34     34     -  246200   Ruegeria pomeroyi DSS-3
0.02   21     21     S  292414   Ruegeria sp. TM1040
0.04   41     11     G  302485   Phaeobacter
0.03   27     0      S  60890    Phaeobacter gallaeciensis
0.02   17     17     -  1423144  Phaeobacter gallaeciensis DSM 26640
0.01   10     10     -  383629   Phaeobacter gallaeciensis 2.10
0.00   3      0      S  221822   Phaeobacter inhibens
0.00   3      3      -  391619   Phaeobacter inhibens DSM 17395
0.03   31     0      G  1060     Rhodobacter

krakenhll report:

%       reads   taxReads  kmers      dup   cov        taxID    rank          taxName
99.12   991219  991219    349731445  1.17  NA         0        no rank       unclassified
0.8781  8781    0         43875      1.98  4.21e-05   1        no rank       root
0.8781  8781    0         43875      1.98  4.21e-05   131567   no rank       cellular organisms
0.8781  8781    55        43875      1.98  4.21e-05   2157     superkingdom  Archaea
0.8388  8388    101       41001      1.71  4.384e-05  28890    phylum        Euryarchaeota
0.657   6570    747       30799      1.43  5.622e-05  183963   class         Halobacteria
0.2435  2435    85        10206      1.32  4.847e-05  1644055  order         Haloferacales
0.149   1490    68        5681       1.39  5.224e-05  1963271  family        Halorubraceae
0.0896  896     371       3346       1.39  4.358e-05  56688    genus         Halorubrum
0.0037  37      37        99         1.42  4.197e-05  1419722  species       Halorubrum sp. SD626R
0.0034  34      34        144        1.16  6.174e-05  1765655  species       Halorubrum tropicale

Do you have any idea how to use kraken-biom with this different format? Many thanks.

smdabdoub commented 6 years ago

Thanks for bringing KrakenHLL to my attention, I wasn't aware of it previously.

The new format is the same as the original, with new columns inserted in the middle, although your example appears to contain a 'cov' column that is not listed in the KrakenHLL GitHub README.

In any case, as long as you just want the same functionality from kraken-biom with the new format, the changes would be minimal. If you wanted to include the additional columns as metadata in the output BIOM table or use the kmer information to limit which reads are included in the output (for example), that would take some additional work.
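In the meantime, one possible workaround is to convert the KrakenHLL report back to the six-column Kraken layout before passing it to kraken-biom. The following is only a rough sketch, not how kraken-biom itself handles the format. It assumes the report is tab-separated with the nine columns shown in the example above (including 'cov'), and the function names and the rank-name-to-code mapping are hypothetical, inferred from the two example reports in this thread.

```python
# Hypothetical mapping from KrakenHLL rank names to the one-letter
# rank codes used in standard Kraken reports (assumption, not from
# either tool's source code).
RANK_CODES = {
    "superkingdom": "D",
    "kingdom": "K",
    "phylum": "P",
    "class": "C",
    "order": "O",
    "family": "F",
    "genus": "G",
    "species": "S",
}


def krakenhll_to_kraken_row(line):
    """Convert one tab-separated KrakenHLL row to the six Kraken columns.

    Assumed input columns:
    %, reads, taxReads, kmers, dup, cov, taxID, rank, taxName
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    pct, reads, tax_reads, _kmers, _dup, _cov, tax_id, rank, name = fields

    if name.strip() == "unclassified":
        code = "U"
    else:
        # Anything not in the mapping (e.g. "no rank") becomes "-".
        code = RANK_CODES.get(rank.strip(), "-")

    # The name is kept untouched so any leading indentation survives.
    return "\t".join([pct, reads, tax_reads, code, tax_id, name])


def convert_report(lines):
    """Yield converted rows, skipping blank lines and the '%' header."""
    for line in lines:
        if not line.strip() or line.startswith("%"):
            continue
        yield krakenhll_to_kraken_row(line)
```

Writing the rows produced by `convert_report(open(path))` to a new file should give something closer to what kraken-biom expects, but the k-mer, duplication and coverage columns are simply dropped, so this does not cover the metadata or k-mer filtering use cases mentioned above.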