smdabdoub / kraken-biom

Create BIOM-format tables (http://biom-format.org) from Kraken output (http://ccb.jhu.edu/software/kraken/, https://github.com/DerrickWood/kraken).
MIT License
47 stars 15 forks source link

compatibility with bracken #2

Closed ajaybabu27 closed 7 years ago

ajaybabu27 commented 7 years ago

Will this tool still be compatible with the kreport output from bracken ?

smdabdoub commented 7 years ago

Yes. The 5/31/2016 update on the Bracken site says:

This release adds another output file to the est_abundance.py script. The new output file (with a _bracken.report extension) is in the same style as the Kraken report file, with reads distributed to the desired level.

kraken-biom expects Kraken report files, so there should be no problem using Bracken output with kraken-biom.

But please let me know if you have any issues with it.

ajaybabu27 commented 7 years ago

Thanks for getting back to me. I compared the report file from Bracken and the biom files generated by kraken-biom tool. There seems to be a large issue here.

Bracken redistributes reads from higher to lower taxon levels and thus the report files it generates contain "zero" reads directly assigned to taxon levels above species (S). This would be the third column in the report file. The second column now contains a cumulative sum of the reads from S to D.

I am guessing that in order to allow your code to process Bracken output you would have to process the second column instead of the third column as it is done in the Kraken output.

I have pasted a sample for comparison of both kraken and bracken outputs:

Kraken Output:

26.12 9918653 9918653 U 0 unclassified 73.88 28059325 1182 - 1 root 73.87 28055310 98 - 131567 cellular organisms 73.75 28007044 8500 D 2 Bacteria 68.22 25907794 15534 - 1783272 Terrabacteria group 37.35 14185842 53535 P 1239 Firmicutes 36.70 13937530 58455 C 91061 Bacilli 36.45 13842775 70723 O 186826 Lactobacillales 33.49 12719109 40070 F 81852 Enterococcaceae 33.37 12674215 236758 G 1350 Enterococcus 32.47 12331937 10495618 S 1352 Enterococcus faecium

Bracken Output:

26.12 9918653 9918653 U 0 unclassified 73.88 28059300 0 R 1 root 73.88 28056420 0 R1 131567 cellular organisms 73.75 28008249 0 D 2 Bacteria 68.24 25917279 0 D1 1783272 Terrabacteria group 37.39 14200770 0 P 1239 Firmicutes 36.88 14004816 0 C 91061 Bacilli 36.78 13967720 0 O 186826 Lactobacillales 34.00 12911294 0 F 81852 Enterococcaceae 33.98 12906404 0 G 1350 Enterococcus 33.70 12798327 12798327 S 1352 Enterococcus faecium

smdabdoub commented 7 years ago

The third column still contains the reads assigned directly to species, which after running Bracken are the only assigned reads. So the only change you should need to make when running kraken-biom would be to set --max to S. That way both min and max are set to Species and only the species-assigned reads will be extracted from the report file and written to the BIOM table.

ajaybabu27 commented 7 years ago

That sounds good to me. Thanks for taking your time to review it.

By the way, I get the following error when I use --max S argument:

ERROR: Max and Min ranks are out of order: S < S

The default command should still work fine without the --max argument, since taxons with zero values across all samples are discarded by kraken-biom.

Cheers, Ajay.

smdabdoub commented 7 years ago

Happy to help. And thanks for passing along the error message.

Setting both to S is actually a part of the test suite for the parsing, but I apparently did't test it before it gets there. It looks like I have a >= where I should just have a > where I'm sanity-checking the command-line parameters. Should be an easy fix, I'll open an issue to remind myself.