jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0
273 stars 50 forks source link

Strain / genome level count data? #95

Open taltman opened 4 years ago

taltman commented 4 years ago

Looking at the Bracken source code, I see that counts do get down to the level of individual genomes. But this level of count data is not reported to the user.

Seems like it wouldn't be that much additional work to get Bracken to report down to the individual genome level, as other tools do. Could this feature be added?

Thanks!

jenniferlu717 commented 4 years ago

I think the code does give counts down to that level if you specify S1 as the taxonomy level? Are you not seeing this option?

taltman commented 4 years ago

Hi @jenniferlu717 . I don't see the S1 option in the bracken script nor in est_abundance.py in the git repository that I cloned. Can you point me in the right direction?

taltman commented 4 years ago

Also, can you provide more description about the G1 and K7 non-traditional taxonomy level options that are alluded to in the README? Thank you!

jenniferlu717 commented 4 years ago

Have you tried running with "S1" as the option? Bracken should take that option, although the default is "S"

The G1 taxonomy level is the first level inbetween species and genera (so above species, below genera). Sometimes, the taxonomy has an inbetween level for things such as "Mycobacterium tuberculosis complex".

K7 was included because sometimes what happens is that someone might have a custom taxonomy where they did not specify the traditional taxonomy levels, so Kraken basically assigns levels based on distance from root (the number designates distance from root).

taltman commented 4 years ago

Thanks for that tip. So it seems that Bracken accepts any integer suffix to the documented taxonomy level characters? I could specify 'S11235' with no issues...

My next question is whether there is a way to get the Bracken report to use the reference sequence accessions instead of the free-text description of the organism. I know this data is in the Bracken reference data, so there should be a way to extract it. Any advice?

jenniferlu717 commented 4 years ago

Yes integer suffixes work.

Bracken uses only the kmer_distrib file and the report for the classification. Not everyone has the original databaseXmers.kraken file so those accessions may not be available.

You could go back and map the taxids to the accession by parsing the databaseXmers.kraken file if you have it still.