jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0

Wrong docs for https://benlangmead.github.io/aws-indexes/k2 prebuilt DBs #249

Open paulzierep opened 8 months ago

paulzierep commented 8 months ago

When updating the wrapper for the Bracken data manager on Galaxy, we found that the docs at https://benlangmead.github.io/aws-indexes/k2 describe the Bracken DBs as follows: "All packages contain a Kraken 2 database along with Bracken databases built for 50, 75, 100, 150, 200, 250 and 300-mers." But the readme says:

Bracken files (`*.kmer_distrib`) were generated using
```
bracken-build -k 35 -l 50 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 75 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 100 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 150 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 200 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 250 -d 16S_Greengenes_k2db -t 35
```

Here `-l` refers to the read length, while the k-mer length (`-k`) stays at 35. See https://github.com/galaxyproject/tools-iuc/issues/5745 for the complete explanation. Can you confirm that this is indeed a mix-up?

I assume this is connected to the fact that Bracken can select a specific read length from multiple builds:

bracken -d ${KRAKEN_DB} -i ${SAMPLE}.kreport -o ${SAMPLE}.bracken -r ${READ_LEN} -l ${LEVEL} -t ${THRESHOLD}



But it uses this file naming schema: `databaseXmers.kmer_distrib` (which is confusing!). Is there any reason for that?

jenniferlu717 commented 8 months ago

The documentation is not incorrect. "50, 75, 100, 150, 200, 250 and 300-mers" refers to read length. A single Kraken database has only one k-mer length (35 by default for Kraken 2 databases), whereas "50mers" and so on refer to the read length: "mers", not "kmers".

We include all of those files because depending on the user's sample read length, they should use a different bracken file.
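
To make the naming concrete, here is a minimal sketch of how a wrapper might check that a database directory contains the Bracken file for a given read length. It assumes the `databaseXmers.kmer_distrib` naming mentioned above, where X is the read length; the function name `kmer_distrib_for` is hypothetical and not part of Bracken.

```shell
# Hypothetical helper (not part of Bracken): print the path of the Bracken
# distribution file that matches a given read length.
# Assumption: files follow the database<READ_LEN>mers.kmer_distrib naming,
# where the number is the read length, not the k-mer length.
kmer_distrib_for() {
    local db_dir="$1"
    local read_len="$2"
    local file="${db_dir}/database${read_len}mers.kmer_distrib"

    if [ -f "$file" ]; then
        printf '%s\n' "$file"
    else
        echo "no Bracken file for read length ${read_len} in ${db_dir}" >&2
        return 1
    fi
}
```

A wrapper could call `kmer_distrib_for "${KRAKEN_DB}" "${READ_LEN}"` before running `bracken -r ${READ_LEN}` and stop early when no matching file was built for that read length.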