jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0
289 stars 50 forks source link

silva 138, library folder not found #105

Open ganiatgithub opened 4 years ago

ganiatgithub commented 4 years ago

Hi Jennifer,

My apologies in advance if there is an solution already. I'm using bracken together with the pre-built silva 138 db. Command is:

bracken \
-d /home/Staff/uqgni1/tools/kraken2-silva138/16S_SILVA138_k2db \
-i kk2-trial02.report  \
-o trial02.bracken.def \
-w trial02.bracken.report \
-t 1000

Error says:

 >> Selected Options:
       kmer length = 35
       read length = 100
       database    = /home/Staff/uqgni1/tools/kraken2-silva138/16S_SILVA138_k2db
       threads     = 48-k
 >> Checking for Valid Options...
 ERROR: Database library /home/Staff/uqgni1/tools/kraken2-silva138/16S_SILVA138_k2db/library does not exist

But actually the files are in this path:

ll /home/Staff/uqgni1/tools/kraken2-silva138/16S_SILVA138_k2db
total 131604
-rw-rw-r-- 1 uqgni1 uusers   1830513 Mar 27 13:41 database100mers.kmer_distrib
-rw-rw-r-- 1 uqgni1 uusers   2095782 Mar 27 13:38 database150mers.kmer_distrib
-rw-rw-r-- 1 uqgni1 uusers   1801059 Mar 27 13:34 database200mers.kmer_distrib
-rw-rw-r-- 1 uqgni1 uusers   2136055 Mar 27 13:50 database250mers.kmer_distrib
-rw-rw-r-- 1 uqgni1 uusers   1586943 Mar 27 13:42 database50mers.kmer_distrib
-rw-rw-r-- 1 uqgni1 uusers   1693482 Mar 27 13:45 database75mers.kmer_distrib
-rw-rw-r-- 1 uqgni1 uusers 147504304 Mar 27 13:29 hash.k2d
-rw-rw-r-- 1 uqgni1 uusers        56 Mar 27 13:29 opts.k2d
-rw-rw-r-- 1 uqgni1 uusers      2626 Mar 27 14:08 README.md
-rw-rw-r-- 1 uqgni1 uusers    715644 Mar 27 13:28 taxo.k2d

Any suggestions?

Kind regards, Gaofeng

Wrzlprmft commented 4 years ago

This is weird since the error message you get is not part of bracken but of bracken-build. In fact it’s the error message I would expect when running bracken-build with the same arguments. Did you perhaps accidentally mislink bracken-build to bracken?

ganiatgithub commented 4 years ago

Hi,

You're right, I got mixed up with another error message when running bracken-build. I made sure it's the correct call this time, tested with 100k reads that has length between 1.2 to 1.6 kb, and none were distributed by bracken. Here is the log:

 >> Checking for Valid Options...
 >> Running Bracken
/home/Staff/uqgni1/miniconda2/envs/kraken/bin/bracken: line 98: [: trial02.bracken.report: unary operator expected
      >> python src/est_abundance.py -i kk2-trial02.report -o trial02.bracken.def -k /home/Staff/uqgni1/tools/kraken2-silva138/16S_SILVA138_k2db/database100mers.kmer_distrib -l S -t 1000
PROGRAM START TIME: 07-02-2020 21:06:49
BRACKEN SUMMARY (Kraken report: kk2-trial02.report)
    >>> Threshold: 1000
    >>> Number of species in sample: 0
      >> Number of species with reads > threshold: 0
      >> Number of species with reads < threshold: 0
    >>> Total reads in sample: 100000
      >> Total reads kept at species level (reads > threshold): 0
      >> Total reads discarded (species reads < threshold): 0
      >> Reads distributed: 0
      >> Reads not distributed (eg. no species above threshold): 100000
      >> Unclassified reads: 0
BRACKEN OUTPUT PRODUCED: trial02.bracken.def
PROGRAM END TIME: 07-02-2020 21:06:49
  Bracken complete.

I'm guess this can be due to the kmer_distrib files were only generated for 50-250? How to adjust this workflow to work with full length 16S reads?

BTW, the kraken2 report seems fine though:

head kk2-trial02.report
100.00  100000  0   R   1   root
100.00  100000  33  D   3     Bacteria
 91.63  91632   2   P   2375        Proteobacteria
 91.06  91055   624 C   3303          Gammaproteobacteria
 58.64  58636   0   O   26732           Nitrosococcales
 58.62  58625   31  F   26737             Nitrosococcaceae
 58.56  58557   58557   G   26741               Candidatus Nitrosoglobus
  0.03  26  26  G   26746               Nitrosococcus
  0.00  4   4   G   26749               wb1-P19
  0.00  3   3   G   26740               CI75cm.2.12

Kind regards, Gaofeng

Wrzlprmft commented 4 years ago

What happens when you add the option -l G to your bracken call?

ganiatgithub commented 4 years ago

Thanks,-l G worked. None of the reads were classified to species level, is this normal for 16S reads that are 1.2-1.6 kb long? I notice that in the bracken-build run, ideal read length is specified, which is not possible for my case. Will this be an issue?

heather340 commented 4 years ago

Hi! In reference to your first error message (no library) in case anyone else has this issue: when you're building the kraken2 database, don't clean it (kraken2-build --clean --db silva), as this actually removes the library folder that Bracken needs to be able to run.

Now as to your second error you're getting, I'm having the same issue. My understanding is that setting the threshold at 1000 means that any species that has less than 1000 reads (in this case, most of them since the database is mainly single 16S loci) won't receive any additional reads when calculating the abundance distribution. However, even if I set it to a threshold of 1, then I don't get any results at the species level. I'd really like to get some results at the species level... I do get 1,060,046 reads kept at the class level, and 101,604 at the genus level with a threshold of 0.

Is there any way to remove the threshold?

Thanks! Heather

jenniferlu717 commented 4 years ago

@heather340 is completely correct. The threshold you set is too high. The threshold (with a species level estimation) tells Bracken to ignore any species that have < 1000 reads (which I assume is all of them). I would suggest a lower threshold of maybe 10-50.