jenniferlu717 / Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
http://ccb.jhu.edu/software/bracken/index.shtml
GNU General Public License v3.0

Errors in step0 kraken2-build and step1: bracken-build on conda installation #110

Open mutantjoo0 opened 4 years ago

mutantjoo0 commented 4 years ago

Hello Jennifer (@jenniferlu717),

I am new to Bracken and Kraken. I installed them in a new conda environment as follows:

# packages in environment at /mnt/home/leejooy5/miniconda3/envs/bracken:
#
# Name                    Version                   Build  Channel
bracken                   2.6.0            py37h9a982cc_2    bioconda

(bracken) -bash-4.2$ conda list kraken2
# packages in environment at /mnt/home/leejooy5/miniconda3/envs/bracken:
#
# Name                    Version                   Build  Channel
kraken2                   2.0.9beta       pl526hc9558a2_0    bioconda

Following the manual, I ran step 0 (building a Kraken database) for both the standard and the bacteria databases. These are the outputs of step 0:

(bracken) -bash-4.2$ ls -lh kraken_standard_db/
total 6.6M
drwxr-s--- 7 leejooy5 Reguera_Kashefi_Lab 8.0K Aug 12 04:36 library
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 4.2M Aug 12 04:36 seqid2taxid.map
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 2.4M Aug 12 14:23 taxo.k2d.tmp
drwxr-s--- 2 leejooy5 Reguera_Kashefi_Lab 8.0K Aug 12 04:36 taxonomy
(bracken) -bash-4.2$ ls -lh kraken_bacteria_db/
total 2.9M
drwxr-s--- 3 leejooy5 Reguera_Kashefi_Lab 8.0K Aug  8 00:19 library
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 1.8M Aug  8 17:36 seqid2taxid.map
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 1.1M Aug  8 17:43 taxo.k2d.tmp
drwxr-s--- 2 leejooy5 Reguera_Kashefi_Lab 8.0K Aug  8 17:36 taxonomy

One concern from step 0 is a message saying "terminated by signal 13", which appeared while building both the standard and the bacteria databases:

Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Processed 2498 sequences (6284101310 bp)...xargs: cat: terminated by signal 13
/mnt/home/leejooy5/miniconda3/envs/bracken/libexec/build_kraken2_db.sh: line 133: 26972 Done                    list_sequence_files
     26973 Exit 125                | xargs -0 cat
     26974 Killed                  | build_db -k $KRAKEN2_KMER_LEN -l $KRAKEN2_MINIMIZER_LEN -S $KRAKEN2_SEED_TEMPLATE $KRAKEN2XFLAG -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taomy/ -m $seqid2taxid_map_file -c $required_capacity -p $KRAKEN2_THREAD_CT $max_db_flag

Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 14 bits reserved for taxid.
Processed 2436 sequences (4167489052 bp)...xargs: cat: terminated by signal 13
/mnt/home/leejooy5/miniconda3/envs/bracken/libexec/build_kraken2_db.sh: line 133: 21990 Done                    list_sequence_files
     21991 Exit 125                | xargs -0 cat
     21992 Killed                  | build_db -k $KRAKEN2_KMER_LEN -l $KRAKEN2_MINIMIZER_LEN -S $KRAKEN2_SEED_TEMPLATE $KRAKEN2XFLAG -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy/ -m $seqid2taxid_map_file -c $required_capacity -p $KRAKEN2_THREAD_CT $max_db_flag

When I tried step 1 (bracken-build), I got these errors:

(bracken) -bash-4.2$ ls ~/miniconda3/envs/bracken/bin/kraken2-build
/mnt/home/leejooy5/miniconda3/envs/bracken/bin/kraken2-build
(bracken) -bash-4.2$ bracken-build -d kraken_standard_db/ -t 30 -x ~/miniconda3/envs/bracken/bin/kraken2-build
 >> Selected Options:
       kmer length = 35
       read length = 100
       database    = kraken_standard_db/
       threads     = 30
User must first install kraken or kraken2 and/or specify installation directory of kraken/kraken2 using -x flag

(bracken) -bash-4.2$ ls ~/miniconda3/envs/bracken/libexec/kraken2-build
/mnt/home/leejooy5/miniconda3/envs/bracken/libexec/kraken2-build
(bracken) -bash-4.2$ bracken-build -d kraken_standard_db/ -t 30 -x ~/miniconda3/envs/bracken/libexec/kraken2-build
 >> Selected Options:
       kmer length = 35
       read length = 100
       database    = kraken_standard_db/
       threads     = 30
User must first install kraken or kraken2 and/or specify installation directory of kraken/kraken2 using -x flag

I also tried copying the kraken2-build script into my working directory and pointing -x at it, but that did not work either.

(bracken) -bash-4.2$ cp /mnt/home/leejooy5/miniconda3/envs/bracken/libexec/kraken2-build .
(bracken) -bash-4.2$ ls
kraken2-build  kraken_bacteria_db  kraken_standard_db
(bracken) -bash-4.2$ bracken-build -d kraken_standard_db/ -t 30 -x .
 >> Selected Options:
       kmer length = 35
       read length = 100
       database    = kraken_standard_db/
       threads     = 30
User must first install kraken or kraken2 and/or specify installation directory of kraken/kraken2 using -x flag

Alternatively, I tried to follow steps 1a-c, but that gave me another error:

(bracken) -bash-4.2$ kraken2 --db=$kraken_standard_db --threads=10 <( find -L $kraken_standard_db/library \( -name "*.fna" -o -name "*.fasta" -o -name "*.fa" \) -exec cat {} + ) > database.kraken
find: ‘/library’: No such file or directory
Option db requires an argument
kraken2: Must specify DB with either --db or $KRAKEN2_DEFAULT_DB

(bracken) -bash-4.2$ kraken2 --db=kraken_standard_db --threads=10 <( find -L $kraken_standard_db/library \( -name "*.fna" -o -name "*.fasta" -o -name "*.fa" \) -exec cat {} + ) > database.kraken
find: ‘/library’: No such file or directory
kraken2: database ("./kraken_standard_db") does not contain necessary file taxo.k2d
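Both messages above are consistent with the shell variable kraken_standard_db being empty or unset: --db=$kraken_standard_db then expands to --db= and $kraken_standard_db/library to /library. A minimal sketch of a guard against that, assuming bash; the helper name kraken_library_dir is hypothetical and not part of Kraken 2 or Bracken:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the library path for a Kraken 2 database
# directory, and fail instead of silently producing "/library" when the
# directory argument is empty or unset.
kraken_library_dir() {
    local db="${1:-}"
    if [ -z "$db" ]; then
        echo "error: database directory is empty or unset" >&2
        return 1
    fi
    # Strip one trailing slash so "db/" and "db" give the same result.
    printf '%s/library\n' "${db%/}"
}

# Usage (illustrative):
#   kraken_standard_db=kraken_standard_db
#   find -L "$(kraken_library_dir "$kraken_standard_db")" -name '*.fna'
```

Quoting the variable and assigning it before use would also have made the second command search kraken_standard_db/library rather than /library.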

Taken together, I assume that step 0 should have produced taxo.k2d rather than taxo.k2d.tmp, which looks like a temporary file. The problem is that I keep getting the same result: step 0 ends with "xargs: cat: terminated by signal 13" and leaves only taxo.k2d.tmp in the database directory. Could you help me fix this issue?

Thanks, Joo-Young

jenniferlu717 commented 4 years ago

What did you run for step0? None of the Bracken steps will work unless you can build the kraken2 database:

kraken2-build --build --db=kraken_standard_db --threads 10

From there, it should generate taxo.k2d, hash.k2d, and opts.k2d. Without all three, Bracken will not work.
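The requirement above can be checked before running bracken-build. The following is a sketch only, assuming bash; check_kraken2_db is a hypothetical helper name, and treating leftover *.k2d.tmp files as a sign of an interrupted build is an assumption based on the listings in this thread:

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that a Kraken 2 database directory contains
# non-empty taxo.k2d, hash.k2d and opts.k2d files.
check_kraken2_db() {
    local db="${1:-}" status=0 f
    for f in taxo.k2d hash.k2d opts.k2d; do
        if [ ! -s "$db/$f" ]; then
            echo "missing or empty: $db/$f" >&2
            status=1
        fi
    done
    # Assumption: leftover *.k2d.tmp files point to a build that did not finish.
    if ls "$db"/*.k2d.tmp >/dev/null 2>&1; then
        echo "leftover temporary files found in $db" >&2
        status=1
    fi
    return "$status"
}

# Usage (illustrative):
#   check_kraken2_db kraken_standard_db && bracken-build -d kraken_standard_db -t 16
```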

mutantjoo0 commented 4 years ago

Hi Jennifer,

Thank you for your prompt response and support. The command I used to build the standard database and its output are shown below. I am not sure whether the process completed properly.

(bracken) -bash-4.2$ kraken2-build --standard --db kraken_standard_db/ --threads 30 --use-ftp
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 390 projects (604 sequences, 1.02 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 20868 projects (45891 sequences, 84.98 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 10379 projects (13002 sequences, 386.51 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 1 project (639 sequences, 3.27 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Downloading UniVec_Core data from server... done.
Adding taxonomy ID of 28384 to all sequences... done.
Masking low-complexity regions of downloaded library... done.
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.443s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 48971338604 bytes
Capacity estimation complete. [9m48.607s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Processed 2498 sequences (6284101310 bp)...xargs: cat: terminated by signal 13
/mnt/home/leejooy5/miniconda3/envs/bracken/libexec/build_kraken2_db.sh: line 133: 26972 Done                    list_sequence_files
     26973 Exit 125                | xargs -0 cat
     26974 Killed                  | build_db -k $KRAKEN2_KMER_LEN -l $KRAKEN2_MINIMIZER_LEN -S $KRAKEN2_SEED_TEMPLATE $KRAKEN2XFLAG -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taomy/ -m $seqid2taxid_map_file -c $required_capacity -p $KRAKEN2_THREAD_CT $max_db_flag

Previously, without the --use-ftp option, I got a different error:

(bracken) -bash-4.2$ kraken2-build --standard --db kraken_standard_db --threads 24
Downloading taxonomy tree data...rsync: error while loading shared libraries: libiconv.so.2: cannot open shared object file: No such file or directory

jenniferlu717 commented 4 years ago

It looks like the database downloaded fine but didn't build. You don't need to run that command again, but you will need to run the build step again (no need for --use-ftp here):

kraken2-build --build --db kraken_standard_db --threads 24

I'm not 100% sure why it broke, but how much RAM does your system have?

mutantjoo0 commented 4 years ago

I tried again. It was again terminated by signal 13 and left a taxo.k2d.tmp file, as shown below.

(bracken) -bash-4.2$ kraken2-build --standard --db standard --threads 24 --use-ftp
Downloading taxonomy tree data... done.
Untarring taxonomy tree data... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 390 projects (604 sequences, 1.02 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 20970 projects (46145 sequences, 85.39 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing ftp file transfer of requested files
rsync_from_ncbi.pl: unable to download all/GCF/000/849/405/GCF_000849405.1_ViralProj14717/GCF_000849405.1_ViralProj14717_genomic.fna.gz: Idle timeout (60 seconds): closing control connection

rsync_from_ncbi.pl: unable to download all/GCF/000/915/475/GCF_000915475.1_ViralProj239432/GCF_000915475.1_ViralProj239432_genomic.fna.gz: [Net::FTP] Connection closed
rsync_from_ncbi.pl: unable to download all/GCF/004/128/475/GCF_004128475.1_ASM412847v1/GCF_004128475.1_ASM412847v1_genomic.fna.gz: [Net::FTP] Connection closed
.
.
.
Processed 10388 projects (7511 sequences, 220.28 Mbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library... done.
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences
Processed 1 project (639 sequences, 3.27 Gbp)... done.
All files processed, cleaning up extra sequence files... done, library complete.
Downloading UniVec_Core data from server... done.
Adding taxonomy ID of 28384 to all sequences... done.
Masking low-complexity regions of downloaded library... done.
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.245s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 49068412340 bytes
Capacity estimation complete. [9m54.634s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Processed 2858 sequences (7503328991 bp)...xargs: cat: terminated by signal 13
/mnt/home/leejooy5/miniconda3/envs/bracken/libexec/build_kraken2_db.sh: line 133: 25862 Done                    list_sequence_files
     25863 Exit 125                | xargs -0 cat
     25864 Killed                  | build_db -k $KRAKEN2_KMER_LEN -l $KRAKEN2_MINIMIZER_LEN -S $KRAKEN2_SEED_TEMPLATE $KRAKEN2XFLAG -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy/ -m $seqid2taxid_map_file -c $required_capacity -p $KRAKEN2_THREAD_CT $max_db_flag

(bracken) -bash-4.2$ ls standard/
library  seqid2taxid.map  taxo.k2d.tmp  taxonomy
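The run above also reports several failed downloads ("unable to download ..."), and its third library ends with 7511 sequences where the earlier run had 13002. A small sketch for listing the affected files from a saved log, assuming the output was captured to a file; failed_downloads and build.log are hypothetical names:

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a saved kraken2-build log on stdin and print the
# remote paths that rsync_from_ncbi.pl reported as failed downloads.
failed_downloads() {
    # Capture everything between "unable to download " and the next colon.
    sed -n 's/^rsync_from_ncbi\.pl: unable to download \([^:]*\):.*/\1/p'
}

# Usage (illustrative):
#   kraken2-build --standard --db standard --threads 24 --use-ftp 2>&1 | tee build.log
#   failed_downloads < build.log | wc -l
```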

I am using an HPC dev node with 377G of memory. After the run, I checked memory usage as follows:

(bracken) -bash-4.2$ free -mh
              total        used        free      shared  buff/cache   available
Mem:           377G        289G         78G        460M        9.8G         86G
Swap:            0B          0B          0B

Should I run kraken2-build --build --db kraken_standard_db --threads 24, or start over with kraken2-build --standard --db NAME?

jenniferlu717 commented 4 years ago

I think you only need to run kraken2-build --build, but I'm not sure your system has enough memory, which might be causing the error. Can you try building with --max-db-size 30000000?
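One way to compare the numbers involved is to pull the estimate out of a saved build log and set it against the memory currently available. This is a rough sketch, assuming bash and a Linux /proc/meminfo; both helper names are hypothetical. In this thread the estimate (48971338604 bytes) is below the 86G that free reports as available, so passing such a check does not rule out a memory limit, for example a per-process cap on a shared node (an assumption, not something the logs confirm).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read a kraken2-build log on stdin and print the
# "Estimated hash table requirement" value in bytes.
estimated_hash_bytes() {
    awk '/Estimated hash table requirement:/ { print $(NF-1); exit }'
}

# Hypothetical helper: succeed if the required bytes fit into the available bytes.
fits_in_memory() {
    local need="$1" avail="$2"
    [ "$need" -le "$avail" ]
}

# Usage (illustrative, Linux only):
#   need=$(estimated_hash_bytes < build.log)
#   avail=$(awk '/MemAvailable:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo)
#   fits_in_memory "$need" "$avail" || echo "estimate exceeds available memory"
```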

mutantjoo0 commented 4 years ago

Hi Jennifer,

kraken2-build --build --db NAME --threads 24 failed, as shown below. I am now running it again with --max-db-size 30000000 and will post here once I have results.

(bracken) -bash-4.2$ kraken2-build --build --db kraken_standard_db_failed/ --threads 24
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 48971338604 bytes
Capacity estimation complete. [10m43.179s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Processed 2978 sequences (7343292609 bp)...xargs: cat: terminated by signal 13
/mnt/home/leejooy5/miniconda3/envs/bracken/libexec/build_kraken2_db.sh: line 133: 40223 Done                    list_sequence_files
     40224 Exit 125                | xargs -0 cat
     40225 Killed                  | build_db -k $KRAKEN2_KMER_LEN -l $KRAKEN2_MINIMIZER_LEN -S $KRAKEN2_SEED_TEMPLATE $KRAKEN2XFLAG -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy/ -m $seqid2taxid_map_file -c $required_capacity -p $KRAKEN2_THREAD_CT $max_db_flag

(bracken) -bash-4.2$ ls kraken_standard_db_failed/
library  seqid2taxid.map  taxo.k2d.tmp  taxonomy

mutantjoo0 commented 4 years ago

Hi Jennifer,

Thanks to your support, I was able to complete the Kraken database construction.

(bracken) -bash-4.2$ kraken2-build --build --db kraken_standard_db_failed/ --max-db-size 30000000
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 48971338604 bytes
Specifying lower maximum hash table size of 30000000 bytes
Capacity estimation complete. [44m53.800s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 15 bits reserved for taxid.
Completed processing of 63271 sequences, 89655529216 bp
Writing data to disk...  complete.
Database files completed. [48m31.302s]
Database construction complete. [Total: 1h33m25.172s]

(bracken) -bash-4.2$ du -sh kraken_standard_db_failed/
(bracken) -bash-4.2$ ls -lht *standard*
kraken_standard_db:
total 36M
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab   56 Aug 13 16:58 opts.k2d
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab  29M Aug 13 16:58 hash.k2d
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 2.4M Aug 13 16:09 taxo.k2d
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 4.2M Aug 12 04:36 seqid2taxid.map
drwxr-s--- 2 leejooy5 Reguera_Kashefi_Lab 8.0K Aug 12 04:36 taxonomy
drwxr-s--- 7 leejooy5 Reguera_Kashefi_Lab 8.0K Aug 12 04:36 library
(bracken) -bash-4.2$ mv kraken_standard_db_failed/ kraken_standard_db/
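Note that the log above reads the cap as bytes ("Specifying lower maximum hash table size of 30000000 bytes"), which matches the 29M hash.k2d in the listing. A short sketch of the arithmetic, assuming awk is available; cap_percent is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: express a --max-db-size cap as a percentage of the
# estimated hash table requirement (both arguments in bytes).
cap_percent() {
    awk -v cap="$1" -v est="$2" 'BEGIN { printf "%.4f\n", 100 * cap / est }'
}

# Values taken from the build log above:
#   cap_percent 30000000 48971338604
```

With the values from this log the cap is well under 0.1% of the estimated requirement. As far as I understand the Kraken 2 manual, a capped build keeps only a subsample of minimizers, so a database this small may classify less sensitively than a full-size build.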

I found that kraken2-build --clean --db NAME removes the library directory, which bracken-build requires. Fortunately, I had made a backup of the database, so I have started running bracken-build. I will post a summary of the results and the time bracken-build takes. Thanks again for your great support.
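A minimal sketch of the backup step described above, assuming bash and enough free disk space for a second copy of the database; backup_db and the .bak suffix are hypothetical choices. Simply postponing --clean until bracken-build has finished avoids the copy altogether.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a database directory to "<db>.bak" before any
# cleanup, refusing to overwrite an existing backup.
backup_db() {
    local db="${1:-}"
    db="${db%/}"
    local backup="${db}.bak"
    if [ ! -d "$db" ]; then
        echo "error: $db is not a directory" >&2
        return 1
    fi
    if [ -e "$backup" ]; then
        echo "error: $backup already exists" >&2
        return 1
    fi
    # -R copies recursively, -p preserves timestamps and permissions.
    cp -Rp "$db" "$backup"
}

# Usage (illustrative):
#   backup_db kraken_standard_db && kraken2-build --clean --db kraken_standard_db
```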

Stay safe and healthy, Joo-Young

mutantjoo0 commented 4 years ago

Running bracken-build took 10 min and it was successful.

(bracken) -bash-4.2$ bracken-build -d standard_db -t 16
>> Selected Options:
       kmer length = 35
       read length = 100
       database    = standard_db
       threads     = 16
 >> Checking for Valid Options...
 >> Creating database.kraken [if not found]
      >> kraken2 --db standard_db --threads 16 <( find -L standard_db/library \( -name *.fna -o -name *.fa -o -name *.fasta \) -exec cat {} + ) > standard_db/database.kraken
Loading database information... done.
63273 sequences (89655.55 Mbp) processed in 284.993s (13.3 Kseq/m, 18875.35 Mbp/m).
  46160 sequences classified (72.95%)
  17113 sequences unclassified (27.05%)
          Finished creating database.kraken [in DB folder]
 >> Creating database100mers.kmer_distrib
        >>STEP 0: PARSING COMMAND LINE ARGUMENTS
                Taxonomy nodes file: standard_db/taxonomy/nodes.dmp
                Seqid file:          standard_db/seqid2taxid.map
                Num Threads:         16
                Kmer Length:         35
                Read Length:         100
        >>STEP 1: READING SEQID2TAXID MAP
                109763 total sequences read
        >>STEP 2: READING NODES.DMP FILE
                2266594 total nodes read
        >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
                100mers, with a database built using 35mers
                63290 sequences converted (finished: kraken:taxid|1269028|NC_020104.1)1)1.1)25)59)))49)))
        Time Elaped: 2 minutes, 34 seconds, 0.00000 microseconds
        =============================
PROGRAM START TIME: 08-13-2020 21:40:21
...19879 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 08-13-2020 21:40:22
          Finished creating database100mers.kraken and database100mers.kmer_distrib [in DB folder]
          *NOTE: to create read distribution files for multiple read lengths,
                 rerun this script specifying the same database but a different read length

Bracken build complete.

(bracken) -bash-4.2$ ls -lht standard_db/
total 738M
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 2.2M Aug 13 17:40 database100mers.kmer_distrib
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 5.3M Aug 13 17:40 database100mers.kraken
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 695M Aug 13 17:37 database.kraken
drwxr-s--- 2 leejooy5 Reguera_Kashefi_Lab 8.0K Aug 13 17:15 taxonomy
drwxr-s--- 7 leejooy5 Reguera_Kashefi_Lab 8.0K Aug 13 17:15 library
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab  29M Aug 13 17:14 hash.k2d
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 2.4M Aug 13 17:14 taxo.k2d
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab 4.2M Aug 13 17:14 seqid2taxid.map
-rw-r----- 1 leejooy5 Reguera_Kashefi_Lab   56 Aug 13 17:14 opts.k2d

Thanks!