Closed rjsorr closed 2 years ago
Hi @rjsorr, this is a bit harder to debug since I don't have your sequences or identifiers. I suggest you to run the build again with --write-seq-info-file
. That way you can send me that file to debug the issue. In the next run you could use this same file (with --seq-info-file
) to skip the most time consuming steps of the build (downloading and extracting identifiers which took some hours).
too large to attach :)
I managed to replicate and find the bug and it will be fixed in the next release. The file you sent me is actually broken due to the bug and should not be used again as I mentioned before. However, you can still use the current version of ganon to build your databases, adding --seq-info-mode nucl_gb
to your command:
ganon build -t 30 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz --write-seq-info-file --seq-info-mode nucl_gb > out.log 2>&1 &
cheers! will give this a try. btw great if you can look at #185 and answer how this is done when providing more than a single database?
gave the above suggestion a try and got the same error. Will wait for the update. regards
Thanks again for reporting! Indeed there was a following bug. Everything was fixed on version 1.1.2 #195. It should be up later today on bioconda. Feel free to re-open this issue if the problem persists.
worked on a small dataset. failed on a larger @pirovc :
cd /media/ubuntu/Elements/reference_genomes/ganon conda activate python3.7_environment rm -r EUK_refseq_ALL_db mkdir EUK_refseq_ALL_db cd EUK_refseq_ALL_db ganon build -t 20 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz --write-seq-info-file > out.log 2>&1 & cd ..
(_ | (_ | (_) | _ | v. 1.1.2 |
---|
1346 file(s) [genomic.fna.gz] found in /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files
Downloading taxdump
Unpacking taxdump
Parsing taxonomy
Extracting sequence identifiers
Extracting sequence lengths
Downloading nucl_gb.accession2taxid.gz
Downloading nucl_wgs.accession2taxid.gz
Parsing accession2taxid files
Build: adding 13051429 sequences
Simulating parameters
Traceback (most recent call last):
File "/home/ubuntu/miniconda2/envs/python3.7_environment/bin/ganon", line 33, in
Hi @rjsorr. I think now you may have some empty sequences in your files causing this. Was the EUK_refseq_ALL_db.seqinfo.txt
generated? Does it have 3 columns? if yes could you please uploaded it to me again?
generated and 3 columns, yes. Attached: https://we.tl/t-Ft0yGhiXOZ
Thanks again for the file! your sequences are all okay, it's another bug. There's already a fix for the next release but you can still use the v1.1.2 and skip it by using: --overlap-length 301
. You can also re-use the seqinfo file to speed up the building, the command should look something like:
ganon build -t 20 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz --seq-info-file EUK_refseq_ALL_db.seqinfo.txt --overlap-length 301 > out.log 2>&1
FYI this will generate a very very big index file (>800GB). Maybe you should consider using minimizers. Let me know if you need any help with that.
@pirovc thanks again! decided against this database as I think fungi is enough for my purposes. That said, I'm using Archaea, Bacteria, Virus, and Fungi All from refseq as databases with 300 input files of 3gb data each, so quite large. So minimizers does seem like a good suggestion to reduce memory usage and speed up the classifying. As such, I wondering what you suggest as a "good" value for --window-size and --offset to reduce memory usage and increas speed with as minimal impact as possible on sensitivity and precision? My databases were built with the default --kmer-size setting, but I can rebuild with another if needs be?
regards
I didn't do an extensive evaluation yet but I've been using --window-size 32
with the default --kmer-size 19
and I'm getting good results with minimal sensitivity loss, tested mainly with bacterial genomes. Just keep in mind that muiltiple databses have to be build with same parameters to be used at the same time.
The offset will not reduce any memory and cannot be used at the same time with the window size. If you prefer this option, you'll get some speed-up on the classification with your current set-up. Here I'd suggest to use --offset 2
or 3
.
@pirovc cheers. So just to clarify if I "build" all my databases again with --window-size 32, as you suggest, I cannot then change the "classify" --offset from its default value? If so, I guess you would suggest the --window-size as the better parameter to change, if I had to choose?
Exactly, either one or the other and I would go with the --window-size
because it reduces memory usage and speeds up classification. Notice that minimizers were only introduced in version 1.1.0 as an optional parameter. In the future releases it will be the default option and it will have better support, so it makes sense to go with it.
Again thank you for the answer @pirovc, I'll rebuild the databases as suggested. BTW, do you have any recommendations for changing the "classify" parameters to increase % classified reads with minimal effect on sensitivty? At present I get between 20-60% classified with default "classify" parameters.
Having multiple databases that won't build, get the same error for all.
cd /media/ubuntu/Elements/reference_genomes/ganon conda activate python3.7_environment rm -r EUK_refseq_ALL_db mkdir EUK_refseq_ALL_db cd EUK_refseq_ALL_db ganon build -t 30 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz > out.log 2>&1 & cd ..
1346 file(s) [genomic.fna.gz] found in /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files
Downloading taxdump
Unpacking taxdump
Parsing taxonomy
Extracting sequence identifiers
Extracting sequence lengths
Downloading nucl_gb.accession2taxid.gz
Downloading nucl_wgs.accession2taxid.gz
Parsing accession2taxid files
Build: adding 13354133 sequences
Simulating parameters Traceback (most recent call last): File "/home/ubuntu/miniconda2/envs/python3.7_environment/bin/ganon", line 33, in
sys.exit(load_entry_point('ganon==1.1.0', 'console_scripts', 'ganon')())
File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 48, in main_cli
sys.exit(0 if main() else 1)
File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 34, in main
ret=build(cfg)
File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 70, in build
bin_length = estimate_bin_length(cfg, seqinfo, tax)
File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 625, in estimate_bin_length
max_bin_len = max(groups_len.values())
ValueError: max() arg is an empty sequence