ValueError: max() arg is an empty sequence

rjsorr commented 2 years ago

Having multiple databases that won't build, get the same error for all.

cd /media/ubuntu/Elements/reference_genomes/ganon conda activate python3.7_environment rm -r EUK_refseq_ALL_db mkdir EUK_refseq_ALL_db cd EUK_refseq_ALL_db ganon build -t 30 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz > out.log 2>&1 & cd ..

(_	(_			(_)		_	v. 1.1.0

1346 file(s) [genomic.fna.gz] found in /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files

Downloading taxdump

done in 7.02s.

Unpacking taxdump

done in 3.97s.

Parsing taxonomy

done in 5.64s.

Extracting sequence identifiers

13354133 unique sequence headers successfully retrieved from 1346 input file(s)
done in 24040.90s.

Extracting sequence lengths

13354133 sequences lenghts successfully retrieved
done in 15867.69s.

Downloading nucl_gb.accession2taxid.gz

done in 315.38s.

Downloading nucl_wgs.accession2taxid.gz

done in 326.77s.

Parsing accession2taxid files

13051429 entries found in the nucl_gb.accession2taxid.gz file
0 entries found in the nucl_wgs.accession2taxid.gz file
done in 17115.37s.

Build: adding 13354133 sequences

Simulating parameters Traceback (most recent call last): File "/home/ubuntu/miniconda2/envs/python3.7_environment/bin/ganon", line 33, in sys.exit(load_entry_point('ganon==1.1.0', 'console_scripts', 'ganon')()) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 48, in main_cli sys.exit(0 if main() else 1) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 34, in main ret=build(cfg) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 70, in build bin_length = estimate_bin_length(cfg, seqinfo, tax) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 625, in estimate_bin_length max_bin_len = max(groups_len.values()) ValueError: max() arg is an empty sequence

pirovc commented 2 years ago

Hi @rjsorr, this is a bit harder to debug since I don't have your sequences or identifiers. I suggest you to run the build again with --write-seq-info-file. That way you can send me that file to debug the issue. In the next run you could use this same file (with --seq-info-file) to skip the most time consuming steps of the build (downloading and extracting identifiers which took some hours).

rjsorr commented 2 years ago

https://we.tl/t-5XmAbMSFdQ

too large to attach :)

pirovc commented 2 years ago

I managed to replicate and find the bug and it will be fixed in the next release. The file you sent me is actually broken due to the bug and should not be used again as I mentioned before. However, you can still use the current version of ganon to build your databases, adding --seq-info-mode nucl_gb to your command:

ganon build -t 30 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz --write-seq-info-file --seq-info-mode nucl_gb > out.log 2>&1 &

rjsorr commented 2 years ago

cheers! will give this a try. btw great if you can look at #185 and answer how this is done when providing more than a single database?

rjsorr commented 2 years ago

gave the above suggestion a try and got the same error. Will wait for the update. regards

pirovc commented 2 years ago

Thanks again for reporting! Indeed there was a following bug. Everything was fixed on version 1.1.2 #195. It should be up later today on bioconda. Feel free to re-open this issue if the problem persists.

rjsorr commented 2 years ago

worked on a small dataset. failed on a larger @pirovc :

cd /media/ubuntu/Elements/reference_genomes/ganon conda activate python3.7_environment rm -r EUK_refseq_ALL_db mkdir EUK_refseq_ALL_db cd EUK_refseq_ALL_db ganon build -t 20 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz --write-seq-info-file > out.log 2>&1 & cd ..

(_	(_			(_)		_	v. 1.1.2

1346 file(s) [genomic.fna.gz] found in /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files

Downloading taxdump

done in 7.11s.

Unpacking taxdump

done in 3.76s.

Parsing taxonomy

done in 5.51s.

Extracting sequence identifiers

13354133 unique sequence headers successfully retrieved from 1346 input file(s)
done in 24974.50s.

Extracting sequence lengths

13354133 sequences lenghts successfully retrieved
done in 16707.05s.

Downloading nucl_gb.accession2taxid.gz

done in 315.68s.

Downloading nucl_wgs.accession2taxid.gz

done in 327.62s.

Parsing accession2taxid files

13051429 entries found in the nucl_gb.accession2taxid.gz file
0 entries found in the nucl_wgs.accession2taxid.gz file
could not retrieve taxid for 302704 accessions
done in 18028.29s.

Build: adding 13051429 sequences

Simulating parameters Traceback (most recent call last): File "/home/ubuntu/miniconda2/envs/python3.7_environment/bin/ganon", line 33, in sys.exit(load_entry_point('ganon==1.1.2', 'console_scripts', 'ganon')()) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 48, in main_cli sys.exit(0 if main() else 1) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/ganon.py", line 34, in main ret=build(cfg) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 70, in build bin_length = estimate_bin_length(cfg, seqinfo, tax) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 648, in estimate_bin_length params = estimate_params(cfg, simulated_bin_lens, groups_len) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 603, in estimate_params n_bins = optimal_bins(estimate_n_bins(bin_length, cfg.overlap_length, groups_len)) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 580, in estimate_n_bins n_bins = sum([math.ceil(math.ceil(l/(frag_len-overlap_len))/(bin_len/(frag_len+overlap_len))) for l in groups_len.values()]) File "/home/ubuntu/miniconda2/envs/python3.7_environment/lib/python3.7/site-packages/ganon/build_update.py", line 580, in n_bins = sum([math.ceil(math.ceil(l/(frag_len-overlap_len))/(bin_len/(frag_len+overlap_len))) for l in groups_len.values()]) ZeroDivisionError: division by zero

pirovc commented 2 years ago

Hi @rjsorr. I think now you may have some empty sequences in your files causing this. Was the EUK_refseq_ALL_db.seqinfo.txt generated? Does it have 3 columns? if yes could you please uploaded it to me again?

rjsorr commented 2 years ago

generated and 3 columns, yes. Attached: https://we.tl/t-Ft0yGhiXOZ

pirovc commented 2 years ago

Thanks again for the file! your sequences are all okay, it's another bug. There's already a fix for the next release but you can still use the v1.1.2 and skip it by using: --overlap-length 301. You can also re-use the seqinfo file to speed up the building, the command should look something like:

ganon build -t 20 --db-prefix EUK_refseq_ALL_db --input-directory /media/ubuntu/Elements/reference_genomes/genome_updater/euk_refseq_ALL/24_12_2021/files --input-extension genomic.fna.gz --seq-info-file EUK_refseq_ALL_db.seqinfo.txt --overlap-length 301 > out.log 2>&1

FYI this will generate a very very big index file (>800GB). Maybe you should consider using minimizers. Let me know if you need any help with that.

rjsorr commented 2 years ago

@pirovc thanks again! decided against this database as I think fungi is enough for my purposes. That said, I'm using Archaea, Bacteria, Virus, and Fungi All from refseq as databases with 300 input files of 3gb data each, so quite large. So minimizers does seem like a good suggestion to reduce memory usage and speed up the classifying. As such, I wondering what you suggest as a "good" value for --window-size and --offset to reduce memory usage and increas speed with as minimal impact as possible on sensitivity and precision? My databases were built with the default --kmer-size setting, but I can rebuild with another if needs be?

regards

pirovc commented 2 years ago

I didn't do an extensive evaluation yet but I've been using --window-size 32 with the default --kmer-size 19 and I'm getting good results with minimal sensitivity loss, tested mainly with bacterial genomes. Just keep in mind that muiltiple databses have to be build with same parameters to be used at the same time.

The offset will not reduce any memory and cannot be used at the same time with the window size. If you prefer this option, you'll get some speed-up on the classification with your current set-up. Here I'd suggest to use --offset 2 or 3.

rjsorr commented 2 years ago

@pirovc cheers. So just to clarify if I "build" all my databases again with --window-size 32, as you suggest, I cannot then change the "classify" --offset from its default value? If so, I guess you would suggest the --window-size as the better parameter to change, if I had to choose?

pirovc commented 2 years ago

Exactly, either one or the other and I would go with the --window-size because it reduces memory usage and speeds up classification. Notice that minimizers were only introduced in version 1.1.0 as an optional parameter. In the future releases it will be the default option and it will have better support, so it makes sense to go with it.

rjsorr commented 2 years ago

Again thank you for the answer @pirovc, I'll rebuild the databases as suggested. BTW, do you have any recommendations for changing the "classify" parameters to increase % classified reads with minimal effect on sensitivty? At present I get between 20-60% classified with default "classify" parameters.

pirovc commented 2 years ago

This will depend on your data and similarity to the database. Take a look here to understand and choose the parameters for classify. If you have any further doubt, please open a new issue so we don't mix subjects here. Cheers.

pirovc / ganon

ValueError: max() arg is an empty sequence #192