Open tnmquann opened 3 weeks ago
Thanks @tnmquann for this very detailed issue! Looking into it now.
First question, which I think is unrelated to the problem you're experiencing, but I wanted to ask: you shouldn't need to unzip manysketch.zip into a directory. You should be able to run
# Solution 1 - OK
sourmash scripts fastmultigather --cores 20 /mnt/data/tnmquann/benchmarking/12_experiment/manysketch.zip /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip
directly, without the unzip and the use of SOURMASH-MANIFEST.csv. (It's actually kind of cool that running it on SOURMASH-MANIFEST.csv works, incidentally! But it should be unnecessary!)
OK, I can replicate the problem with fastmultigather on my laptop. Not sure why I wasn't running into it before...
In brief,
# within directory `rocks-index`:
sourmash scripts index fake-metag.sig.zip -o fake-metag.rocksdb
ls -1 ../2.fa.sig > query.txt
sourmash scripts fastmultigather query.txt fake-metag.rocksdb
# works fine
# go to another directory
mkdir ../rocks2
cd ../rocks2
ln -s ../rocks-index/fake-metag.rocksdb .
# fails:
sourmash scripts check fake-metag.rocksdb
# fails:
ls -1 ../2.fa.sig > query.txt
sourmash scripts fastmultigather query.txt fake-metag.rocksdb
Hi @ctb, thanks for your question :D. Actually, I decompress the manysketch.zip file for two main reasons:
ValueError: Expected exactly one signature with ksize 31 in /mnt/data/quantnm/benchmarking/12_experiment/sketches/manysketch.zip, found 81. Likely you will need to do something like: sourmash sig merge /mnt/data/quantnm/benchmarking/12_experiment/sketches/manysketch.zip -o <new signature with just one sketch in it>.
So I have another workaround for this problem: I decompress the output file from the manysketch module and try to reconstruct the .sig.zip files individually (it seems this tool uses the multisearch module to process each sample separately). The results show that this workaround works quite well :D. The only problem I'm facing is that this combination takes a lot of time, which is why I want to use the fastmultigather module with rocksdb to reduce data processing time.
I'd be happy to discuss further if you have any questions.
thanks! no, that all makes sense. And we should talk to the YACHT authors (with whom we are quite friendly ;)) about updating their code!
This was rapidly turning into a heisenbug for me, so I brute-forced it and wrote a script to explore -
tl;dr RocksDB indexes built from .zip files FAIL when referenced from other directories, while RocksDB indexes built from lists of files work fine!
(@bluegenes may owe me a drink because it was so hard to nail down this problem!)
#! /bin/bash
set -e
set -x
rm -fr foo1 foo2 foo3
mkdir foo1
cd foo1
ls -1 ../{1,2,3,4,5,6,7,8,9}.fa.sig > list.txt
sourmash sig cat ../{1,2,3,4,5,6,7,8,9}.fa.sig -k 31 -o list.sig.zip
sourmash sig merge -k 31 ../{1,2,3}.fa.sig -o fake-metag.sig.gz
sourmash scripts index list.txt -o foo-from-list.db
sourmash scripts index list.sig.zip -o foo-from-zip.db
sourmash scripts check foo-from-list.db
sourmash scripts check foo-from-zip.db
sourmash scripts fastmultigather fake-metag.sig.gz foo-from-list.db -o out.csv
sourmash scripts fastmultigather fake-metag.sig.gz foo-from-zip.db -o out.csv
###
cd ../
mkdir foo2
cd foo2
cp ../foo1/fake-metag.sig.gz .
sourmash scripts check ../foo1/foo-from-list.db
sourmash scripts check ../foo1/foo-from-zip.db ## this fails!
A more succinct version
#! /bin/bash
set -e
set -x
rm -fr foo5 list.txt list.sig.zip
ls -1 {1,2,3,4,5,6,7,8,9}.fa.sig > list.txt
sourmash sig cat {1,2,3,4,5,6,7,8,9}.fa.sig -k 31 -o list.sig.zip
mkdir foo5
sourmash scripts index list.txt -o foo5/foo-from-list.db
sourmash scripts index list.sig.zip -o foo5/foo-from-zip.db
sourmash scripts check foo5/foo-from-list.db
cd foo5
sourmash scripts check foo-from-list.db
cd ../
sourmash scripts check foo5/foo-from-zip.db
cd foo5
sourmash scripts check foo-from-zip.db # this breaks
OK, it looks like by default the rocksdb does not store the sketches internally, and what is happening is that the path to the zip file containing sketches is being interpreted problematically. 🤿 time.
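To illustrate the kind of failure described here, the following is a minimal sketch that does not involve sourmash at all. It only shows how a relative path recorded in one directory stops resolving when it is read from another. The names `sketches.zip` and `recorded-path.txt` are hypothetical stand-ins for the zip of sketches and for whatever path the index remembers; how the RocksDB index actually stores that path is an assumption, not something confirmed in this thread.

```shell
#!/bin/bash
# Illustration only: no sourmash involved. "sketches.zip" and
# "recorded-path.txt" are hypothetical stand-ins.
set -e

workdir=$(mktemp -d)
mkdir "$workdir/build" "$workdir/elsewhere"

cd "$workdir/build"
touch sketches.zip
# Record the path exactly as it was given (relative to the build directory).
echo "sketches.zip" > recorded-path.txt

# From the build directory, the recorded path resolves.
if [ -e "$(cat recorded-path.txt)" ]; then from_build=found; else from_build=missing; fi

# From a different directory, the same string points at nothing.
cd "$workdir/elsewhere"
if [ -e "$(cat ../build/recorded-path.txt)" ]; then from_elsewhere=found; else from_elsewhere=missing; fi

# An absolute path resolves regardless of the current directory.
if [ -e "$workdir/build/sketches.zip" ]; then absolute=found; else absolute=missing; fi

echo "build: $from_build, elsewhere: $from_elsewhere, absolute: $absolute"
```

This matches the pattern in the scripts above, where `check` succeeds from the directory the zip-based index was built in and fails after a `cd`.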
Hi @ctb, currently I'm using these commands:
Prepare data
Solution 3: fastmultigather with rocksdb
Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against '/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)
cd /mnt/data/tnmquann/benchmarking/12_experiment
cp -r /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv
Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against '/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)
Try to re-check the copied rocksdb
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb
Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Error: No such file or directory (os error 2)
cd /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
Symlink
ln -s /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb .
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv
Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against 'gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)
Try to re-check the copied rocksdb
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb
Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Error: No such file or directory (os error 2)
cd /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207
cp gtdb-rs207.genomic-reps.dna.k31.zip /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
Index database
sourmash scripts index gtdb-rs207.genomic-reps.dna.k31.zip -o gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 30
Check indexed database
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb
Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Starting check
Finished check
...index is ok!
Re-run fastmultigather
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv
Output is OK
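The last run succeeded with an index that was built and used in the same directory. As a rough sketch of keeping that arrangement from a script, the hypothetical helper `run_in_dir` below runs any command from inside a given directory without changing the caller's working directory. It is demonstrated with `pwd` rather than sourmash so that it is self-contained; whether this is a suitable workaround for a particular index is an assumption based only on the runs reported in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of sourmash): run a command from inside a
# given directory. The function body is a subshell, so the 'cd' does not
# leak into the calling shell.
run_in_dir() (
    cd "$1" || exit 1
    shift
    "$@"
)

# Self-contained demonstration using 'pwd' instead of sourmash.
demo_dir=$(mktemp -d)
before=$PWD
inside=$(run_in_dir "$demo_dir" pwd -P)
after=$PWD
echo "command ran in: $inside"
echo "caller stayed in: $after"
```

With sourmash, the call would take the form `run_in_dir <directory where the index was built> sourmash scripts check <index name>`. Any query or output paths passed this way would need to be absolute, since relative ones would be resolved inside that directory.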