sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0

Error when using fastmultigather against rocksdb (Error: No such file or directory (os error 2) - Tested with multiple cases) #381

Open tnmquann opened 3 weeks ago

tnmquann commented 3 weeks ago

Hi @ctb, I'm currently using these commands:

Prepare data

cd /mnt/data/tnmquann/benchmarking/12_experiment
# Step 1: sourmash manysketch
sourmash scripts manysketch manysketch.csv -o manysketch.zip -c 20 -p k=31,scaled=1000,abund

# Step 2: unzip the manysketch.zip (Notes: I used this folder for all the commands below)
unzip manysketch.zip -d manysketch

# Additional: index gtdb-rs207.genomic-reps.dna.k31.zip
cd /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207
sourmash scripts index gtdb-rs207.genomic-reps.dna.k31.zip -o gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 30

# Check indexed database
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb

# Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Starting check
Finished check
...index is ok!

I tried many different solutions and got the following results

Solution 1 & 2: Work perfectly

cd /mnt/data/tnmquann/benchmarking/12_experiment
# Solution 1 - OK
sourmash scripts fastmultigather --cores 20 /mnt/data/tnmquann/benchmarking/12_experiment/manysketch/SOURMASH-MANIFEST.csv /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip
# Solution 2 - OK (use loop + parallel package to run this script for each sample)
# Recreate *.sig.zip for each samples, then use fastgather
sourmash scripts fastgather /mnt/data/tnmquann/benchmarking/12_experiment/zip/trimmed-SRR17380114.sig.zip /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip -c 20 -o trimmed-SRR17380114.csv

Both methods above work perfectly. Only Solution 3 below (fastmultigather with rocksdb) fails.
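The per-sample loop mentioned for Solution 2 is not shown above. A minimal sketch of how it might look, assuming one `<sample>.sig.zip` per sample in a single directory and paths without spaces. The helper name `build_fastgather_cmds` is hypothetical; the `fastgather` arguments are copied from the Solution 2 command. It only prints the commands, so they can be inspected before being run.

```shell
#!/bin/bash
# Hypothetical helper: print one fastgather command per *.sig.zip in a directory.
# Nothing is executed here; pipe the output to `sh` (or GNU parallel) to run it.
# The function body is a subshell, so its variables do not leak into the caller.
build_fastgather_cmds() (
    sig_dir="$1"
    db="$2"
    cores="${3:-20}"
    for sig in "$sig_dir"/*.sig.zip; do
        [ -e "$sig" ] || continue           # skip the literal pattern if nothing matches
        name="$(basename "$sig" .sig.zip)"  # e.g. trimmed-SRR17380114
        printf 'sourmash scripts fastgather %s %s -c %s -o %s.csv\n' \
            "$sig" "$db" "$cores" "$name"
    done
)
```

For example, `build_fastgather_cmds zip /path/to/database.zip | sh` would run the samples one after another, each writing its own CSV.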

Solution 3: fastmultigather with rocksdb

Currently, this only works when the database is indexed directly in the processing folder.

Solution 3.1: Use the path to the indexed database

cd /mnt/data/tnmquann/benchmarking/12_experiment
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

Output

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against '/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)

Solution 3.2: Copy indexed database into the processing folder and then run the commands

cd /mnt/data/tnmquann/benchmarking/12_experiment
cp -r /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

Output

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against '/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)

Try to re-check the copied rocksdb

sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb

Output

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Error: No such file or directory (os error 2)

Solution 3.3: Based on @ctb's suggestion

cd /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
# Symlink
ln -s /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb .
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

Output

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against 'gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)

Try to re-check the symlinked rocksdb

sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb

Output

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Error: No such file or directory (os error 2)

Solution 3.4: Based on @bluegenes's suggestion

cd /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207
cp gtdb-rs207.genomic-reps.dna.k31.zip /mnt/data/tnmquann/benchmarking/12_experiment/manysketch

# Index database
sourmash scripts index gtdb-rs207.genomic-reps.dna.k31.zip -o gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 30

# Check indexed database
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb

Output

== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Starting check
Finished check
...index is ok!

Re-run fastmultigather

sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

Output is OK


I think there's a problem with the RocksDB folder configuration when running the index command.

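Before suspecting the index contents, it may help to rule out the simplest cause of `os error 2`: a database path (or symlink target) that does not resolve from the current directory. A small sketch using only standard shell tools; `describe_db_path` is a hypothetical helper and not part of sourmash.

```shell
#!/bin/bash
# Hypothetical helper: report whether a database path resolves from the current directory.
# The function body is a subshell, so the `cd` below never affects the caller.
describe_db_path() (
    path="$1"
    if [ -L "$path" ] && [ ! -e "$path" ]; then
        echo "broken symlink: $path"
    elif [ -d "$path" ]; then
        # cd + pwd -P prints the physical directory, following any symlinks
        echo "directory: $(cd "$path" && pwd -P)"
    elif [ -e "$path" ]; then
        echo "file: $path"
    else
        echo "missing: $path"
    fi
)
```

If `describe_db_path gtdb-rs207.genomic-reps.dna.k31.rocksdb` reports a valid directory and the error persists, the missing file is presumably something the index refers to rather than the index directory itself.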
ctb commented 3 weeks ago

Thanks @tnmquann for this very detailed issue! Looking into it now.

First question, which is unrelated to the problem you're experiencing, I think, but I wanted to ask - you shouldn't need to unzip manysketch.zip into a directory. You should be able to run

# Solution 1 - OK
sourmash scripts fastmultigather --cores 20 /mnt/data/tnmquann/benchmarking/12_experiment/manysketch.zip /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip

directly, without the unzip and the use of SOURMASH-MANIFEST.csv. (It's actually kind of cool that running it on SOURMASH-MANIFEST.csv works, incidentally! But it should be unnecessary!)

ctb commented 3 weeks ago

OK, I can replicate the problem with fastmultigather on my laptop. Not sure why I wasn't running into it before...

In brief,

# within directory `rocks-index`:
sourmash scripts index fake-metag.sig.zip -o fake-metag.rocksdb
ls -1 ../2.fa.sig > query.txt
sourmash scripts fastmultigather query.txt fake-metag.rocksdb
# works fine

# go to another directory
mkdir ../rocks2
cd ../rocks2
ln -s ../rocks-index/fake-metag.rocksdb .

# fails:
sourmash scripts check fake-metag.rocksdb

# fails:
ls -1 ../2.fa.sig > query.txt
sourmash scripts fastmultigather query.txt fake-metag.rocksdb

tnmquann commented 3 weeks ago

Hi @ctb, thanks for your question :D. I decompress the manysketch.zip file for two main reasons:

  1. My earlier experience feeding the manysketch output directly into fastmultigather was not good (if I remember correctly, I ran into errors in v0.8.1, so I had to set this plugin aside temporarily). Note: I tried again with a newer version (v0.9.3+) and the error was fixed.
  2. I was benchmarking with YACHT for my BSc thesis when I discovered that it is built on the sourmash and sourmash_branchwater modules. I combined the results from YACHT and sourmash to minimize the possibility of false positives, and the results were impressive. However, when I passed manysketch.zip directly to YACHT, I got the following error:
    ValueError: Expected exactly one signature with ksize 31 in /mnt/data/quantnm/benchmarking/12_experiment/sketches/manysketch.zip, found 81. Likely you will need to do something like: sourmash sig merge /mnt/data/quantnm/benchmarking/12_experiment/sketches/manysketch.zip -o <new signature with just one sketch in it>.

    So I have a workaround for this problem: I decompress the output of the manysketch module and "try" to reconstruct the .sig.zip files individually (it seems this tool uses the multisearch module to run on each sample separately). This workaround works quite well :D. The only problem I'm facing is that this combination takes a lot of time (that's why I want to use the fastmultigather module with rocksdb to reduce processing time).

I'd be happy to discuss further if you have any questions.

ctb commented 3 weeks ago

thanks! no, that all makes sense. And we should talk to the YACHT authors (with whom we are quite friendly ;)) about updating their code!

ctb commented 2 weeks ago

This was rapidly turning into a heisenbug for me, so I brute-forced it and wrote a script to explore -

tl;dr RocksDB indexes built from .zip files FAIL when referenced from other directories, while RocksDB indexes built from lists of files work fine!

(@bluegenes may owe me a drink because it was so hard to nail down this problem!)

#! /bin/bash 
set -e
set -x

rm -fr foo1 foo2 foo3

mkdir foo1
cd foo1

ls -1 ../{1,2,3,4,5,6,7,8,9}.fa.sig > list.txt
sourmash sig cat ../{1,2,3,4,5,6,7,8,9}.fa.sig -k 31 -o list.sig.zip
sourmash sig merge -k 31 ../{1,2,3}.fa.sig -o fake-metag.sig.gz

sourmash scripts index list.txt -o foo-from-list.db
sourmash scripts index list.sig.zip -o foo-from-zip.db

sourmash scripts check foo-from-list.db
sourmash scripts check foo-from-zip.db

sourmash scripts fastmultigather fake-metag.sig.gz foo-from-list.db -o out.csv
sourmash scripts fastmultigather fake-metag.sig.gz foo-from-zip.db -o out.csv

### 

cd ../
mkdir foo2
cd foo2

cp ../foo1/fake-metag.sig.gz .
sourmash scripts check ../foo1/foo-from-list.db
sourmash scripts check ../foo1/foo-from-zip.db ## this fails!

ctb commented 2 weeks ago

A more succinct version

#! /bin/bash 
set -e
set -x

rm -fr foo5 list.txt list.sig.zip

ls -1 {1,2,3,4,5,6,7,8,9}.fa.sig > list.txt
sourmash sig cat {1,2,3,4,5,6,7,8,9}.fa.sig -k 31 -o list.sig.zip

mkdir foo5

sourmash scripts index list.txt -o foo5/foo-from-list.db
sourmash scripts index list.sig.zip -o foo5/foo-from-zip.db

sourmash scripts check foo5/foo-from-list.db
cd foo5
sourmash scripts check foo-from-list.db

cd ../
sourmash scripts check foo5/foo-from-zip.db
cd foo5
sourmash scripts check foo-from-zip.db # this breaks

ctb commented 2 weeks ago

OK, it looks like by default the rocksdb does not store the sketches internally, and what is happening is that the path to the zip file containing sketches is being interpreted problematically. 🤿 time.
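
The general failure mode can be reproduced without sourmash. The sketch below is only an analogy and not the plugin's actual code: a plain text file `index/source.txt` (a hypothetical stand-in for the index metadata) records the path to `sketches.zip` exactly as it was given at build time, and a later lookup resolves that path against whatever the current directory happens to be.

```shell
#!/bin/bash
# Analogy only: a recorded relative path resolves against the *current* directory,
# not against the directory where it was recorded.
demo_relative_path() (
    work="$(mktemp -d)" || exit 1
    cd "$work" || exit 1
    mkdir index
    : > sketches.zip                        # empty placeholder for the sketch collection
    echo "sketches.zip" > index/source.txt  # path stored as given at "index" time

    # Lookup from the build directory: the stored path resolves.
    [ -e "$(cat index/source.txt)" ] && echo "from build dir: found"

    # Lookup from another directory: the same stored path no longer resolves.
    cd index || exit 1
    [ -e "$(cat source.txt)" ] || echo "from other dir: No such file or directory"

    cd / && rm -rf "$work"                  # clean up the scratch directory
)
demo_relative_path
```

In this analogy, recording an absolute path, or a path relative to the index directory itself, would make the lookup independent of the working directory. Whether that is the right fix for the plugin is not established in this thread.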