Closed brooksph closed 7 years ago
hmm, I thought I'd calculated the k-31 signatures using --with-cardinality. Can you extract the leaf node that is causing the problem and look at its YAML to see if the cardinality is in there? Check with @luizirber on the simplest way to extract the YAML filename.
bump.
@ctb I think this is a bug in sbt_search unless sbt_search and sbt_gather use scaled signatures in a different manner. I'm using the same sbt with sbt_search and sbt_gather. When I calculate a signature with --scaled 10000 and use sbt_search I get an error but sbt_gather returns the best match as expected. In contrast, if I don't calculate the signature with scaled sbt_search returns several matches. Same behavior with the -k 21, 31, and 51 sbt's. I will poke @luizirber about this.
sounds reasonable. can you build a small test case (i.e. using only a few .sig files and putting them in an SBT) that generates the error?
I bet if you calculate the signatures with --with-cardinality
and then build an SBT with them, sbt_search
with a --scaled
signature will fail and sbt_gather
will succeed.
@ctb worked just like that. Also, gather doesn't work when the query signatures are calculated with cardinality. Notebook https://github.com/brooksph/2017_sbt_search_project/blob/master/dev/Untitled.ipynb and signatures https://github.com/brooksph/2017_sbt_search_project/tree/master/dev
Hi @brooksph could you check out #183 and see if it works on the full microbes SBT? thx
I'm getting a similar error. How can I fix this?
(metagenomics_env) -bash-4.1$ rm scaffolds.drop_bacteria.fasta.sig; sourmash compute --scaled 2000 --dna --seed 0 scaffolds.drop_bacteria.fasta
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
setting num_hashes to 0 because --scaled is set
computing signatures for files: scaffolds.drop_bacteria.fasta
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
... reading sequences from scaffolds.drop_bacteria.fasta
calculated 3 signatures for 738 sequences in scaffolds.drop_bacteria.fasta
saved 3 signature(s). Note: signature license is CC0.
(metagenomics_env) -bash-4.1$ sourmash search -k 21 -o sourmash_output scaffolds.drop_bacteria.fasta.sig /usr/local/scratch/METAGENOMICS/jespinoz/db/sourmash_db/genbank-d2-k21.sbt.json --threshold 0
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loaded query: scaffolds.drop_bacteria.fasta... (k=21, DNA)
loaded 1 databases.
Traceback (most recent call last):
File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/bin/sourmash", line 11, in <module>
sys.exit(main())
File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/__main__.py", line 83, in main
cmd(sys.argv[2:])
File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/commands.py", line 857, in search
args.best_only, args.ignore_abundance)
File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/search.py", line 62, in search_databases
for leaf in tree.find(search_fn, tree_query, threshold):
File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/sbt.py", line 213, in find
if search_fn(node_g, *args):
File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/sbtmh.py", line 119, in search_minhashes
score = node.data.minhash.similarity(sig.minhash)
File "sourmash/_minhash.pyx", line 359, in sourmash._minhash.MinHash.similarity
File "sourmash/_minhash.pyx", line 338, in sourmash._minhash.MinHash.jaccard
File "sourmash/_minhash.pyx", line 333, in sourmash._minhash.MinHash.compare
File "sourmash/_minhash.pyx", line 321, in sourmash._minhash.MinHash.intersection
ValueError: mismatch in seed; comparison fail
Hi Josh,
thank you for such a complete error report!!
The '--seed' argument for compute needs to match the seed used to construct the signatures in the database - that's 42, per mash default.
There's no reason to set the seed to anything in normal operations, note.
best, --titus
When using sbt_search with microbes.k31.sbt.json sourmash spits out an error. I calculated a signature for a single genome using sourmash compute -k 31 --scaled 10000 *fna.gz. The output of this works fine with sbt_gather but fails with sbt_search.
../dev/sourmash sbt_search -k 31 --dna microbes.k31.sbt.json ../fasta/herpetosiphon_geysericola.fa.gz.sig --threshold=0.001 --best-only