sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
474 stars 80 forks source link

sbt_search error: mismatch in max hash; comparison fail #146

Closed brooksph closed 7 years ago

brooksph commented 7 years ago

When using sbt_search with microbes.k31.sbt.json sourmash spits out an error. I calculated a signature for a single genome using sourmash compute -k 31 --scaled 10000 *fna.gz. The output of this works fine with sbt_gather but fails with sbt_search.

../dev/sourmash sbt_search -k 31 --dna microbes.k31.sbt.json ../fasta/herpetosiphon_geysericola.fa.gz.sig --threshold=0.001 --best-only

# running sourmash subcommand: sbt_search
loaded query: herpetosiphon_geysericola.fa.g... (k=31, DNA)
Traceback (most recent call last):
  File "/Users/philliptbrooks/dev/sourmash/sourmash", line 5, in <module>
    main()
  File "/Users/philliptbrooks/dev/sourmash/sourmash_lib/__main__.py", line 56, in main
    cmd(sys.argv[2:])
  File "/Users/philliptbrooks/dev/sourmash/sourmash_lib/commands.py", line 572, in sbt_search
    for leaf in tree.find(search_fn, query, args.threshold):
  File "/Users/philliptbrooks/dev/sourmash/sourmash_lib/sbt.py", line 155, in find
    if search_fn(node_g, *args):
  File "/Users/philliptbrooks/dev/sourmash/sourmash_lib/sbtmh.py", line 37, in search_minhashes
    matches = node.data.estimator.count_common(sig.estimator)
  File "/Users/philliptbrooks/dev/sourmash/sourmash_lib/__init__.py", line 179, in count_common
    return self.mh.count_common(other.mh)
  File "sourmash_lib/_minhash.pyx", line 100, in sourmash_lib._minhash.MinHash.count_common (sourmash_lib/_minhash.cpp:2698)
ValueError: mismatch in max_hash; comparison fail
ctb commented 7 years ago

hmm, I thought I'd calculated the k-31 signatures using --with-cardinality. Can you extract the leaf node that is causing the problem and look at its YAML to see if the cardinality is in there? Check with @luizirber on the simplest way to extract the YAML filename.

ctb commented 7 years ago

bump.

brooksph commented 7 years ago

@ctb I think this is a bug in sbt_search unless sbt_search and sbt_gather use scaled signatures in a different manner. I'm using the same sbt with sbt_search and sbt_gather. When I calculate a signature with --scaled 10000 and use sbt_search I get an error but sbt_gather returns the best match as expected. In contrast, if I don't calculate the signature with scaled sbt_search returns several matches. Same behavior with the -k 21, 31, and 51 sbt's. I will poke @luizirber about this.

ctb commented 7 years ago

sounds reasonable. can you build a small test case (i.e. using only a few .sig files and putting them in an SBT) that generates the error?

I bet if you calculate the signatures with --with-cardinality and then build an SBT with them, sbt_search with a --scaled signature will fail and sbt_gather will succeed.

brooksph commented 7 years ago

@ctb worked just like that. Also, gather doesn't work when the query signatures are calculated with cardinality. Notebook https://github.com/brooksph/2017_sbt_search_project/blob/master/dev/Untitled.ipynb and signatures https://github.com/brooksph/2017_sbt_search_project/tree/master/dev

ctb commented 7 years ago

Hi @brooksph could you check out #183 and see if it works on the full microbes SBT? thx

jolespin commented 5 years ago

I'm getting a similar error. How can I fix this?

(metagenomics_env) -bash-4.1$ rm scaffolds.drop_bacteria.fasta.sig; sourmash compute --scaled 2000 --dna --seed 0 scaffolds.drop_bacteria.fasta
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

setting num_hashes to 0 because --scaled is set
computing signatures for files: scaffolds.drop_bacteria.fasta
Computing signature for ksizes: [21, 31, 51]
Computing only nucleotide (and not protein) signatures.
Computing a total of 3 signature(s).
... reading sequences from scaffolds.drop_bacteria.fasta
calculated 3 signatures for 738 sequences in scaffolds.drop_bacteria.fasta
saved 3 signature(s). Note: signature license is CC0.
(metagenomics_env) -bash-4.1$ sourmash search -k 21 -o sourmash_output scaffolds.drop_bacteria.fasta.sig /usr/local/scratch/METAGENOMICS/jespinoz/db/sourmash_db/genbank-d2-k21.sbt.json --threshold 0
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded query: scaffolds.drop_bacteria.fasta... (k=21, DNA)
loaded 1 databases.

Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/__main__.py", line 83, in main
    cmd(sys.argv[2:])
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/commands.py", line 857, in search
    args.best_only, args.ignore_abundance)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/search.py", line 62, in search_databases
    for leaf in tree.find(search_fn, tree_query, threshold):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/sbt.py", line 213, in find
    if search_fn(node_g, *args):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/lib/python2.7/site-packages/sourmash/sbtmh.py", line 119, in search_minhashes
    score = node.data.minhash.similarity(sig.minhash)
  File "sourmash/_minhash.pyx", line 359, in sourmash._minhash.MinHash.similarity
  File "sourmash/_minhash.pyx", line 338, in sourmash._minhash.MinHash.jaccard
  File "sourmash/_minhash.pyx", line 333, in sourmash._minhash.MinHash.compare
  File "sourmash/_minhash.pyx", line 321, in sourmash._minhash.MinHash.intersection
ValueError: mismatch in seed; comparison fail
ctb commented 5 years ago

Hi Josh,

thank you for such a complete error report!!

The '--seed' argument for compute needs to match the seed used to construct the signatures in the database - that's 42, per mash default.

There's no reason to set the seed to anything in normal operations, note.

best, --titus