sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

ValueError: mismatch in scaled; comparison fail when using sourmash search on a signature with scaled=1000 and db with scaled=2000 #1804

Closed taylorreiter closed 2 years ago

taylorreiter commented 2 years ago

I'm confused, because I thought sourmash would automatically downsample a signature with a lower scaled value than the database or, if not automatically, at least downsample when I explicitly set --scaled.

command:

sourmash search --max-containment -o {output} --scaled 2000 -k 21 {input.sig} {input.db}

Output/error message:

== This is sourmash version 4.2.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=21
loaded query: SRX4624095... (k=21, DNA)
downsampling query from scaled=1000 to 2000
loading from inputs/sourmash_dbs/gtdb-rs202.genomic.k21.zip...
loaded 1 databases.

Traceback (most recent call last):
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/cli/search.py", line 103, in main
    return sourmash.commands.search(args)
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/commands.py", line 485, in search
    results = search_databases_with_abund_query(query, databases,
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/search.py", line 215, in search_databases_with_abund_query
    search_iter = db.search_abund(query, **kwargs)
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/index.py", line 202, in search_abund
    score = query.similarity(subj)
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/signature.py", line 136, in similarity
    return self.minhash.similarity(other.minhash,
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/minhash.py", line 663, in similarity
    return self._methodcall(lib.kmerminhash_similarity,
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/home/tereiter/github/2022-microberna/.snakemake/conda/9d444dee7b279e826dd8b6a0a96f0fd9/lib/python3.10/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
ValueError: mismatch in scaled; comparison fail

ctb commented 2 years ago

it should! smells like a bug.

what happens when you run

sourmash sig downsample --scaled 2000

on both the input.sig and the database?
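
For context, here is a rough sketch of what downsampling a scaled sketch amounts to conceptually, and why a comparison between mismatched scaled values has to fail unless one side is downsampled first. It is plain Python, not sourmash's implementation: `toy_hash`, `sketch`, `downsample`, and `jaccard` are hypothetical helpers, the hash function is a stand-in, and the cutoff of roughly 2**64 / scaled is an assumption about how scaled sketches choose which hashes to keep.

```python
import hashlib

HASH_SPACE = 2**64  # assumed size of the 64-bit hash space


def max_hash_for(scaled):
    """Cutoff below which hashes are kept (assumed to be about 2**64 / scaled)."""
    return HASH_SPACE // scaled


def toy_hash(kmer):
    """Stand-in 64-bit hash for illustration; not the hash sourmash uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(kmers, scaled):
    """Keep only the hashes that fall below the cutoff for this scaled value."""
    limit = max_hash_for(scaled)
    return {h for h in map(toy_hash, kmers) if h < limit}


def downsample(hashes, new_scaled):
    """Moving to a larger scaled value just drops hashes above the new cutoff."""
    limit = max_hash_for(new_scaled)
    return {h for h in hashes if h < limit}


def jaccard(a, scaled_a, b, scaled_b):
    """Refuse to compare sketches that sample different parts of hash space."""
    if scaled_a != scaled_b:
        raise ValueError("mismatch in scaled; comparison fail")
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Under these assumptions, downsampling a scaled=1000 sketch to 2000 gives the same set as sketching directly at 2000, which is why a search is expected to bring both sides to the larger scaled value before comparing.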

ctb commented 2 years ago

oh, you're using abundance queries. much less well tested.

taylorreiter commented 2 years ago

If I run sourmash sig downsample on input.sig, it downsamples to scaled 2000. Then, when I rerun the search command, I get the same error message.

I don't really want to run it on the database... it's a linked file of /group/ctbrowngrp/gtdb/databases/gtdb-rs202.genomic.k21.zip (started running it, but it got killed :D)

ctb commented 2 years ago

On Wed, Jan 19, 2022 at 09:49:03AM -0800, Taylor Reiter wrote:

> If I run sourmash sig downsample on input.sig, it downsamples to scaled 2000. Then, when I rerun the search command, I get the same error message.
>
> I don't really want to run it on the database... it's a linked file of /group/ctbrowngrp/gtdb/databases/gtdb-rs202.genomic.k21.zip (started running it, but it got killed :D)

kk. will look into it.

ctb commented 2 years ago

(if you wanted to debug in the meantime, you could use picklists to extract a few signatures from the database with sig cat, and then put them in a new db to generate an example set that should also fail)
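
A minimal sketch of that picklist workflow, assuming the `--picklist <csv>:<column>:<coltype>` form with the `ident` column type and that `sig cat -o` can write a `.zip` in this sourmash version. The helper names, file names, and identifiers are hypothetical placeholders; the script only writes the CSV and assembles the command, it does not run sourmash.

```python
import csv
import shlex


def write_picklist(path, idents):
    """Write a one-column CSV (header 'ident') listing the signatures to keep."""
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["ident"])
        for ident in idents:
            writer.writerow([ident])


def sig_cat_command(db, picklist_csv, out):
    """Build the argv for `sourmash sig cat` restricted by the picklist.

    The picklist argument is <csv>:<column name>:<column type>; 'ident' is
    assumed to match on the identifier at the start of each signature name.
    """
    return [
        "sourmash", "sig", "cat", db,
        "--picklist", f"{picklist_csv}:ident:ident",
        "-o", out,
    ]


def format_command(argv):
    """Render the argv as a shell-quoted string for copy and paste."""
    return shlex.join(argv)
```

For example, calling `write_picklist("picks.csv", [...])` with a handful of identifiers from the database and then running the string from `format_command(sig_cat_command("gtdb-rs202.genomic.k21.zip", "picks.csv", "subset.zip"))` should give a small `subset.zip` to rerun the failing search against.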

ctb commented 2 years ago

fixed in https://github.com/sourmash-bio/sourmash/pull/1820