sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

`sourmash gather` raises `AssertionError` on minimal test dataset #2825

Open vmikk opened 10 months ago

vmikk commented 10 months ago

Dear sourmash team,

I've recently attempted to create a minimal test dataset and encountered an error when using sourmash gather. Specifically, it raises an AssertionError. I think that the error might be due to the presence of very few sketches in common between the sample and the database. The error appears even with --threshold-bp 0.

Expected Behavior: While this may be a marginal case, I believe the expected behavior in this scenario is for sourmash gather to return an empty CSV file with no results, and subsequently an exit code of 0.

Actual Behavior: I've got an AssertionError when running the command.
Sourmash fails after Prefetch and "Doing gather to generate minimum metagenome cover".

Additional Information:

Here is the full output log:

   == This is sourmash version 4.8.4. ==
   == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

   select query k=31 automatically.
   loaded query: ERR5766174_se... (k=31, DNA)
   loading from 'sourmash-db.zip'...
   --
   loaded 6 total signatures from 1 locations.
   after selecting signatures compatible with search, 6 remain.
   Starting prefetch sweep across databases.
   Prefetch found 6 signatures with overlap >= 0 bp.
   Doing gather to generate minimum metagenome cover.

   overlap     p_query p_match
   ---------   ------- -------
   25.5 kbp       0.1%    0.7%    CYXV01000001.1
   Traceback (most recent call last):
     File "/usr/local/bin/sourmash", line 11, in <module>
       sys.exit(main())
                ^^^^^^
     File "/usr/local/lib/python3.11/site-packages/sourmash/__main__.py", line 19, in main
       retval = mainmethod(args)
                ^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/sourmash/cli/gather.py", line 162, in main
       return sourmash.commands.gather(args)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/sourmash/commands.py", line 830, in gather
       for result in gather_iter:
     File "/usr/local/lib/python3.11/site-packages/sourmash/search.py", line 754, in __next__
       best_result, intersect_mh = _find_best(counters, query, threshold_bp)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/sourmash/search.py", line 640, in _find_best
       result = counter.peek(query.minhash, threshold_bp=threshold_bp)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/sourmash/index/__init__.py", line 818, in peek
       assert cont
   AssertionError

The command I executed was:

    sourmash gather \
      --threshold-bp 0 \
      --output ERR5766174_se.csv.gz \
      ERR5766174_se.sig \
      sourmash-db.zip

I've uploaded the test data here:

   curl -J -O "https://owncloud.ut.ee/owncloud/s/xHE76xLr9jrDeqX/download?path=%2F&files=ERR5766174_se.sig&downloadStartSecret=wxqvxr5285r"
   curl -J -O "https://owncloud.ut.ee/owncloud/s/xHE76xLr9jrDeqX/download?path=%2F&files=sourmash-db.zip&downloadStartSecret=pfgmp5y8i4a"

Thank you for providing such a valuable tool! With kind regards, Vladimir

PS. As this was a small test dataset, I used scaled=10 to generate sample signatures, and the database signatures were also created with a different scale setting.
Given that it works correctly with a larger database, the issue likely isn't related to the scaling difference.

ctb commented 10 months ago

Thanks! I see the same error with commit 720962cf304, the latest latest - the detailed test case really helps 😅 . More soon!

ctb commented 10 months ago

I'm getting some intriguing errors with

sourmash gather ERR5766174_se.sig sourmash-db.zip --threshold-bp=0 --save-prefetch-csv prefetch.csv

as well. It seems likely that there is a bug related to handling of many different scaled values -- here the query metagenome is at scaled of 10, while the database has three different scaled values of 1, 20, and 1600 (!?):

% sourmash sig summarize ERR5766174_se.sig

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'ERR5766174_se.sig'
path filetype: MultiIndex
location: ERR5766174_se.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 4526202
summary of sketches:
   1 sketches with DNA, k=31, scaled=10               4526202 total hashes
% sourmash sig summarize sourmash-db.zip

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'sourmash-db.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/failgather/sourmash-db.zip
is database? yes
has manifest? yes
num signatures: 6
** examining manifest...
total hashes: 768424
summary of sketches:
   4 sketches with DNA, k=31, scaled=20               735189 total hashes
   1 sketches with DNA, k=31, scaled=1                16508 total hashes
   1 sketches with DNA, k=31, scaled=1600             16727 total hashes
vmikk commented 10 months ago

Yes, scaling is a bit weird in this case :sweat_smile:. It was done this way because the test samples vary in size, and with uniform scaling, one sample ended up with too few hashes (so maybe it was better to take a fixed number of hashes for each sample).

However, these sketches work well with the larger database (GTDB), so I believed the issue might not be related to the different scaling.

ctb commented 9 months ago

Sorry, I'm still tracking this down! I understand where the problem lies but not exactly why it's happening - working on it over in #2832.

But I wanted to respond to this:

It was done this way because the test samples vary in size, and with uniform scaling, one sample ended up with too few hashes (so maybe it was better to take a fixed number of hashes for each sample).

Note that for comparison purposes, everything needs to be at the same scaled, so all samples are scaled "up" to the highest scaled value needed for the comparison. This operation is always possible on FracMinHash sketches (just like reducing n is always possible on MinHash sketches). So, while you can create sketches with different scaled values, it doesn't mean that they will all be used in a specific comparison. We try to output good warning messages on this front but may not always succeed.

There is a brief technical discussion of this in the internals guide, and a more complete (but academic) discussion in Irber et al., 2022.