Open vmikk opened 10 months ago
Thanks! I see the same error with commit 720962cf304, the latest latest
- the detailed test case really helps 😅 . More soon!
I'm getting some intriguing errors with
sourmash gather ERR5766174_se.sig sourmash-db.zip --threshold-bp=0 --save-prefetch-csv prefetch.csv
as well. It seems likely that there is a bug related to handling of many different scaled values -- here the query metagenome is at scaled of 10, while the database has three different scaled values of 1, 20, and 1600 (!?):
% sourmash sig summarize ERR5766174_se.sig
== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from 'ERR5766174_se.sig'
path filetype: MultiIndex
location: ERR5766174_se.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 4526202
summary of sketches:
1 sketches with DNA, k=31, scaled=10 4526202 total hashes
% sourmash sig summarize sourmash-db.zip
== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from 'sourmash-db.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/failgather/sourmash-db.zip
is database? yes
has manifest? yes
num signatures: 6
** examining manifest...
total hashes: 768424
summary of sketches:
4 sketches with DNA, k=31, scaled=20 735189 total hashes
1 sketches with DNA, k=31, scaled=1 16508 total hashes
1 sketches with DNA, k=31, scaled=1600 16727 total hashes
Yes, scaling is a bit weird in this case :sweat_smile:. It was done this way because the test samples vary in size, and with uniform scaling, one sample ended up with too few hashes (so maybe it was better to take a fixed number of hashes for each sample).
However, these sketches work well with the larger database (GTDB), so I believed the issue might not be related to the different scaling.
Sorry, I'm still tracking this down! I understand where the problem lies but not exactly why it's happening - working on it over in #2832.
But I wanted to respond to this:
It was done this way because the test samples vary in size, and with uniform scaling, one sample ended up with too few hashes (so maybe it was better to take a fixed number of hashes for each sample).
Note that for comparison purposes, everything needs to be at the same scaled, so all samples are scaled "up" to the highest scaled value needed for the comparison. This operation is always possible on FracMinHash sketches (just like reducing n
is always possible on MinHash sketches). So, while you can create sketches with different scaled values, it doesn't mean that they will all be used in a specific comparison. We try to output good warning messages on this front but may not always succeed.
There is a brief technical discussion of this in the internals guide, and a more complete (but academic) discussion in Irber et al., 2022.
Dear sourmash team,
I've recently attempted to create a minimal test dataset and encountered an error when using
sourmash gather
. Specifically, it raises anAssertionError
. I think that the error might be due to the presence of very few sketches in common between the sample and the database. The error appears even with--threshold-bp 0
.Expected Behavior: While this may be a marginal case, I believe the expected behavior in this scenario is for
sourmash gather
to return an empty CSV file with no results, and subsequently an exit code of 0.Actual Behavior: I've got an
AssertionError
when running the command.Sourmash fails after Prefetch and "Doing gather to generate minimum metagenome cover".
Additional Information:
gtdb-rs214-reps.k31.zip
), it works as expected without any issues.Here is the full output log:
The command I executed was:
I've uploaded the test data here:
Thank you for providing such a valuable tool! With kind regards, Vladimir
PS. As this was a small test dataset, I used
scaled=10
to generate sample signatures, and the database signatures were also created with a different scale setting.Given that it works correctly with a larger database, the issue likely isn't related to the scaling difference.