sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0
14 stars 2 forks source link

`multisearch` continues past "no query signatures loaded, exiting" message #280

Open ctb opened 4 months ago

ctb commented 4 months ago

when I run:

sourmash scripts multisearch zip-without-manifest.zip zip-with-manifest.zip -o zzz.csv

I get:

=> sourmash_plugin_branchwater 0.9.3.dev0; cite Irber et al., doi: 10.1101/2022.11.02.514947

ksize: 31 / scaled: 1000 / moltype: DNA / threshold: 0.01
searching all sketches in 'zip-without-manifest.zip' against 'zip-with-manifest.zip' using 8 threads
Reading query(s) from: 'zip-without-manifest.zip'
Sketch loading error: No such file or directory (os error 2)
WARNING: could not load sketches from path 'PK'
No valid signatures found in query pathlist 'zip-without-manifest.zip'
WARNING: 1 query paths failed to load. See error messages above.
No query signatures loaded, exiting.
Reading search(s) from: 'zip-with-manifest.zip'
Loaded 1 search signature(s)
DONE. Processed 0 comparisons
...multisearch is done! results in 'zzz.csv'

it seems a little weird it tells me that "No query signatures loaded, exiting." and then ...keeps on going.

What's even weirder is that I can't figure out where the message is coming from - the strong "No query signatures loaded" doesn't seem to show up in either the branchwater plugin repo, or the sourmash repo!?

ctb commented 4 months ago

ahh, found it in utils.rs:

    // Validate sketches                                                        
    if collection.is_empty() {
        eprintln!("No {} signatures loaded, exiting.", report_type);
        return Ok(());
    }
    eprintln!("Loaded {} {} signature(s)", collection.len(), report_type);
    Ok(())
bluegenes commented 4 months ago

Right, this is related to https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/207.

I can make a new issue to better reflect the new goal, which is to actually err from here, rather than returning Ok.

ctb commented 4 months ago

I was thinking of linking that issue in, but I feel like this is slightly different, because this is a specific error that shouldn't be bypassed. I'll muse more.

bluegenes commented 4 months ago

ok, sure - I see it. Here you're specifically thinking about zip errors.

So the reason errors are not automatically propagated up is because of way I changed sequential loading to allow manifests (in load_collection). Previously, if the extension was 'zip', we tried loading as a zipfile. If not, we tried as sig first and then fell back to pathlists. Since manifests and pathlists are so similar, I was having trouble integrating manifests into the same strategy.

Now, we try each loading function (zip > manifest > signature > pathlist). If the file can be loaded, we will load, even if that collection is empty. If we encounter an error, we track it and report the final error. I thought this was working well for reporting the right errors, but apparently not when the zip fails?

I'm definitely open to better sequential loading / error propagation strategies.

A simple zipfile fix would be to only use the zip loading for '.zip' and report errors directly. For the rest, I'm not sure how to better manage loading functions without enforcing file extensions or otherwise specifying which type of input we have (and therefore which loading methods we should try).