sourmash-bio / sourmash_plugin_branchwater

fast, multithreaded sourmash operations: search, compare, and gather.
GNU Affero General Public License v3.0
14 stars 2 forks source link

provide a better error when trying to read a zip file without a manifest #264

Open ctb opened 4 months ago

ctb commented 4 months ago

When we run:

scripts manysearch zip-with-manifest.zip zip-with-manifest.zip -o foo.csv

it works fine; but with

sourmash scripts manysearch zip-without-manifest.zip zip-without-manifest.zip -o foo.csv

we get

Reading query(s) from: 'zip-without-manifest.zip'
Sketch loading error: No such file or directory (os error 2)
WARNING: could not load sketches from path 'PK'
No valid signatures found in query pathlist 'zip-without-manifest.zip'
WARNING: 1 query paths failed to load. See error messages above.
No query signatures loaded, exiting.
Reading search(s) from: 'zip-without-manifest.zip'
Sketch loading error: No such file or directory (os error 2)
WARNING: could not load sketches from path 'PK'
No valid signatures found in search pathlist 'zip-without-manifest.zip'
WARNING: 1 search paths failed to load. See error messages above.
No search signatures loaded, exiting.
DONE. Processed 0 search sigs
...manysearch is done! results in 'foo.csv'

It's fine that it doesn't work but it'd be nice to provide a better error message, like "this is a zip file without a manifest, you suxor."

Related issues:

ctb commented 4 months ago

So, this seems a little weird - in utils.rs,

pub fn collection_from_zipfile(sigpath: &Path, report_type: &ReportType) -> Result<Collection> {
    match Collection::from_zipfile(sigpath) {
        Ok(collection) => Ok(collection),
        Err(_) => bail!("failed to load {} zipfile: '{}'", report_type, sigpath\
),
    }
}

and then in sourmash collection.rs,

    pub fn from_zipfile<P: AsRef<Path>>(zipfile: P) -> Result<Self> {
        let storage = ZipStorage::from_file(zipfile)?;
        // Load manifest from standard location in zipstorage                   
        let manifest = Manifest::from_reader(storage.load("SOURMASH-MANIFEST.csv")?.as_slice())?;
        Ok(Self {
            manifest,
            storage: InnerStorage::new(storage),
        })
    }

and it would seem to me that storage.load should fail if SOURMASH-MANIFEST.csv doesn't exist in the zip file, and the error should be propogated.

ctb commented 4 months ago

Figured it out - load_collection is trapping and ignoring the error from collection_from_zipfile.

My guess is that if we encounter an unloadable zipfile we should error-exit - there's no other collection type that should end in .zip - but will think on't.

bluegenes commented 4 months ago

My guess is that if we encounter an unloadable zipfile we should error-exit - there's no other collection type that should end in .zip - but will think on't.

Agreed.

this comment is perhaps better here, since I suggested the same solution: https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/280#issuecomment-2010076172

So the reason errors are not automatically propagated up is because of way I changed sequential loading to allow manifests (in load_collection). Previously, if the extension was 'zip', we tried loading as a zipfile. If not, we tried as sig first and then fell back to pathlists. Since manifests and pathlists are so similar, I was having trouble integrating manifests into the same strategy.

Now, we try each loading function (zip > manifest > signature > pathlist). If the file can be loaded, we will load, even if that collection is empty. If we encounter an error, we track it and report the final error. I thought this was working well for reporting the right errors, but apparently not when the zip fails?

I'm definitely open to better sequential loading / error propagation strategies.

A simple zipfile fix would be to only use the zip loading for '.zip' and report errors directly. For the rest, I'm not sure how to better manage loading functions without enforcing file extensions or otherwise specifying which type of input we have (and therefore which loading methods we should try).

ctb commented 4 months ago

Right, we run into the same problem in sourmash of course - load_save.py. There I think we catch only ValueError as "this is likely NOT the right database, so try next" and/or detect None being returned as the same thing, while other exceptions are raised and reported.