sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

Sourmash panicked when it "Couldn't find End Of Central Directory Record" #3190

Closed ccbaumler closed 3 months ago

ccbaumler commented 4 months ago

The command

While building the AllTheBacteria sourmash DB, I am using:

find /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/ -maxdepth 2 -type f -name "*.zip" -exec sh -c 'sourmash sig cat "$@" -o "$0" ' "../allthebacteria-r0.2.zip" {} +

This finds all the zip files nested in the path. The zip files found are passed as positional arguments ("$@") to sourmash sig cat, with "$0" holding the output path.

The error

The error produced:

sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Couldn't find End Of Central Directory Record")' at src/core/src/storage.rs:358 

The investigation

I saw two possible causes immediately:

  1. There is an issue with one of the 665 sourmash databases I created from the AllTheBacteria tar.xz files
  2. There was not enough memory and the command ended.

Because find returns files in no guaranteed order, I do not know which file the error occurred on. Therefore, I have made two separate attempts to find a signature file that reproduces the error above:

find /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/ -maxdepth 2 -type f -name "*.zip" -exec sh -c 'sourmash sig summarize "$0" ' {} \; | awk '{print}' ORS='" ' 2>&1 | tee -a summarize.log

This command finds all the zip files and runs sig summarize on each one. The output is collapsed into a single line by setting awk's output record separator (ORS) to '" '. According to @ctb, summarize may only look at the manifest: "sig cat and sig describe load the sketches themselves." There was no error found in the

I am currently running this command to investigate further:

find /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/ -maxdepth 2 -type f -name "*.zip" -exec sh -c 'echo "$0"; sourmash sig summarize "$0"; sourmash sig cat "$0" -q -o trash' {} \; 2>&1 | tee -a blah.log

I am also attempting to sig cat only one k size at a time instead of all three, in case it is a working memory error.
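The per-file search described above can also be sketched in Python. This is a minimal sketch, not part of sourmash: the function name find_bad_zips is hypothetical, and it relies only on the standard library's zipfile.is_zipfile, which is a quick plausibility check rather than a full integrity test. Paths are visited in sorted order so that repeated runs report files in the same sequence.

```python
import zipfile
from pathlib import Path


def find_bad_zips(root):
    """Return the .zip files under `root` that do not look like zip archives.

    `find_bad_zips` is a hypothetical helper for illustration only.
    """
    bad = []
    # Sort so the scan order is deterministic, unlike raw `find` output.
    for path in sorted(Path(root).rglob("*.zip")):
        # is_zipfile() returns False for empty or truncated files that
        # lack an end-of-central-directory record.
        if not zipfile.is_zipfile(path):
            bad.append(path)
    return bad


if __name__ == "__main__":
    import sys

    for bad_path in find_bad_zips(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(bad_path)
```

A file that passes this check can still be damaged in other ways, so it narrows the search rather than proving every archive is loadable.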

ctb commented 4 months ago

ah-hah! I am virtually positive that the error is from zip itself, so sig summarize should trigger it, as should a straight up unzip -v. You might look for a zero-size zip file.

It may also be that sig summarize is handling the error properly while sig cat is not.

I'll have to think about ways to track this down and/or better handle this kind of error. Thanks for reporting!
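To make the error message concrete: a zip reader locates the archive's central directory through an end-of-central-directory (EOCD) record near the end of the file, so a zero-size or truncated file has nothing to find. The sketch below is an assumption-laden illustration, not sourmash or unzip code. The helper name has_eocd is hypothetical, and it only checks that the 4-byte EOCD signature appears in the region where the record is allowed to start.

```python
import os

EOCD_SIGNATURE = b"PK\x05\x06"  # marks the end-of-central-directory record
EOCD_MIN_SIZE = 22              # size of the record without its comment
MAX_COMMENT = 0xFFFF            # the trailing comment is at most 65535 bytes


def has_eocd(path):
    """Heuristically report whether `path` contains an EOCD signature.

    `has_eocd` is a hypothetical helper; it does not validate the record.
    """
    size = os.path.getsize(path)
    if size < EOCD_MIN_SIZE:
        # Covers zero-size files and anything too short to hold the record.
        return False
    tail_len = min(size, EOCD_MIN_SIZE + MAX_COMMENT)
    with open(path, "rb") as fh:
        fh.seek(size - tail_len)
        tail = fh.read(tail_len)
    # The signature must start early enough to leave room for the
    # remaining 18 bytes of the fixed-size record.
    return tail.rfind(EOCD_SIGNATURE, 0, len(tail) - EOCD_MIN_SIZE + 4) != -1
```

Only the tail of the file is read, which is one plausible reason a check of this kind can fail quickly on a large broken archive.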

ccbaumler commented 4 months ago

The final command I listed worked like a charm. Took some time to run through all 700 files, but I was easily able to find the culprit by searching the log file created.

While each of the commands @ctb listed returns a similar error, unzip -v did so the fastest.

sig summarize

sourmash sig summarize /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip
Error message
```
== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
** loading from '/group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip'
Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/__main__.py", line 19, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/cli/sig/fileinfo.py", line 46, in main
    return sourmash.sig.__main__.fileinfo(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sig/__main__.py", line 1274, in fileinfo
    idx = sourmash_args.load_file_as_index(args.path,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 65, in load_file_as_index
    return _load_database(filename, yield_all_files)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 113, in _load_database
    db = load_fn(filename,
         ^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 216, in _load_zipfile
    db = ZipFileLinearIndex.load(filename,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/index/__init__.py", line 586, in load
    storage = ZipStorage(location)
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 107, in __init__
    self._objptr = rustcall(lib.zipstorage_new, to_bytes(path), len(path))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Couldn't find End Of Central Directory Record")' at src/core/src/storage.rs:358
```
0.70user 1.22system 0:04.13elapsed 46%CPU (0avgtext+0avgdata 565248maxresident)k
967504inputs+8outputs (7118major+33529minor)pagefaults 0swaps

sig cat

sourmash sig cat /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip -o delet-me
Error message
```
== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
Traceback (most recent call last):
  File "/home/baumlerc/miniforge3/envs/sourmash/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/__main__.py", line 19, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/cli/sig/cat.py", line 58, in main
    return sourmash.sig.__main__.cat(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sig/__main__.py", line 130, in cat
    for ss, sigloc in loader:
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sourmash_args.py", line 642, in load_many_signatures
    idx = load_file_as_index(loc, yield_all_files=yield_all_files)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 65, in load_file_as_index
    return _load_database(filename, yield_all_files)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 113, in _load_database
    db = load_fn(filename,
         ^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/save_load.py", line 216, in _load_zipfile
    db = ZipFileLinearIndex.load(filename,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/index/__init__.py", line 586, in load
    storage = ZipStorage(location)
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 107, in __init__
    self._objptr = rustcall(lib.zipstorage_new, to_bytes(path), len(path))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/baumlerc/miniforge3/envs/sourmash/lib/python3.12/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Couldn't find End Of Central Directory Record")' at src/core/src/storage.rs:358
```
0.79user 1.53system 0:04.34elapsed 53%CPU (0avgtext+0avgdata 563200maxresident)k
968704inputs+8outputs (7151major+33477minor)pagefaults 0swaps

unzip -v

unzip -v  /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip -d delete-me/
caution:  not extracting; -d ignored
Archive:  /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip or
        /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip.zip, and cannot find /group/ctbrowngrp4/2024-ccbaumler-allthebacteria/allthebacteria-r0.2-sigs/unknown__06/unknown__06.zip.ZIP, period.
0.00user 0.00system 0:00.00elapsed 37%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+354minor)pagefaults 0swaps

ctb commented 3 months ago

OK, so this error is triggered by faulty zip files. Maybe we should be returning a better error when the zip file is faulty 🤔
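One way to picture a "better error" is a wrapper that catches the low-level failure and re-raises it with the offending filename attached. The following is only a sketch under stated assumptions: FaultyZipError and open_zip_checked are hypothetical names, the code uses Python's standard zipfile module rather than sourmash's Rust storage layer, and it says nothing about how sourmash actually resolved this.

```python
import zipfile


class FaultyZipError(Exception):
    """Hypothetical exception that names the archive that failed to open."""


def open_zip_checked(path):
    """Open `path` as a zip archive or raise an error that names the file.

    `open_zip_checked` is a hypothetical helper for illustration only.
    """
    try:
        return zipfile.ZipFile(path)
    except zipfile.BadZipFile as exc:
        # Attach the path so a batch job over many archives reports
        # which input was faulty instead of only the low-level cause.
        raise FaultyZipError(f"'{path}' is not a readable zip file: {exc}") from exc
```

With a wrapper like this, a run over hundreds of archives would stop with a message that identifies the bad file directly, rather than requiring a second pass to locate it.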

ctb commented 3 months ago

punting to #3213