sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

`sourmash sig describe` can't process large signatures. #2966

Open mr-eyes opened 9 months ago

mr-eyes commented 9 months ago

This is an expected limitation in sourmash when signatures are handled as Python data structures. As the traceback below shows, the `hashes` property copies every hash into a Python dictionary, and for a signature this large that allocation fails with a `MemoryError`.

Signatures this large are rare, but I am reporting the bug anyway.

sourmash sig describe kmers.sig

== This is sourmash version 4.8.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Traceback (most recent call last):
  File "/home/mhussien/miniconda3/envs/test/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mhussien/miniconda3/envs/test/lib/python3.11/site-packages/sourmash/__main__.py", line 19, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/home/mhussien/miniconda3/envs/test/lib/python3.11/site-packages/sourmash/cli/sig/describe.py", line 60, in main
    return sourmash.sig.__main__.describe(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mhussien/miniconda3/envs/test/lib/python3.11/site-packages/sourmash/sig/__main__.py", line 253, in describe
    sum_hashes = sum(mh.hashes.values())
                     ^^^^^^^^^
  File "/home/mhussien/miniconda3/envs/test/lib/python3.11/site-packages/sourmash/minhash.py", line 508, in hashes
    return _HashesWrapper({ k : 1 for k in d })
                          ^^^^^^^^^^^^^^^^^^^^
  File "/home/mhussien/miniconda3/envs/test/lib/python3.11/site-packages/sourmash/minhash.py", line 508, in <dictcomp>
    return _HashesWrapper({ k : 1 for k in d })
                          ^^^^^^^^^^^^^^^^^^^^
MemoryError

ctb commented 9 months ago

Yep - and in fact, the whole `_HashesWrapper` area is a great target for oxidation 🦀, with potentially far-reaching speed and memory improvements!

This might be something that https://github.com/sourmash-bio/sourmash/pull/2943 will help with, or perhaps a targeted effort independently of that.
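
For illustration only, here is a minimal Python sketch of why the failing line is costly. The traceback shows `sum(mh.hashes.values())` being computed over a freshly built `{k: 1 for k in d}` dictionary. Assuming the hashes are distinct, that sum is simply the number of hashes, so the total can be obtained without the intermediate dictionary. The function names and the `demo_hashes` list are hypothetical stand-ins, not sourmash APIs or the project's actual fix.

```python
import sys


def sum_via_dict(hashes):
    """Mirror the pattern in the traceback: build {hash: 1}, then sum the values."""
    as_dict = {h: 1 for h in hashes}
    # sys.getsizeof reports only the dict container, not the keys and values it holds.
    return sum(as_dict.values()), sys.getsizeof(as_dict)


def sum_without_dict(hashes):
    """Same total with no intermediate dict (assumes the hashes are distinct)."""
    return len(hashes)


# Hypothetical stand-in for a signature's hash values (not real sourmash data).
demo_hashes = list(range(100_000))

total, dict_bytes = sum_via_dict(demo_hashes)
assert total == sum_without_dict(demo_hashes)
print(f"hashes: {total}, dict container alone: {dict_bytes} bytes")
```

The dictionary's memory grows with the number of hashes, while the count needs no extra allocation, which is the kind of saving a targeted fix or a Rust-side calculation could provide.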

ctb commented 9 months ago

Also somewhat related: https://github.com/sourmash-bio/sourmash/issues/2898

ctb commented 9 months ago

(Yes, the relevant calculations are being moved into Rust in #2943.)