sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
466 stars 79 forks source link

Failed assertion when using `compare` with `--ani` in parallel #2262

Closed dkoslicki closed 2 years ago

dkoslicki commented 2 years ago

I have a protein signature:

$ sourmash sig fileinfo kegg_genes_KO_1000.faa_scale_1.db.zip

== This is sourmash version 4.4.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'kegg_genes_KO_1000.faa_scale_1.db.zip'
path filetype: ZipFileLinearIndex
location: /data/shared_data/KEGG_data/sourmash_sketches/output/small_subset/kegg_genes_KO_1000.faa_scale_1.db.zip
is database? yes
has manifest? yes
num signatures: 4000
** examining manifest...
total hashes: 1358843
summary of sketches:
   1000 sketches with protein, k=5, scaled=1, abund   344055 total hashes
   1000 sketches with protein, k=7, scaled=1, abund   342227 total hashes
   1000 sketches with protein, k=11, scaled=1, abund  338275 total hashes
   1000 sketches with protein, k=15, scaled=1, abund  334286 total hashes

When running:

$ sourmash compare kegg_genes_KO_1000.faa_scale_1.db.zip -k 11 --protein --ani -o kegg_genes_KO_1000.faa_scale_1.db.k_11_compare -p 10 --no-dna

I get the following error:

== This is sourmash version 4.4.3. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1000 signatures total.
NOTE: --containment, --max-containment, --avg-containment, and --estimate-ani ignore signature abundances.

Created memmapped siglist
Initialized memmapped similarities matrix
Created similarity func
Calculated chunk size for multiprocessing
Initialized multiprocessing pool.imap
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/compare.py", line 210, in get_similarities_at_index
    similarity_list = list(map(func, sig_iterator))
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/compare.py", line 176, in similarity_args_unpack
    ani = sig1.jaccard_ani(sig2, downsample=downsample).ani
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/signature.py", line 147, in jaccard_ani
    return self.minhash.jaccard_ani(other.minhash, downsample=downsample,
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/minhash.py", line 702, in jaccard_ani
    j_aniresult = jaccard_to_distance(jaccard, self_mh.ksize, scaled,
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/minhash.py", line 544, in ksize
    assert k % 3 == 0
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/__main__.py", line 13, in main
    return mainmethod(args)
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/cli/compare.py", line 91, in main
    return sourmash.commands.compare(args)
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/commands.py", line 162, in compare
    similarity = compare_all_pairs(siglist, args.ignore_abundance,
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/compare.py", line 314, in compare_all_pairs
    similarities = compare_parallel(siglist, ignore_abundance, downsample, n_jobs, return_ani=return_ani)
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/site-packages/sourmash/compare.py", line 278, in compare_parallel
    for index, l in enumerate(result):
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/multiprocessing/pool.py", line 420, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/home/faculty/dmk333/.conda/envs/sourmash_env/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
AssertionError

This does not occur in the serial version:

$ sourmash compare kegg_genes_KO_1000.faa_scale_1.db.zip -k 11 --protein --ani -o kegg_genes_KO_1000.faa_scale_1.db.k_11_compare --no-dna

Nor in the parallelized, non-ANI version:

$ sourmash compare kegg_genes_KO_1000.faa_scale_1.db.zip -k 11 --protein -o kegg_genes_KO_1000.faa_scale_1.db.k_11_compare --no-dna -p 10

It appears that it doesn't like the k-mer size self_mh.ksize that is being passed as an argument here. I hypothesize this is due to a k-mer size not being multiplied by 3 somewhere. Eg. similarity does not appear to call self_mh.ksize anywhere, yet jaccard_ani does.

Commenting out this line of code makes the error go away. But I don't know the ramification of removing this assertion to other parts of the source code.

dkoslicki commented 2 years ago

Disconcertingly, when adding a print(k) here, it appears to bounce back and forth between the original k passed in from the CLI and 3*k.

@bluegenes Is it possible that there's some oddity about when and where the -k is (or isn't) converted to 3*k and/or k/3?

ctb commented 2 years ago

yes, the Rust layer uses k*3 internally for protein, while the Python layer uses k. Looks like you found a place we didn't do a translation :)

note, see https://github.com/sourmash-bio/sourmash/issues/1383

will look into it!

ctb commented 2 years ago

whew, that was a journey.

Turns out that we are not properly handling pickle-based serialization of non-DNA MinHash objects. This is a general problem; here it was triggered by the use of serialization to communicate objects between different processes.

In slightly more detail,

Adding to this, it looks like numpy is implicitly/silently using the __reduce__ protocol.

Anyhoo. Fixed in https://github.com/sourmash-bio/sourmash/pull/2265.

ctb commented 2 years ago

whoops, looks like it's multiprocessing that was correctly using the pickle protocol, which is even documented properly; and it was just stumbling over broken sourmash code. So I made this more complicated than it needed to be in the previous comment 😆