sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

`MinHash.remove_many` continues to be very slow when removing many hashes #1617

Open ctb opened 3 years ago

ctb commented 3 years ago

despite https://github.com/dib-lab/sourmash/issues/1571, the problems in #1552 continued after switching to the new remove_many implementation, until I refactored the enclosing script in #1613.

The following script reproduces the problem:

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('converting to mutable...')
mh = big_sig.minhash.to_mutable()
print('...done.')

print('subtracting...')
mh2 = mh.remove_many(big_sig.minhash)
print('...done')

ctb commented 3 years ago

The .sig file can be downloaded here, https://osf.io/egks5/, or is in ~ctbrown/prefetch-hangest/ on farm.

mr-eyes commented 3 years ago

I think the only way to enhance this is to parallelize the removals, or to convert the hashes vector and the parallel abundance vector into a single hashmap<hash, abundance>. In the provided example there will be 8,577,001 search operations on the vector, and each one is, as far as I know, O(n) just to find the item's index before removing it. (A toy sketch of the hashmap idea is below.)
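
A toy Python sketch of that idea (illustrative values only; a plain dict stands in for a Rust hashmap):

# Illustrative sketch: a dict maps each hash to its abundance, so each
# removal is an O(1) average delete instead of an O(n) search + shift.
hashes = [5, 17, 42, 99]          # toy stand-ins for MinHash hashes
abunds = [1, 3, 2, 7]             # matching abundances
hash_to_abund = dict(zip(hashes, abunds))

for h in [17, 99]:                # hashes to remove
    hash_to_abund.pop(h, None)    # no shifting of later elements

print(sorted(hash_to_abund.items()))  # [(5, 1), (42, 2)]

The trade-off is that a hashmap gives up the sorted order the current vectors maintain, so the order would have to be recovered (e.g. by sorting the keys) wherever it matters.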

ctb commented 3 years ago

This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think? 😜

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')

print('subtracting...')
y = set(x)
z = x - y
print('...done!')

mr-eyes commented 3 years ago

Hahaha, using a set will be way faster...

Here's what is happening in Rust:

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to list...')

hashes_list = list(big_sig.minhash.hashes)
abundance_list = hashes_list.copy() # Simulate abundance vector

to_be_removed = hashes_list.copy()

print('subtracting...')

for hash in to_be_removed:
    idx = hashes_list.index(hash)  # O(n) scan to find the index (Rust uses a binary search)
    del hashes_list[idx]           # shifts every element after idx down by one
    del abundance_list[idx]        # same shift for the abundance vector

print("Done")

Vectors are used to hold the hash and abundance values so that they stay in order; using a set instead of a vector would not preserve that ordering.
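
Order can be kept without the quadratic shifting, though: a single filtering pass (a retain-style rebuild) moves each element at most once. A toy sketch with made-up values:

# Illustrative sketch: rebuild both vectors in one pass instead of
# deleting elements one at a time. Order is preserved and each element
# is moved at most once, so the whole removal is O(n).
hashes = [5, 17, 42, 99]
abunds = [1, 3, 2, 7]
to_remove = {17, 99}  # set membership test is O(1) average

kept = [(h, a) for h, a in zip(hashes, abunds) if h not in to_remove]
hashes = [h for h, _ in kept]
abunds = [a for _, a in kept]

print(hashes, abunds)  # [5, 42] [1, 2]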

luizirber commented 3 years ago

Let's try to disentangle the many threads going on in this conversation =]

This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think?

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')

print('subtracting...')
y = set(x)
z = x - y
print('...done!')

This code takes shortcuts (it does not call .to_mutable(), which triggers a copy of the data, nor does it end with a usable MinHash after the operation), so it will already be faster. Nonetheless, converting to a set makes it use twice the memory. Using memory-profiler (as @mr-eyes did in #1571), these are the results:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     3   52.844 MiB   52.844 MiB           1   @profile
     4                                         def main():
     5   52.844 MiB    0.000 MiB           1       print('loading...')
     6  186.164 MiB  133.320 MiB           1       big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
     7  186.164 MiB    0.000 MiB           1       print(f'...done. loaded {len(big_sig.minhash)} hashes.')
     8
     9  186.164 MiB    0.000 MiB           1       print('Converting to set...')
    10  741.484 MiB  555.320 MiB           1       x = set(big_sig.minhash.hashes)
    11  741.484 MiB    0.000 MiB           1       print('...done!')
    12
    13  741.484 MiB    0.000 MiB           1       print('subtracting...')
    14 1253.238 MiB  511.754 MiB           1       y = set(x)
    15 1253.238 MiB    0.000 MiB           1       z = x - y
    16 1253.238 MiB    0.000 MiB           1       print('...done!')

My point here: the Rust code is trying to avoid allocating more memory than needed, and this is DISASTROUS with the current implementation when removing many hashes. Since it is an ordered vector, each removal has to shift (memmove) a large chunk of the vector down by one slot (as Mo pointed out in his explanation of what Rust is doing). This is easy to see with py-spy top -n:

Collecting samples from 'python -m memory_profiler -o rust.time slow_remove.py' (python v3.8.9)
Total Samples 248200
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%    2481s     2481s   __memmove_avx_unaligned_erms (libc-2.32.so)
  0.00% 100.00%   0.620s     2482s   sourmash::sketch::minhash::KmerMinHash::remove_hash::hec5d940496ec6541 (sourmash/_lowlevel__lib.so)
  0.00% 100.00%   0.140s     2482s   sourmash::ffi::utils::landingpad::hf48eacb578fa3b98 (sourmash/_lowlevel__lib.so)
...

It literally spends all the time moving memory around.
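
A rough back-of-the-envelope sketch of why (my arithmetic, not from the profile above; it assumes the hashes are removed smallest-first from the sorted vector, which makes every delete shift the entire remaining tail):

# Illustrative estimate of the shifting cost, assuming smallest-first removal:
# delete #1 shifts n-1 elements, delete #2 shifts n-2, ..., so the total is
# n*(n-1)/2 element moves of 8-byte hashes (abundances roughly double it).
n = 8_679_673                        # hashes in the example signature
moves = n * (n - 1) // 2
print(f"{moves:.2e} element moves")  # ~3.8e+13
print(f"~{moves * 8 / 1e12:.0f} TB of memmove traffic for the hashes alone")

The exact removal order matters, but the blow-up is quadratic either way.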


So, what to do?


If you can cheat, I can cheat too =]

There is some wonkiness in the API, but I used the released crate (so no optimizations in KmerMinHashBTree like I suggested above) to write a quick-and-dirty program to do the same.

use sourmash::signature::{Signature, SigsTrait};
use sourmash::sketch::minhash::KmerMinHashBTree;
use sourmash::sketch::Sketch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("loading...");
    let (reader, _) = niffler::from_path("SRR7186547.k31.sig")?;
    let mut sig: Vec<Signature> = serde_json::from_reader(reader)?;

    if let Sketch::MinHash(big_sig) = sig.swap_remove(0).sketches().swap_remove(0) {
        println!("...done. loaded {} hashes.", big_sig.size());

        println!("converting to mutable...");
        let mut mh: KmerMinHashBTree = big_sig.into();
        println!("...done");

        println!("subtracting...");
        mh.remove_many(&mh.mins())?;
        println!("...done");
    }

    Ok(())
}

Timings

Rust:

$ /usr/bin/time -v -- cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/remove_hashes`
loading...
...done. loaded 8679673 hashes.
converting to mutable...
...done
subtracting...
...done
        Command being timed: "cargo run --release"
        User time (seconds): 4.19
        System time (seconds): 0.53
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.72
        Maximum resident set size (kbytes): 950944
        File system inputs: 336772

Python (with sets):

$ /usr/bin/time -v -- python python_remove.py
loading...
...done. loaded 8679673 hashes.
Converting to set...
...done!
subtracting...
...done!
        Command being timed: "python python_remove.py"
        User time (seconds): 8.00
        System time (seconds): 2.56
        Percent of CPU this job got: 117%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.95
        Maximum resident set size (kbytes): 1342936
        File system inputs: 345687

ctb commented 3 years ago

If you can cheat, I can cheat too =]

😆

My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!

ctb commented 3 years ago

and let me just say how adorable the :crab: and :snake: are!

luizirber commented 3 years ago

My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!

I know, I know. Just pointing out that the options were there all along, but... not implemented all the way across the FFI =P

and let me just say how adorable the :crab: and :snake: are!

Right? Gonna start using them all the time when talking about Rust and Python types :smile_cat:

mr-eyes commented 2 years ago

Coming from #1771

https://github.com/sourmash-bio/sourmash/blob/401ba4873f2d81a86c3046d2b61613005dc8423a/src/core/src/sketch/minhash.rs#L399-L409

Performing a binary search on every delete is expected to slow down the process of removing many elements.

Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?

mr-eyes commented 2 years ago

Ok, that was already discussed :neckbeard:.

ctb commented 2 years ago

we should revisit the code in https://github.com/sourmash-bio/sourmash/pull/2123 if/when we speed up remove_many.

ctb commented 2 years ago

I ran the scaled=1000 version of the benchmark in #1747:

time py-spy record -o latest.svg -- sourmash gather SRR10988543.sig bins.sbt.zip >& latest.out

and saw the following:

[Screenshot: py-spy flame graph of the gather run, 2022-07-21]

so, still some work to do here :)

It took just under 2 hours to run.

ctb commented 2 years ago

oh, and when we zoom in on the block on the right, we see remove_many continues to be a problem: it's the big block on the left.

[Screenshot: zoomed py-spy flame graph showing remove_many as the big block on the left, 2022-07-22]

the other big chunk of time is in this generator expression in search.py (link to latest):

        weighted_missed = sum((orig_query_abunds[k] for k in query_hashes))
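
For reference, one possible micro-optimization for that line (an untested sketch; it assumes orig_query_abunds is a plain dict):

# Untested sketch: sum(map(...)) with a bound dict lookup skips the
# per-item generator frame that the original expression creates.
weighted_missed = sum(map(orig_query_abunds.__getitem__, query_hashes))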