ctb opened this issue 3 years ago
The .sig file can be downloaded here, https://osf.io/egks5/, or is in ~ctbrown/prefetch-hangest/
on farm.
I think the only way to enhance this is to parallelize the removals, or to convert the hashes vector and its parallel abundance vector into a single hashmap<hash, abundance>. In the provided example, there will be 8,577,001 search operations on the vector; as far as I know, each one is O(n) just to get the item index, and the removal then shifts the rest of the vector.
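A minimal standalone sketch of the difference (toy data and types, not the actual sourmash API):

use std::collections::HashMap;

fn main() {
    // Parallel vectors, as in the current implementation.
    let mut hashes: Vec<u64> = (0..1_000).collect();
    let mut abunds: Vec<u64> = vec![1; 1_000];

    let target = 500u64;
    // Finding the index is a scan (a sorted vector can use binary_search
    // instead), but the remove below still shifts the whole tail.
    if let Some(idx) = hashes.iter().position(|&h| h == target) {
        hashes.remove(idx);
        abunds.remove(idx);
    }

    // A single map from hash to abundance: lookup and removal are O(1) on
    // average, and hash/abundance can never get out of sync.
    let mut map: HashMap<u64, u64> = (0..1_000).map(|h| (h, 1)).collect();
    map.remove(&target);

    assert_eq!(hashes.len(), map.len());
}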
This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think? 😜
import sourmash
print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')
print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')
print('subtracting...')
y = set(x)
z = x - y
print('...done!')
Hahaha, using set will be way faster ...
Here's what is happening on the Rust side, simulated in Python:
import sourmash
print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')
print('Converting to list...')
hashes_list = list(big_sig.minhash.hashes)
abundance_list = hashes_list.copy() # Simulate abundance vector
to_be_removed = hashes_list.copy()
print('subtracting...')
for hash in to_be_removed:
    idx = hashes_list.index(hash)  # O(n) linear scan to find the index
    del hashes_list[idx]           # O(n) shift of everything after idx
    del abundance_list[idx]        # same shift again for the abundances
print("Done")
Vectors are being used to hold the hashes and abundance values so they are kept in order. Using a set instead of a vector will not preserve that order.
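For illustration, a toy sketch (not sourmash code) of why the choice of container matters for ordering: a plain hash map drops the order, while a B-tree map keeps its keys sorted:

use std::collections::{BTreeMap, HashMap};

fn main() {
    let hashes = [42u64, 7, 99, 13];

    // A HashMap gives O(1) removals but iterates in arbitrary order.
    let unordered: HashMap<u64, u64> = hashes.iter().map(|&h| (h, 1)).collect();
    println!("{:?}", unordered.keys().collect::<Vec<_>>()); // order not guaranteed

    // A BTreeMap keeps keys sorted, matching the order the sorted hashes
    // vector currently provides, at O(log n) per removal.
    let ordered: BTreeMap<u64, u64> = hashes.iter().map(|&h| (h, 1)).collect();
    let keys: Vec<u64> = ordered.keys().copied().collect();
    assert_eq!(keys, vec![7, 13, 42, 99]);
}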
Let's try to disentangle a bit the many threads going on in this conversation =]
This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think?
import sourmash
print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')
print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')
print('subtracting...')
y = set(x)
z = x - y
print('...done!')
This code takes shortcuts (it is not doing .to_mutable(), which triggers a copy of the data, nor ending up with a usable MinHash after the operation), so it will already be faster. Nonetheless, converting to a set makes it use twice the memory. Using memory-profiler (as @mr-eyes did in #1571), these are the results:
Line # Mem usage Increment Occurences Line Contents
============================================================
3 52.844 MiB 52.844 MiB 1 @profile
4 def main():
5 52.844 MiB 0.000 MiB 1 print('loading...')
6 186.164 MiB 133.320 MiB 1 big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
7 186.164 MiB 0.000 MiB 1 print(f'...done. loaded {len(big_sig.minhash)} hashes.')
8
9 186.164 MiB 0.000 MiB 1 print('Converting to set...')
10 741.484 MiB 555.320 MiB 1 x = set(big_sig.minhash.hashes)
11 741.484 MiB 0.000 MiB 1 print('...done!')
12
13 741.484 MiB 0.000 MiB 1 print('subtracting...')
14 1253.238 MiB 511.754 MiB 1 y = set(x)
15 1253.238 MiB 0.000 MiB 1 z = x - y
16 1253.238 MiB 0.000 MiB 1 print('...done!')
My point here: the Rust code is trying to avoid allocating more memory than needed, and this is DISASTROUS with the current implementation when removing many hashes. Since it is an ordered vector, for each removal it needs to move large chunks of the vector (as Mo pointed out in his explanation of what Rust is doing). This is easy to see with py-spy top -n:
Collecting samples from 'python -m memory_profiler -o rust.time slow_remove.py' (python v3.8.9)
Total Samples 248200
GIL: 0.00%, Active: 100.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
100.00% 100.00% 2481s 2481s __memmove_avx_unaligned_erms (libc-2.32.so)
0.00% 100.00% 0.620s 2482s sourmash::sketch::minhash::KmerMinHash::remove_hash::hec5d940496ec6541 (sourmash/_lowlevel__lib.so)
0.00% 100.00% 0.140s 2482s sourmash::ffi::utils::landingpad::hf48eacb578fa3b98 (sourmash/_lowlevel__lib.so)
...
It literally spends all the time moving memory around.
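A standalone toy reproduction of the effect (not sourmash code), assuming every element of a sorted vector gets removed, one at a time versus in a single retain pass:

use std::collections::HashSet;
use std::time::Instant;

fn main() {
    let n: u64 = 100_000;
    let to_remove: Vec<u64> = (0..n).collect();

    // One-by-one removal, like the current remove_hash loop: every
    // Vec::remove shifts the whole tail, so clearing the vector this way
    // is O(n^2) memmove work.
    let mut mins: Vec<u64> = (0..n).collect();
    let start = Instant::now();
    for h in &to_remove {
        if let Ok(pos) = mins.binary_search(h) {
            mins.remove(pos);
        }
    }
    println!("one-by-one: {:?}", start.elapsed());

    // A single retain pass touches each element once: O(n) overall.
    let mut mins2: Vec<u64> = (0..n).collect();
    let doomed: HashSet<u64> = to_remove.iter().copied().collect();
    let start = Instant::now();
    mins2.retain(|h| !doomed.contains(h));
    println!("retain:     {:?}", start.elapsed());
}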
So, what to do?

- .to_mutable() should be transforming the MinHash on the Rust side from a KmerMinHash into a KmerMinHashBTree. The latter supports the same operations that the set() in Python is doing, and is going to be much faster.
- mins and abunds can probably be merged into a hashes: BTreeMap<u64, Option<u64>> field, which lets us iterate over and modify both hashes and abundances much more conveniently; a rough sketch of this idea follows after the list. (Incidentally, BTreeSet<T> is implemented as a BTreeMap<T, ()>, so it should still be pretty efficient space-wise, but more testing is needed.)
- FrozenMinhash is already around, but there still needs to be some method exposed on the FFI to convert from KmerMinHash :crab:/Minhash :snake: to KmerMinHashBTree :crab:/FrozenMinhash :snake: and vice-versa, so we can benefit from these changes.
- KmerMinHashBTree is a HORRIBLE name. Probably rename it too while doing these changes.
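A rough sketch of that merged-field idea (hypothetical struct and method names, not the real KmerMinHashBTree):

use std::collections::BTreeMap;

// Hypothetical type for illustration only.
struct SketchMinHash {
    // One entry per hash; the value would be None for sketches
    // that do not track abundances.
    hashes: BTreeMap<u64, Option<u64>>,
}

impl SketchMinHash {
    fn add_hash_with_abundance(&mut self, hash: u64, abundance: u64) {
        // Keys stay sorted automatically, and hash and abundance can
        // never get out of sync because they live in a single entry.
        self.hashes.insert(hash, Some(abundance));
    }

    fn remove_many(&mut self, hashes: &[u64]) {
        // O(m log n) total, with no tail-shifting memmoves.
        for h in hashes {
            self.hashes.remove(h);
        }
    }

    fn mins(&self) -> Vec<u64> {
        // Iteration is in sorted order, like the old mins vector.
        self.hashes.keys().copied().collect()
    }
}

fn main() {
    let mut mh = SketchMinHash { hashes: BTreeMap::new() };
    for h in [42u64, 7, 99] {
        mh.add_hash_with_abundance(h, 1);
    }
    mh.remove_many(&[7, 99]);
    assert_eq!(mh.mins(), vec![42]);
}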
If you can cheat, I can cheat too =]

There is some wonkiness in the API, but I used the released crate (so no optimizations in KmerMinHashBTree like I suggested above) to write a quick-and-dirty program to do the same.
use sourmash::signature::{Signature, SigsTrait};
use sourmash::sketch::minhash::KmerMinHashBTree;
use sourmash::sketch::Sketch;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("loading...");
    // Load the JSON signature directly with niffler + serde_json,
    // skipping the higher-level loading API.
    let (reader, _) = niffler::from_path("SRR7186547.k31.sig")?;
    let mut sig: Vec<Signature> = serde_json::from_reader(reader)?;

    if let Sketch::MinHash(big_sig) = sig.swap_remove(0).sketches().swap_remove(0) {
        println!("...done. loaded {} hashes.", big_sig.size());

        println!("converting to mutable...");
        // Convert the vector-backed KmerMinHash into the BTree-backed
        // KmerMinHashBTree, which supports cheap removals.
        let mut mh: KmerMinHashBTree = big_sig.into();
        println!("...done");

        println!("subtracting...");
        mh.remove_many(&mh.mins())?;
        println!("...done");
    }
    Ok(())
}
Rust:
$ /usr/bin/time -v -- cargo run --release
Finished release [optimized] target(s) in 0.02s
Running `target/release/remove_hashes`
loading...
...done. loaded 8679673 hashes.
converting to mutable...
...done
subtracting...
...done
Command being timed: "cargo run --release"
User time (seconds): 4.19
System time (seconds): 0.53
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.72
Maximum resident set size (kbytes): 950944
File system inputs: 336772
Python (with sets):
$ /usr/bin/time -v -- python python_remove.py
loading...
...done. loaded 8679673 hashes.
Converting to set...
...done!
subtracting...
...done!
Command being timed: "python python_remove.py"
User time (seconds): 8.00
System time (seconds): 2.56
Percent of CPU this job got: 117%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.95
Maximum resident set size (kbytes): 1342936
File system inputs: 345687
If you can cheat, I can cheat too =]
😆
My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!
and let me just say how adorable the :crab: and :snake: are!
My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!
I know, I know. Just pointing out that the options were there all along, but... not implemented all the way across FFI =P
and let me just say how adorable the :crab: and :snake: are!
Right? gonna start using it all the time when talking about Rust and Python types :smile_cat:
Coming from #1771
https://github.com/sourmash-bio/sourmash/blob/401ba4873f2d81a86c3046d2b61613005dc8423a/src/core/src/sketch/minhash.rs#L399-L409
Performing a binary search on every delete is expected to slow down the process of removing many elements.
Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?
Coming from #1771
Performing a binary search on every delete is expected to slow down the process of removing many elements. Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?
Ok, that was already discussed :neckbeard:.
we should revisit the code in https://github.com/sourmash-bio/sourmash/pull/2123 if/when we speed up remove_many.
I ran the scaled=1000 version of the benchmark in #1747:
time py-spy record -o latest.svg -- sourmash gather SRR10988543.sig bins.sbt.zip >& latest.out
and saw the following (flamegraph not reproduced here):
so, still some work to do here :)
It took less than 2 hours to run, but not by much.
oh, and when we zoom in on the block on the right, we see remove_many continues to be a problem: it's the big block on the left. The other big chunk of time is in this generator expression in search.py (link to latest):
weighted_missed = sum((orig_query_abunds[k] for k in query_hashes))
despite https://github.com/dib-lab/sourmash/issues/1571, the problems in #1552 continued after using the new remove_many implementation until I refactored the enclosing script in #1613. The following script reproduces the problem: