sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

what can we use for multithreaded/multiprocess safe signature writing? #1911

Open ctb opened 2 years ago

ctb commented 2 years ago

per https://github.com/sourmash-bio/sourmash/issues/1671#issuecomment-1077585283, we don't have a good way to write signatures to a single file from multiple processes. Such a thing would be nice, but would probably entail some kind of locking...

This is a challenge for things like sourmash sketch fromfile where we would like to have parallel signature sketching but would need to make the output single-threaded.
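A minimal sketch of what the locking option could look like, assuming a Unix system and a line-oriented output file. This is not how sourmash writes signatures: the JSON-lines format and the `append_record_locked` helper are hypothetical, and the advisory lock only helps if every writer takes it.

```python
import fcntl
import json


def append_record_locked(path, record):
    """Append one JSON line to `path` under an exclusive advisory lock (Unix only).

    Hypothetical helper: the one-record-per-line format is an assumption
    made for illustration, not sourmash's signature format.
    """
    with open(path, "a") as fp:
        # Blocks until any other cooperating writer releases its lock.
        fcntl.flock(fp, fcntl.LOCK_EX)
        try:
            fp.write(json.dumps(record) + "\n")
            # Flush while still holding the lock so records do not interleave.
            fp.flush()
        finally:
            fcntl.flock(fp, fcntl.LOCK_UN)
```

Each process would call `append_record_locked("sigs.jsonl", {...})` independently. This works for append-only text output, but does not carry over directly to container formats such as zip, which need a single writer to maintain the central directory.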

ctb commented 2 years ago

keywords: parallel output

ctb commented 2 years ago

idea in #2033: write to many different files, use a single manifest to point at the different files.
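A rough sketch of that idea, using only the Python standard library. The shard file names, the JSON payload, the manifest columns (`location`, `name`, `md5`), and both function names are assumptions for illustration and do not reflect the real sourmash manifest format. Each worker owns its own output file, so no locking is needed; only the manifest is written by a single writer at the end.

```python
import csv
import hashlib
import json
import os
from concurrent.futures import ThreadPoolExecutor


def write_shard(shard_id, records, out_dir):
    """Write one shard file and return manifest rows that point into it."""
    filename = f"shard-{shard_id}.json"
    entries, rows = [], []
    for name, seq in records:
        # Stand-in for real sketching: just hash the sequence.
        md5 = hashlib.md5(seq.encode()).hexdigest()
        entries.append({"name": name, "md5": md5})
        rows.append({"location": filename, "name": name, "md5": md5})
    with open(os.path.join(out_dir, filename), "w") as fp:
        json.dump(entries, fp)
    return rows


def write_sharded(records, out_dir, n_shards=4):
    """Split records across shards, write them concurrently, then one manifest."""
    os.makedirs(out_dir, exist_ok=True)
    shards = [records[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        all_rows = list(
            pool.map(write_shard, range(n_shards), shards, [out_dir] * n_shards)
        )
    # Single-threaded step: the manifest is the only shared output.
    manifest_path = os.path.join(out_dir, "manifest.csv")
    with open(manifest_path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=["location", "name", "md5"])
        writer.writeheader()
        for rows in all_rows:
            writer.writerows(rows)
    return manifest_path
```

Threads are used here only to keep the example self-contained. For CPU-bound sketching in pure Python, a process pool would be the more likely choice, with the same one-file-per-worker layout.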

ctb commented 2 years ago

@dkoslicki asked about multithreaded sketching on matrix chat; I responded:

Dug into this a bit more; the real problem is here: https://github.com/sourmash-bio/sourmash/issues/1911 - we don't have a good way to write sketches to a single file from multiple processes.

@luizirber added -

But we can set a specific thread to write and still calculate the sigs in parallel. I did a quick skim of sourmash sketch fromfile, and the place that needs to change is https://github.com/sourmash-bio/sourmash/blob/b1ddabcb05d3455affa862df33b039348c437d61/src/sourmash/command_sketch.py#L301-L351

Either switching to multiprocessing or rewriting this function in Rust can achieve parallelism. It is easier if the order in which the sketches are added is not important (and I don't think it is? manifests are the source of order), but even if they need to be in the same order it is doable.
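The dedicated-writer pattern described above might look roughly like the following. This is a sketch under stated assumptions, not the actual `sketch fromfile` code: `fake_sketch`, `writer_loop`, `sketch_all`, and the `signatures/<name>.sig` layout inside the zip are all hypothetical, and the sketching step is replaced by a trivial hash. Workers only compute and enqueue results; one thread owns the output file, and entries land in completion order rather than input order.

```python
import hashlib
import json
import queue
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor


def fake_sketch(name, sequence):
    """Hypothetical stand-in for sketching: returns (name, JSON bytes)."""
    digest = hashlib.md5(sequence.encode()).hexdigest()
    return name, json.dumps({"name": name, "md5": digest}).encode()


def writer_loop(out_path, results):
    """Sole owner of the output zip; drains the queue until it sees None."""
    with zipfile.ZipFile(out_path, "w") as zf:
        while True:
            item = results.get()
            if item is None:  # sentinel: all workers are done
                break
            name, payload = item
            zf.writestr(f"signatures/{name}.sig", payload)


def sketch_all(records, out_path, n_workers=4):
    """Compute sketches concurrently while a single thread does all writing."""
    results = queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(out_path, results))
    writer.start()

    def work(name, seq):
        results.put(fake_sketch(name, seq))

    # Leaving the `with` block waits for every submitted task to finish.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for name, seq in records:
            pool.submit(work, name, seq)

    results.put(None)  # tell the writer to close the file
    writer.join()
```

Because of the GIL, Python threads will not speed up CPU-bound sketching written in pure Python; the same structure applies with a process pool feeding results back to the parent, or with the compute step implemented in Rust as suggested above.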

ctb commented 1 year ago

See the plugin https://github.com/sourmash-bio/pyo3_branchwater: its manysketch command writes to zip files.