Closed by jakereps 3 years ago
I actually tested doing something like this early on when I was fixing something in demux. What I found was the method is already IO-bound, so adding more threads/processes doesn't really change anything as the drive can only work so quickly.
Would the ability to introduce buffering solve that IO issue, though? Instead of writing to disk on every single pass of the loop, each process could read in 5k, 10k, 15k, or even a randomly generated 10-20k+ number of sequences (to stagger the IO requirements even more) before having to write. The biggest stranglehold right now seems to be writing to disk on every single loop iteration, which makes 15 million+ individual 4-line write calls.
It could actually even remove the need to maintain the open file handle count, because you'd only have the n thread/process-specific files open at any one time, instead of a random assortment determined by which barcode sequence is hit first in the master file.
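Not the actual q2-demux code; just a minimal single-process sketch of the batched-write idea, where sample_for_barcode() and out_path_for_sample() are hypothetical stand-ins for the real barcode lookup and output-path logic, and FLUSH_EVERY is an illustrative number in the 5k-20k range suggested above:

```python
import gzip
from collections import defaultdict

FLUSH_EVERY = 10_000  # illustrative batch size (the 5k-20k range suggested above)

def demux_with_batched_writes(reads_path, sample_for_barcode, out_path_for_sample):
    """Buffer records per sample in memory and write them in batches,
    rather than issuing one write call per 4-line FASTQ record."""
    buffers = defaultdict(list)
    handles = {}

    def flush(sample):
        if sample not in handles:
            handles[sample] = gzip.open(out_path_for_sample(sample), 'wt')
        handles[sample].writelines(buffers[sample])
        buffers[sample].clear()

    with gzip.open(reads_path, 'rt') as reads:
        while True:
            record = [reads.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break
            sample = sample_for_barcode(record)
            buffers[sample].append(''.join(record))
            if len(buffers[sample]) >= FLUSH_EVERY:
                flush(sample)

    for sample in buffers:        # write whatever is left over
        if buffers[sample]:
            flush(sample)
    for fh in handles.values():
        fh.close()
```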
Python already does quite a bit of buffering behind the scenes, so it isn't actually writing 4 lines at a time, it's writing ~8192 bytes at a time. It is possible that we could buffer smarter because we know the access pattern, but that sounds hard.
Ah, I forgot about that: the write actually goes to memory until it hits the buffering threshold. I ran into an issue in tests because of that while working on the summarize viz for demux.
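For context, a tiny illustration of that block buffering and of why tests can trip over it; the buffer size comes from the standard io module, and an explicit flush() is the usual workaround:

```python
import io
import tempfile

print(io.DEFAULT_BUFFER_SIZE)   # 8192 on typical CPython builds

# Works as shown on POSIX; Windows can't reopen a NamedTemporaryFile while it's open.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.fastq', delete=False)
tmp.write('@seq1\nACGT\n+\n!!!!\n')   # sits in the in-memory buffer
print(open(tmp.name).read())          # likely empty: nothing flushed to disk yet
tmp.flush()                           # force the buffered bytes out
print(open(tmp.name).read())          # now the record is on disk
tmp.close()
```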
Well, out of pure curiosity, I might play around with this if that's cool with you guys? It just doesn't feel right that the (relatively) simplest step of the processing pipeline, sorting the sequences into different files, takes the longest.
Of course! Let us know what you find!
Spoiler alert, no easy enhancement was found. Will close to clean up the open issue list.
Improvement Description
Python has multiprocessing.pool.Pool.map, which lets you "spin up" processes and map a list of variables one-by-one to a function. Is this something that could be utilized by demux w.r.t. demultiplexing one barcode at a time, but in parallel?
Current Behavior
Currently it goes through every sequence in the main read set file and writes it to its respective sample file, but it seems like you could spin up a few processes that read through the sequences.fastq.gz file in parallel and each strip out only what they are interested in, writing to their respective sample files (therefore eliminating any race conditions, as each barcode/sample will only be dealt with once and won't step on another per-sample file's toes).
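Not an actual q2-demux implementation; a minimal sketch of what the Pool.map approach described above might look like, where barcode_of() and the per-barcode output naming are hypothetical:

```python
import gzip
from functools import partial
from multiprocessing import Pool

def barcode_of(record):
    # Hypothetical helper: assumes the barcode is the last colon-separated
    # field of the FASTQ header line; the real lookup depends on the run format.
    return record[0].rstrip().rsplit(':', 1)[-1]

def write_one_sample(barcode, reads_path, out_dir):
    """Scan the whole reads file but write only the records matching this
    barcode, so each worker owns exactly one output file (no race conditions)."""
    with gzip.open(reads_path, 'rt') as reads, \
         gzip.open(f'{out_dir}/{barcode}.fastq.gz', 'wt') as out:
        while True:
            record = [reads.readline() for _ in range(4)]  # one FASTQ record
            if not record[0]:
                break
            if barcode_of(record) == barcode:
                out.writelines(record)

def demux_in_parallel(barcodes, reads_path, out_dir, processes=4):
    worker = partial(write_one_sample, reads_path=reads_path, out_dir=out_dir)
    with Pool(processes) as pool:
        pool.map(worker, barcodes)   # one barcode per task, run in parallel

if __name__ == '__main__':
    demux_in_parallel(['ACGTACGT', 'TGCATGCA'], 'sequences.fastq.gz', '.')
```

Note the trade-off raised in the comments above: every worker re-reads the full sequences.fastq.gz, so the work stays IO-bound and may not end up any faster in practice.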
References
multiprocessing.pool.Pool.map