qiime2 / q2-demux


Possible demux enhancement (RFC?) #51

Closed jakereps closed 3 years ago

jakereps commented 7 years ago

**Improvement Description**

Python has `multiprocessing.pool.Pool.map`, which lets you "spin up" a pool of processes and map a list of arguments, one by one, onto a function.

Is this something that could be utilized by demux w.r.t. demultiplexing one barcode at a time, but in parallel?

**Current Behavior**

Currently it goes through every sequence in the main read-set file and writes each one to its respective sample file. It seems like you could instead spin up a few processes that read through the sequences.fastq.gz file in parallel and strip out only the sequences they are interested in, writing them to their respective sample files (thereby eliminating any race conditions, since each barcode/sample is handled exactly once and never steps on another per-sample file). A rough sketch of what I'm picturing follows the references below.

**Comments**

  1. This could allow buffering of the reading from the main file, preventing it from having to do so many writes to disk, and potentially greatly improving the runtime speeds.
  2. Mainly just a question as a possible enhancement. Totally unsure if it would work the way I'm picturing it, or if there were any better way to "thread" with python and parallelize this process.
  3. One example reason for this enhancement would be that a MiSeq run (~15mil seqs) takes about an hour to demultiplex (the longest step in the process from getting raw sequences through to taxonomy classification).

**References**

`multiprocessing.pool.Pool.map`
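To make the idea concrete, here is a minimal sketch of per-barcode `Pool.map` demultiplexing. It assumes an EMP-style layout with a separate barcode-reads file; the file names (`sequences.fastq.gz`, `barcodes.fastq.gz`), the barcode list, and `demux_one_barcode` are placeholders for illustration, not q2-demux's actual code:

```python
import gzip
import itertools
from multiprocessing import Pool


def demux_one_barcode(args):
    # Each worker scans the whole input pair but writes only its own
    # per-sample file, so every output file has exactly one writer.
    barcode, seqs_path, barcodes_path, out_path = args
    kept = 0
    with gzip.open(seqs_path, "rt") as seqs, \
         gzip.open(barcodes_path, "rt") as bcs, \
         gzip.open(out_path, "wt") as out:
        while True:
            seq_rec = list(itertools.islice(seqs, 4))  # one 4-line FASTQ record
            bc_rec = list(itertools.islice(bcs, 4))
            if not seq_rec:
                break
            if bc_rec[1].strip() == barcode:           # barcode read is line 2 of the record
                out.writelines(seq_rec)
                kept += 1
    return barcode, kept


if __name__ == "__main__":
    # Hypothetical barcode list and file names, just for illustration.
    barcodes = ["ACGTACGT", "TGCATGCA"]
    jobs = [(bc, "sequences.fastq.gz", "barcodes.fastq.gz", bc + ".fastq.gz")
            for bc in barcodes]
    with Pool(processes=4) as pool:
        for bc, n in pool.map(demux_one_barcode, jobs):
            print(bc, n)
```

Note that each worker rescans the full input, so total read IO scales with the number of barcodes; any win would have to come from each output file having a single dedicated writer.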

ebolyen commented 7 years ago

I actually tested something like this early on when I was fixing something in demux. What I found was that the method is already IO-bound, so adding more threads/processes doesn't really change anything; the drive can only work so quickly.

jakereps commented 7 years ago

Would the ability to introduce buffering solve that IO issue, though? Instead of writing to disk on every single pass of the loop, each process could read in 5, 10, or 15k sequences (or even a randomly generated 10-20k+ to stagger the IO demands further) before having to write. The biggest stranglehold right now seems to be writing to disk on every single loop, so it's making 15-million-plus individual 4-line write calls. Something like the sketch below is what I mean.
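A rough illustration of the batching idea (the name `write_batched` and the batch size are made up for the example; this is not q2-demux's code):

```python
import gzip


def write_batched(records, out_path, batch=10_000):
    """Write 4-line FASTQ records (lists of str), flushing every `batch` records."""
    pending = []
    with gzip.open(out_path, "wt") as out:
        for rec in records:
            pending.extend(rec)
            if len(pending) >= 4 * batch:
                out.write("".join(pending))  # one big write instead of many small ones
                pending.clear()
        if pending:                          # write whatever is left at the end
            out.write("".join(pending))
```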

jakereps commented 7 years ago

It could actually even reduce the number of open file handles to maintain, because you'd only have the n thread/process-specific files open at any one time, instead of a random assortment determined by which barcode sequence is hit first in the master file.

ebolyen commented 7 years ago

Python already does quite a bit of buffering behind the scenes, so it isn't actually writing 4 lines at a time; it's writing ~8192 bytes at a time. It is possible that we could buffer smarter because we know the access pattern, but that sounds hard.
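For reference, that default can be checked directly; the 1 MiB `buffering=` value below is just an illustration of the knob that exists, not something q2-demux sets:

```python
import io

print(io.DEFAULT_BUFFER_SIZE)  # typically 8192 bytes

# Writes accumulate in the in-memory buffer and only hit the disk once the
# buffer fills (or on flush/close); a larger buffer means fewer, bigger writes.
with open("example.fastq", "wb", buffering=1024 * 1024) as fh:
    fh.write(b"@seq1\nACGT\n+\nIIII\n")
```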

jakereps commented 7 years ago

Ah, I forgot about that: the writes actually go to memory until they hit the buffer size. I ran into an issue in the tests because of that while working on the summarize viz for demux.

Well, out of pure curiosity, I might play around with this if that's cool with you guys? It just doesn't feel right that the (relatively) simplest step of the processing pipeline, sorting the sequences into different files, takes the longest.

ebolyen commented 7 years ago

Of course! Let us know what you find!

jakereps commented 3 years ago

Spoiler alert, no easy enhancement was found. Will close to clean up the open issue list.