nanoporetech / pod5-file-format

Pod5: a high performance file format for nanopore reads.
https://pod5-file-format.readthedocs.io/

Performance issue converting fast5 -> pod5 with multiple threads #146

arturotorreso commented 2 weeks ago

I am running pod5 convert fast5 on a sample with about 5000 fast5 files (from the same samples, 4000 reads each), writing to a single pod5 file per sample.

So I made subsets of the reads and compared the performance.

I thought this could be a bottleneck from writing to the same file, but if I run two samples in the background simultaneously (thus writing to two different pod5 files) I run into the same situation of decreasing performance (similar to when using multiple threads on a large number of files), and the jobs keep getting sent to state D. My system should have enough memory to handle the job, though.
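For reference, state D is uninterruptible sleep, which on Linux almost always means the process is blocked on disk I/O rather than starved of memory. A quick way to spot such processes:

```bash
# List processes currently in uninterruptible sleep (state D),
# which usually indicates they are blocked on disk I/O.
ps -eo state,pid,cmd | awk '$1 == "D"'
```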

For now I'm thinking of processing the files in batches and merging the final pod5 files, but I was curious to know whether this is a known issue and what recommendations you have for improving performance when running multiple samples at the same time or with multiple threads.
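In case it helps, the batch-then-merge workaround I have in mind would look roughly like this. It's only a sketch, assuming GNU split, filenames without spaces, and the pod5 merge subcommand; 300 files per batch is the size that worked best in my subset tests:

```bash
# Convert in batches of ~300 fast5 files, then merge everything into one pod5.
mkdir -p pod5_batches
ls input_folder/*.fast5 | split -l 300 - batch_
n=0
for b in batch_*; do
    n=$((n + 1))
    pod5 convert fast5 -f -t 4 -o "pod5_batches/part_${n}.pod5" $(cat "$b")
done
pod5 merge -o output.pod5 pod5_batches/*.pod5
```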

0x55555555 commented 2 weeks ago

Hi @arturotorreso,

Just to confirm: you tested writing two files simultaneously, and confirmed the slowdown wasn't a bottleneck from writing to a single file?

But you did see increased performance when running the conversion on smaller batches of files (300 being optimal)?

Can you confirm that the only difference between the two tests where performance differed was the number of files input to the conversion script?

Can you also let me know the approximate length of the reads in the files?

Can you provide an example command line snippet you are using to trigger the conversion?

Thanks,

arturotorreso commented 2 weeks ago

Thank you for your quick response!

> Just to confirm: you tested writing two files simultaneously, and confirmed the slowdown wasn't a bottleneck from writing to a single file?

Yes, there was also decreased performance when running multiple samples simultaneously and writing to separate files, and it was also dependent on the number of input files in each sample. If I ran both samples with 300 input fast5 files each, the drop wasn't too bad (2000-3000 reads/s each with -t 4, versus 7000 reads/s when run separately). But if each sample was run with 5000 files, performance dropped to 40-50 reads/s. This does point to a memory issue, but in theory I should have enough CPUs and memory to handle it.

> But you did see increased performance when running the conversion on smaller batches of files (300 being optimal)?

Yes

> Can you confirm that the only difference between the two tests where performance differed was the number of files input to the conversion script?

Yes

> Can you also let me know the approximate length of the reads in the files?

We are working mostly with cell-free DNA (~200 bp), but we also find larger DNA fragments (>10 kb). The read length distribution sits around 216 bp (157-776 bp), but the range goes up to 37 kb.

> Can you provide an example command line snippet you are using to trigger the conversion?

I'm running it straight from the command line:

```bash
pod5 convert fast5 -f -t 4 -o output.pod5 input_folder/
```

And for subsets:

```bash
pod5 convert fast5 -f -t 4 -o output.pod5 $(ls input_folder/*.fast5 | head -n200)
```

Let me know if you need anything else!

arturotorreso commented 2 weeks ago

Similarly, when I run the command `ls *fast5 | xargs -n200 pod5 convert fast5 -f -t 20 -o pod5_out/$RANDOM.pod5`, I get a decrease in performance in the second batch, as shown in the screenshot. Could it be an issue of Python multiprocessing not cleaning up after finishing?

[Screenshot 2024-10-08: conversion rate dropping during the second batch]
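As a side note on that command: `$RANDOM` is expanded once by the shell before `xargs` starts, so all batches end up targeting the same output path, and `-f` lets each batch overwrite the previous one. A per-batch loop with unique names, as a rough sketch assuming filenames without spaces, avoids that:

```bash
# $RANDOM above is expanded only once, so every xargs batch writes to the
# same pod5 file; give each 200-file batch its own output name instead.
i=0
ls *.fast5 | xargs -n200 echo | while read -r batch; do
    i=$((i + 1))
    # word-splitting of $batch is intentional: one fast5 path per word
    pod5 convert fast5 -f -t 20 -o "pod5_out/batch_${i}.pod5" $batch
done
```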

0x55555555 commented 1 week ago

Do you see the decrease in performance if you run the commands sequentially by hand, or with a small gap between them?

If you restart the terminal session and re-run the experiment, is it faster? What about waiting for a period after the run?

I'll attempt to reproduce your results here.

arturotorreso commented 1 day ago

If I run them manually, there's no decrease in performance. With gaps, yes: I put a sleep of 1 minute between runs and still saw the performance decrease.

I don't need to restart the terminal; as soon as I kill the job and restart it, it goes faster, until performance eventually decreases again.

Right now I'm running each file separately in a loop with -t 1 and merging afterwards, and it performs well.
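Concretely, it's along these lines (a sketch: one single-threaded conversion per input file, then a single pod5 merge at the end):

```bash
# One single-threaded conversion per fast5 file, then merge once at the end.
mkdir -p pod5_tmp
for f in input_folder/*.fast5; do
    pod5 convert fast5 -f -t 1 -o "pod5_tmp/$(basename "$f" .fast5).pod5" "$f"
done
pod5 merge -o merged.pod5 pod5_tmp/*.pod5
```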