google / deepconsensus

DeepConsensus uses gap-aware sequence transformers to correct errors in Pacific Biosciences (PacBio) Circular Consensus Sequencing (CCS) data.
BSD 3-Clause "New" or "Revised" License

general performance question and --cpu flag is not working #18

Closed MartinPippel closed 2 years ago

MartinPippel commented 2 years ago

Hi Developers,

First, thanks for the great tool. It really improves the reads, and the assemblies as well.

However, the compute requirements are still challenging. We have nodes with Intel(R) Xeon(R) CPU E5-2680 v3 processors and 24 cores. By default, the CPU version requests 23 cores, but it never actually reaches 23 active threads. On average, the jobs run with 12-16 threads in total because many threads are stalling (waiting for I/O?). Do you have any suggestions for improving the performance? Copying the input data to a local SSD or changing batch_zmws or batch_size did not change anything. Do you think copying the data into /dev/shm could help?

I also tried to change the number of CPUs with the --cpus flag, but that does not work. DeepConsensus always uses (number of cores - 1) threads, even though the log reports the value given via --cpus.

I tried the GPU version as well. Is there an option to restrict the number of GPUs to use? We have nodes with 2 GPUs, but sometimes other users are using one of them. Any advice would be highly appreciated.

Thanks, Martin
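
On the GPU question above, one common approach that does not depend on DeepConsensus is the CUDA_VISIBLE_DEVICES environment variable, which CUDA applications (TensorFlow included) read to decide which GPUs they can see. The following is a minimal sketch, not a DeepConsensus feature; env_with_gpus and run_on_gpus are hypothetical helper names introduced only for illustration.

```python
import os
import subprocess


def env_with_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_on_gpus(command, gpu_ids):
    """Run a command (list of arguments) so that it only sees the given GPUs."""
    return subprocess.run(command, env=env_with_gpus(gpu_ids), check=True)
```

For example, run_on_gpus(command, [1]) would start the given command with only the second GPU visible, leaving the first one to other users. The same effect is available from the shell by setting CUDA_VISIBLE_DEVICES=1 in front of the command.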

AhmedArslan commented 2 years ago

Regarding the --cpus use, I have the same observation.

MariaNattestad commented 2 years ago

Thanks for the observation, and sorry this isn't working as expected. The explanation is that --cpus only affects the number of processes used by the preprocessing step, while the TensorFlow step that runs the model just takes all the CPUs available. You can see this if you set --batch_zmws very high and count processes while it's still preprocessing the first batch, which should obey the --cpus parameter.

We will look into how to modify the number of CPUs used by TensorFlow to make this more consistent, and assuming we can find a solution, we will push this out in the next release.
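
In the meantime, one possible workaround that does not depend on DeepConsensus is to restrict the CPU affinity of the process before TensorFlow starts. The sketch below is Linux-only and relies on Python's os.sched_getaffinity and os.sched_setaffinity; limit_to_cores is a hypothetical helper name, not part of DeepConsensus. It limits which cores the process and the threads it starts afterwards may run on. Whether TensorFlow also creates fewer threads as a result is an assumption that has not been verified here.

```python
import os


def limit_to_cores(n_cores):
    """Pin the current process to at most n_cores of its currently allowed cores.

    Linux-only. Threads and child processes started afterwards inherit the mask.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    # Cores this process is currently allowed to run on (pid 0 = this process).
    allowed = sorted(os.sched_getaffinity(0))
    chosen = set(allowed[:n_cores])
    os.sched_setaffinity(0, chosen)
    return chosen
```

The call has to happen before the model is loaded so that the worker threads inherit the restricted mask. From the shell, the taskset utility (for example taskset -c 0-7 followed by the command) applies the same kind of restriction without any code changes.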

kishwarshafin commented 2 years ago

Hi @MartinPippel ,

Can you please try this with v0.3 to see if the issue is still there?

pichuan commented 2 years ago

Hi @MartinPippel and @AhmedArslan, if you still have the issue with the latest version (v0.3.1), please feel free to reach out again! I'll close this issue now.