Closed: tiantianlili closed this issue 11 months ago
Don't do 30 if you're running locally. 1 or 2 was a good number on my laptop: it had four cores and the fan was kind of weak, so the laptop was getting too hot when I had four bowtie2 jobs running on it.
If you're running locally, it's counterproductive to set maxForks higher than the number of cores on your CPU. The pipeline might start lots of processes, but they won't advance any faster, since the kernel will split CPU time between them.
Here's what I would do:
- Go into one of the work directories, like work/ed/29741d, and find the bowtie2 logs to check that it's running.
- Run top - it shows you the CPU load - so you can see how well bowtie2 is using the resources.
- Get a sense of how long each bowtie2 job needs to complete, and whether it's working on a big, medium, or relatively small input file.

The other steps are not computationally demanding, it's just bowtie2, but 400 files might be a lot of work without a server. How big are the files?
Thank you very much for your reply! The workstation I am using is configured as follows, and there seem to be a lot of CPUs available to me:
- Processor: Dual Intel Xeon Gold 6230R (2.1 GHz, 4.0 GHz Turbo, 26C, 10.4 GT/s 2UPI, 35.75 MB cache, HT, 150 W)
- Memory: 512 GB (8x64 GB) DDR4-2933 RDIMM ECC
- Graphics: 2x Dell RTX 3090 24 GB
My total data is about 4.3 TB.
I ran the pipeline last night with maxForks = 5 and it seems to be going pretty well. I wonder if I can speed up the pipeline further. I checked the directory from the screenshot, but didn't find any bowtie2 logs.
I uploaded a file starting with trace: trace-20231210-55261399.txt
Nice, you have 26 cores! I think bowtie2 has the capacity to use multiple cores per job, but I am not sure how many cores it uses when run by Nextflow. You have three pieces there:
If you override the bowtie2 command to use a single core and set maxForks = 15, it could be quite a good config. Or just bump maxForks to 10 and don't worry about the details; that should be pretty good as well.
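As a rough sketch, that could look something like this in nextflow.config. The `withName` pattern is a guess, not the pipeline's actual process name, and pinning `cpus` only helps if the pipeline passes `task.cpus` to bowtie2's `-p`/`--threads` option:

```groovy
// nextflow.config - a sketch of the suggested local config, not the pipeline's shipped defaults
process {
    maxForks = 15                  // run at most 15 tasks of each process at the same time

    // 'BOWTIE2.*' is a hypothetical process-name pattern; replace it with the real process name
    withName: 'BOWTIE2.*' {
        cpus = 1                   // reserve a single core per bowtie2 task
    }
}
```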
To see the logs, do ls -a instead of ls: the stdout, the stderr, and a .sh file reproducing what was actually run all start with a dot in Nextflow (.command.out, .command.err, .command.sh).
Finally, I see some of your input files are much bigger than others. This is fine, but what you'll see in the pipeline is that some jobs will be fast and some will take a while to run. If you sort the input files by size so the biggest ones are processed first, that order of computation will smooth things out at the end of the run, but it's a small tweak.
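A minimal sketch of that idea with a Nextflow channel, assuming a made-up input glob and channel name (the pipeline's real input handling will differ):

```groovy
// Sketch only: emit input files largest-first so the longest bowtie2 jobs start early.
// 'reads/*.fastq.gz' and 'reads_ch' are hypothetical, not taken from the pipeline.
Channel
    .fromPath('reads/*.fastq.gz')
    .toSortedList { a, b -> b.size() <=> a.size() }   // sort by file size, biggest first
    .flatten()                                        // emit the files one by one again
    .set { reads_ch }
```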
Thank you very much for your answer and help! The pipeline is running smoothly, and I hope there will be good results. Thank you again for developing this wonderful pipeline!
How do I set the parameters to run faster locally? Is increasing maxForks effective? Is maxForks the number of threads used? For example: `process { maxForks = 30 }`. I was running over 400 metagenome samples locally, and I noticed there didn't seem to be a difference between maxForks = 30 and maxForks = 3; according to the terminal output, not a single sample was completed in one day.