jmd-dk / concept

COsmological N-body CodE in PyThon
GNU General Public License v3.0

Small simulations take a long time on my cluster #11

Closed andrea-begnoni closed 4 weeks ago

andrea-begnoni commented 2 months ago

Hi,

I am currently doing some tests with CONCEPT on a cluster, and I've found that when I reach a resolution of $ N\approx L^3 $ the code becomes very slow and the load imbalance reaches very high values (up to 4000%). Moreover, the time to complete a run is much higher than what I would expect from Figure 9 of the CONCEPT 1.0 paper (which presents the runtime as a function of resolution).

I present here some tests that I performed varying the resolution and the number of CPUs (N = 128³ was fixed):

Is this behaviour expected? All these tests were performed on a cluster whose job system is condor, so I could not use the CONCEPT job system. I therefore entered the job interactively and launched CONCEPT with the --local option.

Thanks in advance for any help anyone could provide.

jmd-dk commented 2 months ago

Hi @andrea-begnoni,

This looks curious. You are indeed running with very many CPU cores compared to the workload. For N = 128³, a single core or a couple of CPU cores should suffice. What simulation times do you get for -n 1 and -n 2?

For high-resolution simulations, CONCEPT does have a problem with load imbalance, as shown in the CONCEPT 1.0 paper. However, 4000% is much more extreme than expected (especially since none of your simulations are very high-resolution). It probably helps to reduce the number of CPU cores (-n) and/or increase the simulation size (N, or what I typically call _size in parameter files).
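For reference, a minimal parameter file along these lines might look like the sketch below. The specific values are illustrative only; _size is the conventional helper variable for the cube root of the particle number, and initial_conditions and boxsize are standard CONCEPT parameters (units such as Mpc are available in the parameter-file namespace):

```python
# Sketch of a minimal CONCEPT parameter file (illustrative values)
_size = 128  # cube root of the particle number N

# A single matter particle component with N = _size**3 particles
initial_conditions = {
    'species': 'matter',
    'N'      : _size**3,
}

# Box size L; chosen here so that N ~ L**3 as in the issue above
boxsize = 128*Mpc
```

Increasing _size (and hence N) while keeping -n moderate should give each process enough work to amortise the communication overhead.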

It might also be the case that launching using --local does something bad on your cluster. When doing this, do you make sure to allocate the same number of cores for the job as specified in -n? Does the problem happen even when using just a single node?

When installing CONCEPT on clusters, it's crucial to point the installation towards an MPI installation native to the cluster, rather than letting it install its own MPI. See this. If you did not do this, you should try installing CONCEPT on your cluster from scratch using this method.

Can I see the exact parameter file you are using?

I do not have access to a cluster running condor, so I cannot test this directly myself. Let me know if you find out something from the above.

andrea-begnoni commented 4 weeks ago

Hi @jmd-dk,

Thanks a lot for your reply. In the end, my issue was caused neither by CONCEPT itself nor by its interaction with the cluster. I was providing bad initial conditions, with too much clustering too early, so the time spent on the short-range force exploded. My bad. Thank you again for your answer.

Best wishes