multiscale / muscle3

The third major version of the MUltiScale Coupling Library and Environment
Apache License 2.0

Possible race conditions might still be occurring in 0.7.1 ... #271

Open DavidPCoster opened 9 months ago

DavidPCoster commented 9 months ago

I ran a UQ campaign that created SLURM jobs to run the various cases, where each case was a MUSCLE3 workflow. As a test, I chose to run each MUSCLE3 workflow on only one core, rerunning the cases that failed. I ended up having to run 4554 SLURM jobs to get the desired 1296 successful cases. The distribution of the number of attempts required for the 1296 cases is given below.

| Number of cases | Runs required |
| --------------: | ------------: |
|             247 |             1 |
|             340 |             2 |
|             179 |             3 |
|             133 |             4 |
|             110 |             5 |
|             108 |             6 |
|             105 |             7 |
|              41 |             8 |
|              33 |             9 |

(i.e. 33 cases required 9 attempts of running before the last attempt succeeded)

Is there a requirement that MUSCLE3 run on more than one core? Or are there still some issues with race conditions or something similar?

In case it is important, the workflow had 52 components and 133 connections.

The 4554 log files together are about 9.3 GB in size (800 million lines in total) and I'm not sure if GitHub will allow me to attach them ...

LourensVeen commented 9 months ago

Whaat? Okay, I'm going to build that stress test this weekend and test the heck out of this and stomp it out once and for all.

The only thing I can think of, other than a bug in MUSCLE3, is if many SLURM jobs run at the same time on the same node and they're all making lots of connections. At that point, the node could run out of available TCP ports, causing MUSCLE3 to fail to connect and crash. Each component opens a listening socket, and it uses a socket for each peer and for its connection to the manager. If you had a 256-core node with one MUSCLE3 simulation on each core and 52 components each, then a back-of-the-envelope estimate suggests that this could happen. The solution to that would be to add support for fast interconnects, which I think have different addressing systems.
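
For what it's worth, a rough version of that back-of-the-envelope estimate, using the numbers from this thread (the per-socket accounting below is an assumption for illustration, not a measurement):

```python
# Rough estimate of TCP sockets on one node, assuming a 256-core node running
# one single-core MUSCLE3 simulation per core, each with 52 components and
# 133 conduits (the numbers from this workflow).
simulations_per_node = 256
components_per_sim = 52    # one listening socket and one manager connection each
conduits_per_sim = 133     # roughly one peer connection each

listening_sockets = simulations_per_node * components_per_sim
manager_connections = simulations_per_node * components_per_sim
peer_connections = simulations_per_node * conduits_per_sim

total = listening_sockets + manager_connections + peer_connections
print(total)  # 256 * (52 + 52 + 133) = 60672

# For comparison, the default Linux ephemeral port range (32768-60999) has
# only about 28k ports, so exhausting ports or file descriptors is plausible.
```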

Perhaps you could attach a single failed log? There's probably a lot of repetition among those failed runs.

DavidPCoster commented 9 months ago

For two cores per SLURM job:

Running the ETS workflow on 2 cores with PCE=5 (1296 cases) and rerunning the cases that failed, it took 2859 runs to get 1296 successful cases:

| Number of cases | Runs required |
| --------------: | ------------: |
|             558 |             1 |
|             357 |             2 |
|             137 |             3 |
|             109 |             4 |
|              75 |             5 |
|              55 |             6 |
|               5 |             7 |

DavidPCoster commented 9 months ago

Here is one of the failed cases (for cores==2)

muscle3_manager.log

LourensVeen commented 9 months ago

Hm, doesn't seem like it's opened a lot of ports yet when it crashes, so that seems unlikely to be the problem. Too bad the backtrace doesn't show the full function names, but it's clear that it's crashing inside libmuscle. QCG-PJ timing out when trying to clean up its agents is also somewhat ominous.

I'm going to go set up a large oversubscribed simulation on my laptop and see if I can reproduce this. If not then we'll have to see if we can get some better debug output.

DavidPCoster commented 9 months ago

For reference purposes: when running the MUSCLE3 workflow with 4 cores, all of the cases ran successfully!

LourensVeen commented 9 months ago

As a preliminary result, I can go up to around 500 instances with 500 conduits on an 8-core / 16-thread laptop before things start breaking. At that point it seems to be running out of file descriptors for sockets and output files (there are some 900-1000 established TCP connections by then), and QCG-PJ fails to start more processes.

If I raise the ulimit, the network connections seem to end up being dropped, maybe by the kernel in order to protect the system from the overload? This leads to the manager's server threads crashing. One of the MPI models also crashed in its MPI communications, probably for the same reason. I'm not seeing anything resembling the backtrace you're getting yet, but if it crashed while connecting to the manager then some kind of network connection failure could be the culprit. Too bad Fortran prints such useless back traces.

The scenario I have here is a bit different from what you're doing, because I'm running the reaction-diffusion example many times in parallel, meaning about half my processes are actually trying to run at the same time. I'm going to try to set up a version where most of the components are idle most of the time, see if that makes a difference.

If I recall correctly, you have two active instances for most of the simulation, right? And they're both running with a single process and a single thread?

DavidPCoster commented 9 months ago

All of the processes in this version of the workflow are single-threaded (I can switch out the anomalous transport actor to use either an OpenMP actor, an MPI actor, or an actor that requires both OpenMP and MPI).

Below is a pretty representative pattern for the activation of the actors:

[Image: timeline_zoom — actor activation timeline]

LourensVeen commented 9 months ago

I've done some more experiments and had a look at the TCP statistics, and I'm indeed seeing TCP errors increase at the point where things start crashing. Adding some sleep statements to the micromodels helps, probably because it reduces CPU load: with the sleeps in place I see no TCP errors and the runs succeed.
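
A minimal sketch of the kind of sleep I mean, using the libmuscle Python API (port names, operators and the sleep duration here are just for illustration, this isn't the actual stress-test code):

```python
import time

from libmuscle import Instance, Message
from ymmsl import Operator


def micromodel() -> None:
    # Hypothetical micromodel with one input and one output port.
    instance = Instance({
            Operator.F_INIT: ['state_in'],
            Operator.O_F: ['state_out']})

    while instance.reuse_instance():
        msg = instance.receive('state_in')

        # ... do a (trivial) amount of work on msg.data here ...

        # Yield the CPU briefly, so that many oversubscribed instances on one
        # node don't saturate the cores and starve the TCP stack.
        time.sleep(0.001)

        instance.send('state_out', Message(msg.timestamp, data=msg.data))


if __name__ == '__main__':
    micromodel()
```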

I'm running 500 instances with 600 processes/threads and 500 conduits on an 8C/16T CPU, which is about the same number of processes per core as your original 52/1, but at a much higher load. It seems from the graph above that you have somewhere between 2 and 4 active threads, and you have more conduits per instance. It's possible that the CPU load causes TCP error rates to increase and connections to drop. MUSCLE3 doesn't even try to deal with network problems at all, so it just crashes when it gets in that situation.

Assuming that this is the problem (and I'm still not entirely sure, 4 active threads on a single core doesn't sound very problematic), there are a few things that could be done to improve the situation:

  1. Run on more cores to reduce loads and avoid TCP problems
  2. Check for errors in TCP connections, and reconnect and retry the operation if there is an issue
  3. Implement in-memory communication, so that we can do the local communication without TCP

Option 1 (more cores) looks like it works, but may be inefficient. In the graph above, periods in which equilibrium is running by itself alternate with periods where we're doing everything else at the same time, and the extra cores are wasted while equilibrium is running.

Option 2 (check and retry) may improve robustness (and may finally make the CI fully reliable :smile:), but reconnecting also costs CPU time, so we may trade crashes for a quagmire of connecting and reconnecting and very slow progress. It also doesn't solve the previously mentioned running-out-of-ports problem. Nevertheless, being more robust is usually a good idea, and we can always log a warning.
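
To illustrate what option 2 could look like at the socket level (a sketch of the idea, not how libmuscle's transport code is actually structured): retry the connect with a bit of backoff and log a warning, instead of crashing on the first error.

```python
import logging
import socket
import time

_logger = logging.getLogger(__name__)


def connect_with_retry(host: str, port: int,
                       attempts: int = 5,
                       base_delay: float = 0.5) -> socket.socket:
    """Try to connect, backing off and retrying on transient TCP errors."""
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=10.0)
        except OSError as exc:
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            _logger.warning(
                    'Connection to %s:%d failed (%s), retrying in %.1fs',
                    host, port, exc, delay)
            time.sleep(delay)
    raise RuntimeError('unreachable')
```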

Option 3 (in-memory communication) would reduce the number of TCP connections, but not necessarily CPU load or performance, unless it is non-blocking (and that requires changes to the resource allocator as well). Also, if we have a lot of cross-node conduits, then we'd still need lots of TCP connections. For those, RDMA support in some form may help, if a suitable interconnect is available.

About the other issue I mentioned, running out of file descriptors: large HPC nodes probably have a pretty high limit, so that may not be a problem there. Sockets, pipes and mmap segments all take up file descriptors anyway, and we need to suspend processes and threads while waiting for data to come in so that we free up the CPU, so we will always be using a couple in each process. Possibly we could programmatically increase the soft limit if needed and permitted.
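
For the soft limit, something along these lines should work (a sketch; raising the soft limit up to the hard limit needs no extra privileges, but whether the hard limit is high enough depends on the system configuration):

```python
import resource


def raise_nofile_soft_limit() -> int:
    """Raise the open-file soft limit to the hard limit, if it isn't already."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```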

DavidPCoster commented 9 months ago

This is the data for the success rates as a function of the number of cores used for the ETS-PAF workflow ...

[Image: Screenshot 2023-10-07 at 22 00 52 — success rate vs. number of cores for the ETS-PAF workflow]