openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Odd ucx memory allocation failure on some nodes but not others in a cluster #8702

Open reid-a opened 2 years ago

reid-a commented 2 years ago

Describe the bug

I run a Debian 11 cluster using the stock Debian 11 OpenMPI and UCX, and am seeing some very odd behavior. On some but not all of the nodes, user jobs are reporting an error:

[1668452383.309006] [<hostname>:42159:0]          mpool.c:192  UCX  ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error
[1668452383.312273] [<hostname>:42159:0]          ib_md.c:348  UCX  ERROR ibv_reg_mr(address=0x7feb52200000, length=20971520, access=0xf) failed: Cannot allocate memory

(This is repeated very rapidly.)

On nodes where this is happening, I also get anomalous output from `ucx_info -d`: it gives seemingly normal output (the same as on the unaffected nodes), but then in the `Transport: ud_verbs` section it runs into what appears to be the same error. A fragment of the output is:

#      Transport: ud_verbs
#         Device: mlx4_0:1
#  System device: 0000:02:00.0 (0)
[1668463150.091321] [<hostname>:2039474:0]          ib_md.c:348  UCX  ERROR ibv_reg_mr(address=0x7fa299e10000, length=544768, access=0xf) failed: Cannot allocate memory
[1668463150.091422] [<hostname>:2039474:0]          mpool.c:192  UCX  ERROR Failed to allocate memory pool (name=ud_recv_skb) chunk: Input/output error

... after which the errors repeat rapidly until I ctrl-c out of the ucx_info command.

I have checked that the locked-memory limit (`ulimit -l`) is set correctly; it's unlimited across the full system. Many of the Google hits on this error point to that setting, but that much at least seems to be right.
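For the record, the checks I ran for that look roughly like this; the limits.d paths and the slurmd unit name are the usual Debian/SLURM ones and may differ on other setups:

# Current shell, and a SLURM-launched step:
ulimit -l
srun bash -c 'ulimit -l'
# System-wide PAM limits:
grep memlock /etc/security/limits.conf /etc/security/limits.d/*.conf 2>/dev/null
# If slurmd runs under systemd, its unit can override the PAM limits:
systemctl show slurmd | grep -i limitmemlock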

This is happening apparently randomly across a subset of nodes in a cluster, where all of the nodes are believed to be identically configured. They vary in their IB HCA models and in the details of the systems, but there are both good and bad nodes within each sub-unit of the cluster, where the HCA model and machine type is identical within the sub-unit.

On those nodes, it seems that any newly-dispatched jobs that use MPI do the same thing.

I'm totally willing to believe that I've fumbled the config, but I am utterly mystified as to why some systems do this and some do not. Is it possible to "use up" some UCX memory resource, so that once a misbehaving application has done this, the machine is in a state where all future tasks, including the ucx_info -d task, are doomed to fail?

I have not had a chance to test whether rebooting a node clears this status -- it's a production cluster, and some jobs are still working, so I'm reluctant to blow them up, but I have several problem nodes in "drain" state and should have that opportunity soon.

Steps to Reproduce


### Setup and versions
- OS version: Debian 11, x86_64 (CPUs are all 2nd-gen Epyc, fwiw).
   - `uname -a` -> `Linux <hostname> 5.10.0-19-amd64 #1 SMP Debian 5.10.149-1 (2022-10-17) x86_64 GNU/Linux`

- For RDMA/IB/RoCE related issues:

dpkg -l "*verb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                     Version  Architecture  Description
+++==========================================================================
ii  ibverbs-providers:amd64  33.2-1   amd64         User space provider drivers for libibverbs
ii  ibverbs-utils            33.2-1   amd64         Examples for the libibverbs library
ii  libibverbs-dev:amd64     33.2-1   amd64         Development files for the libibverbs library
ii  libibverbs1:amd64        33.2-1   amd64         Library for direct userspace use of RDMA (InfiniBand/iWARP)



### Additional information (depending on the issue)
- OpenMPI version is 4.1.0 (again, using the Debian-packaged one)
- Output of `ucx_info -d` to show transports and devices recognized by UCX
Attached from a working system -- as I mentioned, the problem systems can't complete this operation.
[ucxinfo-d.txt](https://github.com/openucx/ucx/files/10007492/ucxinfo-d.txt)

No config data, as I didn't build it; it's whatever the Debian folks do. Happy to find that if it's useful.

Edit: Fixed unclosed backtick.
gleon99 commented 2 years ago

Can you please post the output of:

ulimit -S -l
ulimit -H -l
sysctl vm.max_map_count

What is the value of UCX_IB_RCACHE_MAX_REGIONS? Anything suspicious in dmesg / syslog?

reid-a commented 2 years ago

Thanks for your reply!

The ulimits are the same for users, root, and SLURM-dispatched sessions:

# ulimit -S -l
unlimited
# ulimit -H -l
unlimited

We did have this issue earlier, but the ucx errors seem to be a separate problem.

# sysctl vm.max_map_count
vm.max_map_count = 65530

This number is the same on both affected and unaffected nodes, although I can imagine it might be a limit that jobs are crashing into?
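If that's the case, counting the VMAs of a suspect job's ranks should show how close it gets to that limit; roughly (the user and executable names here are placeholders):

for pid in $(pgrep -u <username> <mpi-executable>); do
    printf '%s: %s maps\n' "$pid" "$(wc -l < /proc/$pid/maps)"
done
sysctl vm.max_map_count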

I don't seem to have the UCX_IB_RCACHE_MAX_REGIONS variable, it's not in the environment and it's not in the output of ucx_info -c.

# ucx_info -c | grep RCACHE
UCX_RCACHE_CHECK_PFN=0
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16

There is nothing obvious in dmesg or syslog. One of the machines says it detected an "unsupported cable" for its mlx4 device at init time, but there are other affected machines that don't have this error. One of the nodes is apparently having some disk issues (there are a lot of ata5 reset messages), but again, other nodes without those messages nevertheless have the ucx issue.

An additional possibly-useful data point is that the cluster was recently upgraded from Debian 9 (by re-imaging the nodes, not an in-place upgrade), and similar jobs worked on the old cluster, so if it's a resource limit, then it would have to be one that changed from Debian 9 to 11 (or that I forgot to set when setting up the Debian 11 systems...). The memlock thing is in this category, of course, but as I say, that one seems to be fixed.

Also, following my initial post, I have subsequently verified that manually-launched MPI jobs on the affected nodes also exhibit the error, which suggests that it's not a SLURM issue. Weirdly (to me), manually-launched jobs exhibit the error even if you do mpirun --mca btl self <args>.

reid-a commented 2 years ago

Additional addendum: Trawling through the dmesg, I was reminded that I'd seen some of these associated with a particular application:

[ <timestamp>] hugetlbfs: <application-name> (<pid>): Using mlock ulimits for SHM_HUGETLB is deprecated.

Feels related. (Props to my colleague @tkphd for this find.)

Edit: Actually, this warning occurs on many of the systems, including systems not exhibiting the ucx memory error. Probably worth some attention at some point, but not the issue today.

reid-a commented 2 years ago

More info (apologies for the dribble of info, I am learning things....)

Low-level tests of the IB fabric seem OK -- I can do ibping between nodes, including affected nodes, and get good latency numbers. I can also run UCX_TLS=sm mpirun -np <n> <test-executable> locally on an affected node, and it runs to completion without errors. This is a useful workaround for jobs that only need to run on one node.

But UCX_TLS=ud mpirun -np <n> <test-executable> has all the errors again.

So apparently it's not that the fabric is broken, it's that the ud transport is broken?
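For anyone else hitting this, the batch-script form of the single-node workaround is roughly the following (the SBATCH values are just placeholders):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16
# Force the shared-memory transport so UCX never touches the ud/verbs path.
# Only valid for jobs confined to a single node.
export UCX_TLS=sm
mpirun -np "$SLURM_NTASKS" <test-executable>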

brminich commented 2 years ago

Probably the error is caused by many unexpected messages generated by the app, so that UCX has to allocate/register many receive buffers. It is still present with --mca btl self because OMPI uses pml ucx by default. UCX_IB_RCACHE_MAX_REGIONS was introduced in UCX v1.11, so you could limit the amount of registered memory if you upgrade UCX. As an alternative, can you please try increasing the maximum number of VM areas (e.g. with sudo sysctl -w vm.max_map_count=262144)?
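For example (the exact values here are only illustrative and would need tuning per system):

# With UCX >= 1.11: cap the number of regions kept in the registration cache.
export UCX_IB_RCACHE_MAX_REGIONS=262144
mpirun -np <n> <test-executable>

# Alternative: raise the VM-area limit and persist it across reboots.
sudo sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/90-max-map-count.conf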

reid-a commented 2 years ago

Thanks for the suggestion.

On an affected node, running sysctl -w vm.max_map_count=262144 did not immediately clear the condition; the ucx_info -d diagnostic still goes into an error loop.

Building a new MPI/UCX stack is do-able, but it might be hard to assess its effectiveness -- we still do not know how the machines are getting into this state. It makes sense that it's an interaction between an application and some kind of system limit. Scoping it down from there seems like the current best path forward.

reid-a commented 2 years ago

Additional news: At least one, and possibly two, of the affected machines seem to have cleared this condition on their own, even before they were "drained" by the queuing system. In the best-observed case, it coincided roughly with a job completing, which reinforces my emerging sense that this is an interaction between cluster jobs and some kind of misconfigured resource limit.

reid-a commented 2 years ago

Minor addition (again, thanks to those bearing with me thinking out loud on this issue page...): I have seen a node transition into the problem state after a few jobs were dispatched onto it. There is no corresponding dmesg traffic at all, and nothing obvious in the syslog, but there are some new constraints. The node is nowhere near out of total application memory (though I understand it may be out of certain kinds of memory, like buffers or something), and it's not over-subscribed.

Possibly we can isolate a different node and try dispatching similar jobs to these and maybe learn something.

yosefe commented 2 years ago

@reid-a can you please try running the ib_write_bw test with a 20MB buffer size, and use taskset to set CPU affinity to core 0 on both nodes? Something like taskset -c 0 ib_write_bw -s $((20*1024*1024)) [server-hostname]. It looks like something on the system is limiting memory registration.
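That is, run it as a server/client pair between two nodes, roughly:

# on the first node (server side, waits for a connection):
taskset -c 0 ib_write_bw -s $((20*1024*1024))
# on the second node (client side):
taskset -c 0 ib_write_bw -s $((20*1024*1024)) <server-hostname>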

reid-a commented 2 years ago

Thanks for the suggestion.

Even without the taskset, I am seeing some issues on this test.

Between unaffected nodes (tests where neither node in the test pair gives the memory error on ucx_info -d), the test works at the suggested 20*1024*1024 size. But on affected nodes (tests where both nodes in the test pair give the memory error on ucx_info -d), even for the default message size of 1, I get:

Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources

(This is using v5.96 of ib_write_bw, which is the Debian-packaged version from the perftest package in Debian 11.)

It's hard to imagine that there's a software configuration difference between nodes; they were imaged from the same base file set. I suspect the difference is that some resource has been stressed by a running application.

The ib_write_bw tool does not seem to have a "verbose" flag that might tell me more.

Edit to add: I also tried fi_pingpong as a possibly-useful low-level test. Between affected nodes, the server end of the test gets:

fi_mr_reg(): util/pingpong.c:1336, ret=-12 (Cannot allocate memory)

The fi_pingpong test works between unaffected nodes.

In the mixed case (one affected and one unaffected node in the test pair) the test also fails, but in both directions, the "cannot allocate memory" message occurs only on the affected node.
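For reference, the fi_pingpong invocation was roughly the following; the -p verbs provider selection is my assumption of how to force the InfiniBand path:

# server side (run first, waits for the client):
fi_pingpong -p verbs
# client side:
fi_pingpong -p verbs <server-hostname>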

ibping works between affected, unaffected, or mixed nodes.

shamisp commented 2 years ago

you may have some zombie processes running on impacted nodes and consuming resources.

reid-a commented 2 years ago

@shamisp Zombie processes are definitely possible, but I am not seeing any evidence -- checking htop shows no processes in Z state, and the process list looks sane. I think it's more likely to be a non-zombie process over-subscribing something?

Subsequent web searches have turned up a few other things: there's issue 4365, which suggests enabling the IOMMU and setting it to pass-through (with the iommu=pt boot-time argument), and a mailing list post which suggests upgrading to a newer UCX and using UCX_IB_RCACHE_MAX_REGIONS=inf. Both of these interventions will require an outage on a test node, which is do-able but not trivial.

Interestingly, it looks like both of those answers are from @yosefe.
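For the record, the IOMMU change would look something like this on these Epyc nodes (untested here so far; Intel systems would use intel_iommu=on instead):

# /etc/default/grub -- add the IOMMU pass-through arguments:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
sudo update-grub
sudo reboot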

yosefe commented 2 years ago

> Between unaffected nodes (tests where neither node in the test pair gives the memory error on ucx_info -d), the test works at the suggested 20*1024*1024 size. But on affected nodes (tests where both nodes in the test pair give the memory error on ucx_info -d), even for the default message size of 1, I get:
>
> Couldn't allocate MR
> failed to create mr
> Failed to create MR
> Couldn't create IB resources

@reid-a this shows the root cause is not in UCX, but rather in the system or driver configuration. In order to analyze it further, please submit an issue with NVIDIA networking support.

ke3vin commented 1 year ago

Did you ever find a resolution to this? We're having the same issue on our cluster. It appears most often when one researcher is running a number of VASP processes on the node. Once whatever mysterious limit is reached, anything else that wants to touch IB gets the errors described previously. Once the VASP jobs finish, all returns to normal. Unfortunately Nvidia won't help us because we purchased the IB cards from Dell, and Dell's response to the issue has been less than stellar.

reid-a commented 1 year ago

We never did get a proper resolution.

In our case, the problem traffic seemed to be high-core-count single-node jobs, so what we actually did was reconfigure the SLURM system so that single-node jobs don't use the UCX transport. After that, the problem has not recurred.
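One way to express that kind of steering (not necessarily exactly how we wired it into SLURM) is to exclude the UCX PML in the environment of single-node jobs:

# e.g. in a prolog or in the batch scripts of the single-node partition:
export OMPI_MCA_pml=^ucx          # exclude the UCX PML
export OMPI_MCA_btl=self,vader    # shared-memory + self BTLs for ob1
mpirun -np "$SLURM_NTASKS" <application>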

If it does, the next thing we're going to try is to update the UCX version (along with OpenMPI) to one that supports the UCX_IB_RCACHE_MAX_REGIONS environment variable, and use that to get around the firmware limitations. This is moderately high effort, as it involves building OpenMPI and dependencies from source, but from what I have read in relation to this issue, I think it's likely to work.