Closed: jhh67 closed this issue 5 months ago
What is the output of `ulimit -l`?
It's not a bug. The EFA device has a limit on the number of host pages you can register. If you are currently allocating your memory with regular pages (4k), using huge pages (2M on some platforms) reduces the number of pages and allows you to register more memory.
Thank you for your suggestions. We will try them and get back to you with the results.
We haven't had any luck registering more than 95GB of memory using hugepages. Can you provide some guidance on how to make this work? `ulimit -l` is unlimited, so that isn't the issue. We tried explicit hugepages using libhugetlbfs but encountered errors trying to register the memory:
internal error: 0: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address
internal error: 1: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address
We also tried using transparent 2MB hugepages and `mmap` with `MAP_HUGETLB`. Using this method we are sometimes able to register up to 155GB of memory, but not always. Is there documentation on getting the `efa` provider working with hugepages?
> We also tried using transparent 2MB hugepages and mmap with MAP_HUGETLB.
I don't think EFA supports transparent huge pages. If you have the EFA installer installed on your instance, you should be able to see that huge pages are reserved:
(env) [ec2-user@ip-172-31-51-162 ~]$ cat /sys/kernel/mm/hugepages/**/nr_hugepages
0
14081
You can increase this count to allow larger huge-page allocations. Libfabric uses this call to allocate buffers from the huge-page pool:
*memptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
And EFA provider allocates its internal buffer pool from the huge page pool by default. Did you use the same mmap call in your application to allocate huge page memory?
Yes, we used `start = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);`. This would return `(void*)-1` and error with "Cannot allocate memory". We also validated that `cat /sys/kernel/mm/hugepages/**/nr_hugepages` prints a non-zero value.
We also tried adding `(21 << MAP_HUGE_SHIFT)` to the `mmap` flags to request a specific huge-page size; this made no difference.
@shijin-aws / @j-xiong: Any further suggestions here for how to make progress? Have you successfully been able to register 96+GB of memory in your work?
Have you tried increasing the count at `/sys/kernel/mm/hugepages/**/nr_hugepages` as suggested by @shijin-aws?
As I mentioned earlier, you need to increase `/sys/kernel/mm/hugepages/**/nr_hugepages`, because the default value is only configured for the efa provider's internal bounce buffer pool usage, which is far less than 96GB.
OK, I just made a quick test on c7i.48xlarge and had fabtests allocate a 100GB buffer backed by huge pages:
diff --git a/fabtests/common/shared.c b/fabtests/common/shared.c
index fc228f4d8..ae0c5301e 100644
--- a/fabtests/common/shared.c
+++ b/fabtests/common/shared.c
@@ -630,9 +636,20 @@ int ft_alloc_msgs(void)
buf_size += alignment;
ret = ft_hmem_alloc(opts.iface, opts.device, (void **) &buf,
buf_size);
+
+ buf_size *= 100; // buf_size was 1 GB
+ buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+ if (buf == MAP_FAILED) {
+ FT_PRINTERR("mmap", errno);
+ ret = -FI_ENOMEM;
+ return ret;
+ }
+ printf("allocated memory of size %lu \n", buf_size);
if (ret)
return ret;
....
if (!ft_mr_alloc_func && !ft_check_opts(FT_OPT_SKIP_REG_MR)) {
- ret = ft_reg_mr(fi, rx_buf, rx_buf_size + tx_buf_size,
+ ret = ft_reg_mr(fi, rx_buf, buf_size,
ft_info_to_mr_access(fi),
FT_MR_KEY, opts.iface, opts.device, &mr,
&mr_desc);
if (ret)
return ret;
+ printf("successfully register memory for rx buf\n");
I needed to increase nr_hugepages to 61121, which makes it reserve 2MiB * 61121 ≈ 120GB of memory for huge pages.
ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
61121
And finally the registration succeeded.
ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ FI_LOG_LEVEL=warn fi_rdm_tagged_pingpong -p efa
allocated memory of size 107374201600
successfully register memory for rx buf
Let me know if you still have questions @bradcray
Resolved by https://github.com/chapel-lang/chapel/pull/24971.
Just pointing out one of the issues we ran into implementing https://github.com/chapel-lang/chapel/pull/24971, due to some missing cleanup on our end. We had some missing `fi_close` calls when using the EFA provider, which seemed to cause subsequent runs with huge pages to fail. We would be able to run once, then a second run would fail with an unknown error during memory registration. This appeared to be something not being properly released, even after the process had exited.
In summary, when the EFA teardown was not invoked, subsequent runs would fail until the compute nodes were restarted. Is this intentional?
The EFA provider uses an MR cache for host memory by default, and all MR deregistrations are actually deferred: an MR is put into an LRU list if its use count is 0, or into the dead-region list if the application frees the buffer. Closing the EFA domain cleans up the MR cache by flushing all MRs in the LRU list and the dead-region list. If you don't close your MRs with fi_close, I'd expect them to still be flushed as long as you freed the buffers.
Describe the bug
While developing the Chapel runtime for the EFA provider we encountered an error in which a single process cannot register more than 95GB of memory. Registering 95GB succeeds; 96GB fails with the following error:

To Reproduce
We do not have a simple reproducer; we currently test using the full Chapel runtime. We observed the error on an AWS c7i.48xlarge, which has one EFA NIC and 384GB of memory.

Expected behavior
I expect to be able to register more than 25% of the physical memory of the machine.

Output
The output with FI_LOG_LEVEL=Debug contained:

Environment
This is on an AWS c7i.48xlarge instance using libfabric 1.19, the efa provider, and export FI_EFA_USE_DEVICE_RDMA=1.

Additional context