Closed: jhh67 closed this issue 5 months ago
What is the output of `ulimit -l`?
It's not a bug. The EFA device has a limit on the number of host pages you can register. If you are currently allocating your memory with regular pages (4k), using huge pages (2M on some platforms) reduces the number of pages and allows you to register more memory.
Thank you for your suggestions. We will try them and get back to you with the results.
We haven't had any luck registering more than 95GB of memory using hugepages. Can you provide some guidance on how to make this work? `ulimit -l` is unlimited, so that isn't the issue. We tried explicit hugepages using libhugetlbfs but encountered errors trying to register the memory:
internal error: 0: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address
internal error: 1: comm-ofi.c:2875: OFI error: fi_mr_reg(ofi_domain, memTab[i].addr, memTab[i].size, bufAcc, 0, (prov_key ? 0 : i), 0, &ofiMrTab[i], ((void*)0)): Bad address
We also tried using transparent 2MB hugepages and `mmap` with `MAP_HUGETLB`. Using this method we are sometimes able to register up to 155GB of memory, but not always. Is there documentation on getting the `efa` provider working with hugepages?
> We also tried using transparent 2MB hugepages and mmap with MAP_HUGETLB.
I don't think EFA supports transparent huge pages. If you have the EFA installer installed on your instance, you should be able to see that huge pages are reserved:
(env) [ec2-user@ip-172-31-51-162 ~]$ cat /sys/kernel/mm/hugepages/**/nr_hugepages
0
14081
You can increase this count to allow larger huge-page allocations. Libfabric uses this call to allocate buffers from the huge-page pool:
*memptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
And EFA provider allocates its internal buffer pool from the huge page pool by default. Did you use the same mmap call in your application to allocate huge page memory?
Yes, we used `start = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 0, 0);`. This would return `(void*)-1` and error with "Cannot allocate memory". We also validated that `cat /sys/kernel/mm/hugepages/**/nr_hugepages` prints a non-zero value.
We also tried adding `(21 << MAP_HUGE_SHIFT)` to the `mmap` flags to request a specific huge-page size; this made no difference.
@shijin-aws / @j-xiong: Any further suggestions here for how to make progress? Have you successfully been able to register 96+GB of memory in your work?
Have you tried increasing the count at `/sys/kernel/mm/hugepages/**/nr_hugepages` as suggested by @shijin-aws?
As I mentioned earlier, you need to increase `/sys/kernel/mm/hugepages/**/nr_hugepages`, because the default value is only configured for the efa provider's internal bounce buffer pool usage, which is far less than 96GB.
OK, I just made a quick test on c7i.48xlarge and had fabtests allocate a 100GB buffer backed by huge pages:
diff --git a/fabtests/common/shared.c b/fabtests/common/shared.c
index fc228f4d8..ae0c5301e 100644
--- a/fabtests/common/shared.c
+++ b/fabtests/common/shared.c
@@ -630,9 +636,20 @@ int ft_alloc_msgs(void)
buf_size += alignment;
ret = ft_hmem_alloc(opts.iface, opts.device, (void **) &buf,
buf_size);
+
+ buf_size *= 100; // buf_size was 1 GB
+ buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+ if (buf == MAP_FAILED) {
+ FT_PRINTERR("mmap", errno);
+ ret = -FI_ENOMEM;
+ return ret;
+ }
+ printf("allocated memory of size %lu \n", buf_size);
if (ret)
return ret;
....
if (!ft_mr_alloc_func && !ft_check_opts(FT_OPT_SKIP_REG_MR)) {
- ret = ft_reg_mr(fi, rx_buf, rx_buf_size + tx_buf_size,
+ ret = ft_reg_mr(fi, rx_buf, buf_size,
ft_info_to_mr_access(fi),
FT_MR_KEY, opts.iface, opts.device, &mr,
&mr_desc);
if (ret)
return ret;
+ printf("successfully register memory for rx buf\n");
I needed to increase nr_hugepages to 61121, which makes it reserve 2MiB * 61121 ≈ 120GB of memory for huge pages.
ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
61121
And finally the registration succeeded.
ubuntu@ip-172-31-94-227:~/PortaFiducia/build/libraries/libfabric/main/source/libfabric/fabtests$ FI_LOG_LEVEL=warn fi_rdm_tagged_pingpong -p efa
allocated memory of size 107374201600
successfully register memory for rx buf
Let me know if you still have questions @bradcray
Resolved by https://github.com/chapel-lang/chapel/pull/24971.
Just pointing out one of the issues we ran into implementing https://github.com/chapel-lang/chapel/pull/24971, due to some missing cleanup on our end. We had some missing `fi_close` calls when using the EFA provider, which seemed to cause subsequent runs with huge pages to fail. We would be able to run once, then a second run would fail with an unknown error during memory registration. This appeared to be something not being properly released, even after the process had exited.
In summary, when the EFA teardown was not invoked, subsequent runs would fail until the compute nodes were restarted. Is this intentional?
The EFA provider uses an MR cache for host memory by default, and all MR deregistrations are actually deferred: an MR is put into an LRU list if its use count is 0, or into the dead-region list if the application frees the buffer. Closing the EFA domain cleans up the MR cache by flushing all MRs in the LRU list and the dead-region list. If you don't close your MRs with fi_close, I'd expect them to still be flushed as long as you freed the buffers.
Describe the bug
While developing the Chapel runtime for the EFA provider we encountered an error in which a single process cannot register more than 95GB of memory. Registering 95GB succeeds; 96GB fails with the following error:

To Reproduce
We do not have a simple reproducer; we currently test using the full Chapel runtime. We observed the error on an AWS c7i.48xlarge, which has one EFA NIC and 384GB of memory.

Expected behavior
I expect to be able to register more than 25% of the physical memory of the machine.

Output
The output with FI_LOG_LEVEL=Debug contained:

Environment
This is on an AWS c7i.48xlarge instance using libfabric 1.19, the efa provider, and export FI_EFA_USE_DEVICE_RDMA=1.

Additional context