I think what we might want is an environment variable to disable huge pages by default. If that variable is set, then ofi_get_hugepage_size() can just return -FI_ENOENT. That should provide the proper fallback everywhere.
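Roughly along these lines (just a sketch — "FI_HUGEPAGES_DISABLE" is a made-up name, the real ofi_get_hugepage_size() may look different, and I'm using -ENOENT in place of -FI_ENOENT just to keep the snippet standalone):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch only: "FI_HUGEPAGES_DISABLE" is a hypothetical variable name, and
 * the real ofi_get_hugepage_size() in libfabric may have a different shape;
 * -ENOENT stands in for -FI_ENOENT. */
static long get_hugepage_size(void)
{
	char line[128];
	long size_kb;
	FILE *fp;

	if (getenv("FI_HUGEPAGES_DISABLE"))	/* hypothetical opt-out knob */
		return -ENOENT;			/* callers fall back to regular pages */

	fp = fopen("/proc/meminfo", "r");
	if (!fp)
		return -ENOENT;

	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "Hugepagesize: %ld kB", &size_kb) == 1) {
			fclose(fp);
			return size_kb * 1024;	/* hugepage size in bytes */
		}
	}
	fclose(fp);
	return -ENOENT;
}
```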
Hmmm, so this is a new environment variable? If it's set, libfabric itself could never use hugepages; I thought hugepage support was added to libfabric for better performance and fewer TLB misses. With verbs;ofi_rxm one could never unset the new variable, because then the failure I reported here would just happen again, while non-verbs;ofi_rxm callers would want to unset it. This would all have to be documented.
If the application itself regularly registers unaligned hugepage byte ranges, then it would want to unset the new variable as well because its setting of RDMAV_HUGEPAGES_SAFE already protects against failure.
Do you see any harm in fixing the code the way I mentioned? No new environment variable and corresponding documentation would be required, and the benefits of using hugepages in libfabric are preserved.
I have an alternate solution as well.
Modify the code in ibv_madvise_range() (called by ibv_reg_mr()); see line 590 in https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/memory. When huge_page_enabled is false, the code today does the normal 4 KiB rounding-out of the starting offset and length and then issues the madvise() call. If that call fails, the code should still execute the same backout code as today, but then test whether the address and length are backed by hugepages. If so, it should redo the rounding-out with the hugepage size and go around the loop a second time. If the second pass succeeds, execution continues; if not, it is a hard failure and ibv_reg_mr() fails.
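Something like this, just to sketch the idea (this is not the actual rdma-core code; is_huge_page_backed() and huge_page_size() are placeholders for whatever check rdma-core would actually use, e.g. parsing /proc/self/smaps):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helpers, standing in for rdma-core's own hugepage detection. */
extern int is_huge_page_backed(const void *addr, size_t len);
extern size_t huge_page_size(const void *addr);

/* Sketch of the proposed fallback: try the normal 4 KiB rounding first, and
 * only if madvise() fails and the range is hugepage-backed, re-round to the
 * hugepage size and retry once. */
static int madvise_dontfork(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = (uintptr_t)addr & ~(page - 1);
	uintptr_t end = ((uintptr_t)addr + len + page - 1) & ~(page - 1);
	int tries;

	for (tries = 0; tries < 2; tries++) {
		if (!madvise((void *)start, end - start, MADV_DONTFORK))
			return 0;		/* success */

		/* First failure: if the range is hugepage-backed, redo the
		 * rounding with the hugepage size and go around once more. */
		if (tries || !is_huge_page_backed(addr, len))
			break;
		page = huge_page_size(addr);
		start = (uintptr_t)addr & ~(page - 1);
		end = ((uintptr_t)addr + len + page - 1) & ~(page - 1);
	}
	return -1;				/* hard failure, as today */
}
```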
The advantage here is that there is no cost at all to the application, either if it is using regular memory or if it always hugepage-aligns the address and length before calling ibv_reg_mr(). In the case where the call does present unaligned hugepage memory (the case I reported here), the new code corrects for it without requiring that RDMAV_HUGEPAGES_SAFE be set, with a latency cost similar to that of RDMAV_HUGEPAGES_SAFE but only on unaligned calls.
An application willing to hugepage-align all its requests to fi_mr_reg() should not be forced to set RDMAV_HUGEPAGES_SAFE and pay the latency cost on every fi_mr_reg() call just because of the one unaligned request in the libfabric code; that's what I'm trying to solve here.
RDMAV_HUGEPAGES_SAFE is a requirement coming from libibverbs. Until that is removed, if huge pages are used at all, then it must be set. See https://linux.die.net/man/3/ibv_fork_init
Setting the environment variable RDMAV_HUGEPAGES_SAFE tells the library to check the underlying page size used by the kernel for memory regions. This is required if an application uses huge pages either directly or indirectly via a library such as libhugetlbfs.
@krehm have you ever seen this manifest as a segmentation fault, or just a registration error? We got a bug report (from a mochi/mercury/libfabric/verbs stack user) of a confusing segfault that went away when RDMAV_HUGEPAGES_SAFE was set at the recommendation of a system administrator. I don't know if this means we have a secondary error handling problem somewhere, or if that's just a possible outcome of using verbs without RDMAV_HUGEPAGES_SAFE on a hugepages system. The latter would be really frustrating.
Is there any plausible way to have the verbs provider issue a warning (pointing to verbs documentation) if hugepages are in use but the RDMAV_HUGEPAGES_SAFE env var is not set?
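Maybe something along these lines in the registration path? Just a sketch — plain getenv() and stderr instead of libfabric's real logging, and the hugepage-backed flag would have to come from wherever the buffer was allocated:

```c
#include <stdio.h>
#include <stdlib.h>

/* Rough sketch: warn once if a hugepage-backed buffer is being registered
 * while RDMAV_HUGEPAGES_SAFE is not set.  A real implementation would use
 * libfabric's logging macros, and "buffer_is_hugepage_backed" would come
 * from the buffer pool rather than being passed in like this. */
static void warn_if_hugepages_unsafe(int buffer_is_hugepage_backed)
{
	static int warned;

	if (warned || !buffer_is_hugepage_backed)
		return;
	if (!getenv("RDMAV_HUGEPAGES_SAFE")) {
		fprintf(stderr,
			"verbs: registering hugepage-backed memory without "
			"RDMAV_HUGEPAGES_SAFE set; ibv_reg_mr() may fail. "
			"See ibv_fork_init(3).\n");
		warned = 1;
	}
}
```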
@carns No, the segfault is a separate bug. I was chasing it a couple of weeks ago, there is a logic error where when the registration fails, some memory is cleaned up but a pointer to it is not cleared, and then subsequent code reuses that memory turning it into garbage, and then there is a second reference to the memory that segfaults. I was going to dig a little deeper and then report it, but got overwhelmed with other tasks. What is the other bug report, does it contain a backtrace? That might be enough to trigger my memory.
There have been some recent updates on the kernel side that might interest you. Specifically, Peter Xu (peterx@redhat.com) has introduced a series of patches that do COW on pinned huge page memory:
https://patchwork.kernel.org/project/linux-mm/patch/20210217233547.93892-6-peterx@redhat.com/
This patch essentially implements fork support for huge pages in kernel space, so the user-space fork support is not needed any more.
With Peter's patch in place, Gal Pressman added another kernel patch:
commit 6cc9e215eb277513719c32b9ba40e5012b02db57
Author: Gal Pressman <galpress@amazon.com>
Date: Sun Apr 18 15:10:25 2021 +0300
RDMA/nldev: Add copy-on-fork attribute to get sys command
The new attribute indicates that the kernel copies DMA pages on fork,
hence libibverbs' fork support through madvise and MADV_DONTFORK is not
needed.
The introduced attribute is always reported as supported since the kernel
has the patch that added the copy-on-fork behavior. This allows the
userspace library to identify older vs newer kernel versions. Extra care
should be taken when backporting this patch as it relies on the fact that
the copy-on-fork patch is merged, hence no check for support is added.
Don't backport this patch unless you also have the following series:
commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for
ptes") and commit 4eae4efa2c29 ("hugetlb: do early cow when page pinned on
src mm").
Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes")
Fixes: 4eae4efa2c29 ("hugetlb: do early cow when page pinned on src mm")
Link: https://lore.kernel.org/r/20210418121025.66849-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Meanwhile, rdma-core introduced a new API, ibv_is_fork_initialized(), which reports this status; if it returns UNNEEDED, you do not need to set the RDMAV_HUGEPAGES_SAFE variable.
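Roughly like this (availability depends on your rdma-core version, so treat it as a sketch):

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: decide at runtime whether fork protection (and hence
 * RDMAV_HUGEPAGES_SAFE) is still needed.  Requires an rdma-core release
 * that provides ibv_is_fork_initialized(). */
int main(void)
{
	switch (ibv_is_fork_initialized()) {
	case IBV_FORK_UNNEEDED:
		printf("kernel copies pinned pages on fork; "
		       "RDMAV_HUGEPAGES_SAFE is not needed\n");
		break;
	case IBV_FORK_ENABLED:
		printf("fork protection via madvise() is active\n");
		break;
	default:
		printf("fork protection not initialized; set "
		       "RDMAV_HUGEPAGES_SAFE if hugepages are in use\n");
		break;
	}
	return 0;
}
```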
@wzamazon The memory pinned here is being used as the source or destination of RDMA transfers from another machine, which has the address of the pinned memory. I don't quite understand how this will work; after the fork the parent process can continue without issue. The forked process will COW only if something on that local machine writes to the memory, right? COW wouldn't work for an RDMA transfer? If you could explain the behavior of the child process in the face of an in-progress RDMA transfer, that would be helpful, thanks.
Thanks @krehm, that's helpful to know. I've reached out to the user to ask if they can share a backtrace here.
@krehm For kernel < 5.12, CoW was applied to the pinned memory region during fork. After the fork, parent and child share the memory region; if either the parent or the child (this is important) touched the shared region, a page fault occurred and the memory was moved. For kernel >= 5.12, an eager copy is applied to the pinned memory region instead. After the fork, the child has its own copy of the pinned region, so the parent can change the pinned memory region without issue.
@carns No, the segfault is a separate bug. I was chasing it a couple of weeks ago, there is a logic error where when the registration fails, some memory is cleaned up but a pointer to it is not cleared, and then subsequent code reuses that memory turning it into garbage, and then there is a second reference to the memory that segfaults.
@krehm The above sounds like it could be related to userfaultfd rather than the environment variable. There was a fix introduced into the userfaultfd code in master recently that you might consider.
As far as the kernel changes go, I believe we want verbs to operate correctly regardless of the kernel version. We need to be careful here. It sounds like we have room for improvement using the new function in rdma-core (ibv_is_fork_initialized), and some potential benefit down the road with kernel changes. Is that accurate?
I have been testing with DAOS using InfiniBand and verbs;ofi_rxm, and because nodes may have hugepages configured I always have to define RDMAV_HUGEPAGES_SAFE to prevent ibv_reg_mr() failures, with the added performance penalty described by James Swaro in PR 4973. Looking at the code, it seems like a small change to ofi_rxm would get rid of the need for RDMAV_HUGEPAGES_SAFE.
The problem is in routine ofi_bufpool_grow() in libfabric's file prov/util/src/util_buf.c. If the call to ofi_alloc_hugepage_buf() succeeds, then the resulting buf_region->alloc_region and pool->alloc_size are properly hugepage-aligned. If those values were used in the ibv_reg_mr() call, the default 4 KiB rounding-out done by that code would be a no-op, and so the memory registration would succeed without having to set RDMAV_HUGEPAGES_SAFE. Instead, the routine uses
buf_region->mem_region = buf_region->alloc_region + pool->entry_size;
as the starting range address, which is not hugepage-aligned. The 4 KiB rounding-out done by ibv_reg_mr() will not result in correct hugepage alignment, and hence the ibv_reg_mr() call will fail.
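To make the arithmetic concrete (illustrative numbers only; 2 MiB hugepages and a 4 KiB entry_size assumed):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative numbers only: with a 2 MiB hugepage size and a 4 KiB
 * entry_size, rounding mem_region down to a 4 KiB boundary is a no-op, but
 * the result is still not aligned to the 2 MiB boundary that the hugepage
 * mapping requires. */
int main(void)
{
	uintptr_t huge = 2UL * 1024 * 1024;		/* hugepage size */
	uintptr_t alloc_region = 16 * huge;		/* hugepage-aligned start */
	uintptr_t mem_region = alloc_region + 4096;	/* + entry_size */

	uintptr_t rounded_4k = mem_region & ~(uintptr_t)(4096 - 1);
	uintptr_t rounded_hp = mem_region & ~(huge - 1);

	printf("alloc_region       0x%lx (hugepage-aligned)\n",
	       (unsigned long)alloc_region);
	printf("mem_region         0x%lx\n", (unsigned long)mem_region);
	printf("4 KiB rounding  -> 0x%lx (still not hugepage-aligned)\n",
	       (unsigned long)rounded_4k);
	printf("2 MiB rounding  -> 0x%lx (what the mapping needs)\n",
	       (unsigned long)rounded_hp);
	return 0;
}
```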
Specifically, the line ret = pool->attr.alloc_fn(buf_region); calls routine rxm_buf_reg(); the fix is to change the call in that routine to function rxm_msg_mr_reg_internal() to use pool->alloc_region and pool->alloc_size as parameters in place of region->mem_region and region->pool->region_size. There should be no harm in registering the additional pool->entry_size bytes at the beginning of pool->alloc_region even though they will not be used for I/O.
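Roughly, the change I have in mind looks like this (identifiers are taken from the description above and may not match the current tree exactly; the elided arguments would stay unchanged):

```c
/* Sketch only, not the actual libfabric code. */

/* before: the registered address is offset by entry_size, so it is not
 * hugepage-aligned and the 4 KiB rounding in ibv_reg_mr() cannot fix it */
ret = rxm_msg_mr_reg_internal(/* ... */, region->mem_region,
			      region->pool->region_size /* , ... */);

/* after: register the whole hugepage-aligned allocation; the extra
 * entry_size bytes at the front are registered but never used for I/O */
ret = rxm_msg_mr_reg_internal(/* ... */, region->alloc_region,
			      region->pool->alloc_size /* , ... */);
```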
I'm interested in hearing any feedback on this before creating a PR and testing it.