Open paboyle opened 3 years ago
@paboyle thanks. Did you intend to share a reproducer? I don't see a document attached.
Hi, intended you to ask Camilo or Patrick for it - they have access. But there's nothing secret as this API is on GitHub, so will attach.
Sorry for the run around.... all sorts of things forced me to bounce a file through a GitHub repository.
@paboyle we talked internally about this solution. As you mention above:
Sufficient data, copied by value to obtain the IPC file descriptor can be copied by value between MPI processes without use of a Unix domain socketpair or filedescriptor passing.
This would work only in the MPI scenario, where the jobs belong to the same user, and yes, it would simplify transmission of the IPC handle.
I see this as an optimization, and we have it in our internal backlog for future development. Thanks.
Source: reproducer_IPC_bandwidth.cpp
Problem: zeMemOpenIpcHandle and zeMemGetIpcHandle do not return a handle that can be copied by value between processes and used in a distinct process using either MPI or any other means of communication between processes.
Level Zero example codes such as zello_ipc_copy_dma_buf.cpp reinterpret the first four bytes of the opaque handle as a file descriptor and pass this descriptor between processes through Unix domain sockets. The descriptor value in the receiving process differs in general; it is inserted into the first four bytes of an (uninitialized) handle in the receiving process and used as a key to open the IPC memory window.
MPI has no facilities for file descriptor passing, and MPI process creation does not have an opportunity for socketpair and fork, making this mechanism problematic in an HPC or MPI environment.
See lines 24:76 of Level Zero “zello_ipc_copy_dma_buf.cpp” and the routines static int sendmsg_fd(int socket, int fd) and static int recvmsg_fd(int socket) for the cumbersome implementation.
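For readers unfamiliar with the mechanism, the following is a minimal sketch of descriptor passing over a Unix domain socket using SCM_RIGHTS ancillary data. It is not the code from zello_ipc_copy_dma_buf.cpp; the names send_fd and recv_fd are illustrative only. It shows why the integer received generally differs from the integer sent: the kernel installs a new descriptor in the receiver.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <cstring>

// Send one file descriptor as SCM_RIGHTS ancillary data.
// Returns 0 on success, -1 on failure. (Illustrative helper, not the Level Zero sample.)
int send_fd(int sock, int fd) {
    char payload = 0;                 // at least one byte of ordinary data must accompany the control message
    iovec iov{&payload, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))];
    std::memset(ctrl, 0, sizeof(ctrl));

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

// Receive one file descriptor sent with send_fd.
// Returns the new descriptor in this process, or -1 on failure.
int recv_fd(int sock) {
    char payload = 0;
    iovec iov{&payload, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))];
    std::memset(ctrl, 0, sizeof(ctrl));

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    if (recvmsg(sock, &msg, 0) <= 0) return -1;

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;                        // numbered by the receiver's descriptor table, not the sender's
}
```

Both ends need a connected Unix domain socket (for example from socketpair before fork), which is exactly the setup step an MPI launcher does not provide.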
Proposed solution:
L284 through L358.
Sufficient data, copied by value to obtain the IPC file descriptor can be copied by value between MPI processes without use of a Unix domain socketpair or filedescriptor passing.
The data structure clone_mem_t, containing both the source process PID and the FD number within that process, can be used with the Linux system call “pidfd_getfd” to obtain the same file descriptor in another process with the same UID or sufficient permissions.
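The structure definition itself did not survive in this thread, so the following is only a sketch of what such a PID and FD pair and its use with pidfd_getfd might look like. The field names and the helpers describe_fd and open_remote_fd are assumptions, not the reproducer's code. The system calls are invoked through syscall(2) because older C libraries may lack wrappers; pidfd_open needs Linux 5.3 or later and pidfd_getfd needs Linux 5.6 or later.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <type_traits>

// Fallback syscall numbers for older kernel headers (the values used on
// x86_64 and most other architectures).
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

// Hypothetical layout: the original definition is not shown in the issue text.
struct clone_mem_t {
    pid_t pid;  // process that owns the descriptor
    int fd;     // descriptor number inside that process
};
static_assert(std::is_trivially_copyable<clone_mem_t>::value,
              "must be safe to copy by value as raw bytes");

// Exporting side: describe a local descriptor as a plain value.
clone_mem_t describe_fd(int fd) {
    return clone_mem_t{getpid(), fd};
}

// Importing side: duplicate the described descriptor into this process.
// Returns a new local descriptor, or -1 if the kernel refuses (old kernel,
// different UID without sufficient privilege, ptrace restrictions, ...).
int open_remote_fd(const clone_mem_t& h) {
    int pidfd = static_cast<int>(syscall(SYS_pidfd_open, h.pid, 0));
    if (pidfd < 0) return -1;
    int fd = static_cast<int>(syscall(SYS_pidfd_getfd, pidfd, h.fd, 0));
    close(pidfd);
    return fd;
}
```

Because clone_mem_t is plain data, it could be sent between ranks as raw bytes. The permission check applied by pidfd_getfd is a ptrace-attach style check, so system hardening settings may also affect whether the call succeeds.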
This code demonstrates transmission of a clone_mem_t between MPI ranks and the opening of the IPC file descriptor (using pidfd_getfd) in the receiving process and then passing this into the Level Zero IPC Routine.
This represents a proof of principle, that if the opaque IPC handle used by Level Zero instead contains an encoding of “PID” and “FD” then the Level Zero IPC API handles could then be copied simply by value, through MPI or otherwise and passed to the IPC open routine without any additional and unnecessary complexity.
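As a rough illustration of that idea, the sketch below packs a PID and FD into the leading bytes of a fixed-size opaque byte array and recovers them after a by-value copy. It deliberately uses a stand-in type, opaque_handle_t, rather than the real Level Zero handle, and the encoding shown is an assumption for illustration, not something the Level Zero runtime is known to do.

```cpp
#include <sys/types.h>
#include <cstddef>
#include <cstring>

// Stand-in for an opaque IPC handle: a fixed-size byte array.
// The size of 64 bytes is chosen to mirror ZE_MAX_IPC_HANDLE_SIZE.
constexpr std::size_t kHandleBytes = 64;
struct opaque_handle_t {
    char data[kHandleBytes];
};

// Hypothetical payload stored at the start of the handle.
struct pid_fd_pair {
    pid_t pid;
    int fd;
};
static_assert(sizeof(pid_fd_pair) <= kHandleBytes,
              "payload must fit inside the opaque handle");

// Producer side: write the pair into an otherwise zeroed handle.
opaque_handle_t encode_handle(pid_t pid, int fd) {
    opaque_handle_t h{};
    pid_fd_pair p{pid, fd};
    std::memcpy(h.data, &p, sizeof(p));
    return h;
}

// Consumer side: read the pair back out of a handle received by value.
pid_fd_pair decode_handle(const opaque_handle_t& h) {
    pid_fd_pair p{};
    std::memcpy(&p, h.data, sizeof(p));
    return p;
}
```

With an encoding of this kind the handle is self-describing, so the bytes could travel through any transport that copies memory, and the open routine would resolve the descriptor itself on the receiving side.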
Code available via Camilo Moreno or Patrick Steinbrecher @ intel.
Realistically this is enough to go on, I suspect. See man pidfd_getfd under Linux.
https://man7.org/linux/man-pages/man2/pidfd_getfd.2.html