Open david-edwards-linaro opened 6 months ago
The local_ze_device_count here should be 1. However, local_dev_id includes both the root device and its subdevices, so I believe it is 3 here (1 root + 2 sub). Thus I don't understand the assertion. @abrooks98 maybe you can take a look and clarify the semantics of those two counts.
It seems we missed the idea that users may choose to skip certain devices. In this case, the assertion is incorrect. I'm checking whether it is sufficient to simply remove the assertion, or whether other considerations are needed for this case.
My previous comment is incorrect. The assertion is still valid when certain devices are skipped. local_ze_device_count includes root devices and subdevices, so in the case of ZE_AFFINITY_MASK=1 the correct value is 3.
While setting up the global-to-local device id mapping, local_dev_id is incremented for each root device and each subdevice. So if the scheme is correct, it should also be 3 in this case. Setting ZE_AFFINITY_MASK to a root device should be supported and has worked in the past, but there seems to be a bug or missing logic.
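The counting rule described above can be sketched as follows. This is only an illustration of the arithmetic, not MPICH code: the function name expected_local_device_count and the subdevice_counts array are hypothetical, and the sketch assumes each visible root device contributes one entry plus one entry per subdevice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of local devices expected when
 * every visible root device is counted once, plus once per subdevice.
 * subdevice_counts[i] is the number of subdevices of visible root device i. */
unsigned int expected_local_device_count(const unsigned int *subdevice_counts,
                                         size_t num_root_devices)
{
    unsigned int total = 0;

    assert(subdevice_counts != NULL || num_root_devices == 0);

    for (size_t i = 0; i < num_root_devices; i++) {
        total += 1u + subdevice_counts[i];   /* root device + its subdevices */
    }
    return total;
}
```

For one visible root device with two subdevices, the sketch returns 3, which is the value both counters are expected to reach in the ZE_AFFINITY_MASK=1 case discussed above.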
As a workaround, please try setting ZE_AFFINITY_MASK=1.0,1.1 to ensure the root device and its subdevices are captured and pass this check. See below:
> MPIR_CVAR_ENABLE_GPU=1 ZE_AFFINITY_MASK=1.0,1.1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 -launcher ssh ./mpitest
local_dev_id: 3 | local_ze_device_count: 3
> MPIR_CVAR_ENABLE_GPU=1 ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 -launcher ssh ./mpitest
local_dev_id: 1 | local_ze_device_count: 3
mpitest: src/gpu/mpl_gpu_ze.c:468: MPL_gpu_init_device_mappings: Assertion `local_dev_id == local_ze_device_count' failed.
To resolve this issue, we either need to debug and fix the logic of the ZE_AFFINITY_MASK parsing (bandaid fix) or remove it in favor of using BDF/UUID discovery (portable long-term solution).
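To make the parsing question concrete, here is a minimal sketch of splitting a mask such as "1" or "1.0,1.1" into (device, subdevice) pairs. It is not the MPICH parser: the type mask_entry_t, the function parse_affinity_mask, and the use of -1 to mean "whole device" are assumptions made for illustration, and only the two mask forms that appear in this thread are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical representation of one mask entry. */
typedef struct {
    int device;      /* root device index */
    int subdevice;   /* subdevice index, or -1 if the entry names the whole device */
} mask_entry_t;

/* Parse a comma-separated list of "D" or "D.S" entries.
 * Returns the number of entries written to out, or -1 on malformed input
 * or if more than max_entries entries are present. */
int parse_affinity_mask(const char *mask, mask_entry_t *out, int max_entries)
{
    const char *p = mask;
    int n = 0;

    assert(mask != NULL && out != NULL);

    while (*p != '\0') {
        char *end;
        long dev, sub = -1;

        if (n == max_entries)
            return -1;

        dev = strtol(p, &end, 10);
        if (end == p || dev < 0)
            return -1;              /* no digits, or a negative index */

        if (*end == '.') {          /* optional ".S" subdevice part */
            p = end + 1;
            sub = strtol(p, &end, 10);
            if (end == p || sub < 0)
                return -1;
        }

        out[n].device = (int) dev;
        out[n].subdevice = (int) sub;
        n++;

        if (*end == ',')
            p = end + 1;            /* move on to the next entry */
        else if (*end == '\0')
            p = end;                /* end of string */
        else
            return -1;              /* unexpected character */
    }
    return n;
}
```

With this representation, "1" yields a single entry with subdevice set to -1, so any later code that counts subdevices has to treat -1 as "all subdevices of this device" rather than as an index.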
Thanks for the suggestion, however the assert still occurs on the 2x Flex 170 (single tile per card) system I am using.
The immediate use case is a test environment for which I can patch the MPI source. Simply removing the assert line allows the program to complete, though from earlier comments this may not be a valid approach. Configuring with --with-device=ch4:ucx avoids this code path and is another way to work around the issue.
It turns out the issue of handling whole devices in ZE_AFFINITY_MASK is relatively new and stems from #6929. That change causes an unsigned int to be compared against an int with value -1 in this particular case, which results in the subdevices not getting properly counted. I should have a PR to fix this today.
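The class of bug described here can be reproduced in isolation. The sketch below is not the code from #6929: the function names and the meaning of last_skipped (index of the last subdevice to skip, with -1 meaning "skip none") are hypothetical. It only shows how comparing an unsigned index against an int holding -1 makes a counting loop silently count nothing, because the int is converted to unsigned and -1 becomes UINT_MAX.

```c
#include <assert.h>

/* Buggy variant. The explicit cast spells out what C does implicitly when
 * an unsigned int is compared with an int: the int operand is converted to
 * unsigned, so -1 becomes UINT_MAX and the condition is never true. */
unsigned int count_subdevices_buggy(unsigned int num_subdevices, int last_skipped)
{
    unsigned int count = 0;

    assert(last_skipped >= -1);

    for (unsigned int i = 0; i < num_subdevices; i++) {
        if (i > (unsigned int) last_skipped)
            count++;
    }
    return count;
}

/* Fixed variant: compare in the signed domain so that -1 stays below 0. */
unsigned int count_subdevices_fixed(unsigned int num_subdevices, int last_skipped)
{
    unsigned int count = 0;

    assert(last_skipped >= -1);

    for (unsigned int i = 0; i < num_subdevices; i++) {
        if ((int) i > last_skipped)
            count++;
    }
    return count;
}
```

With two subdevices and last_skipped set to -1, the buggy variant counts 0 subdevices and the fixed variant counts 2. Compiling the implicit form of the comparison with -Wsign-compare (part of -Wextra in GCC for C code) would flag it.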
N.b. I am using 4.2.0 which predates PR6929.
Thanks for pointing this out. I will try to find access to a Flex series GPU and continue investigating this issue.
Issue
MPI programs running on systems with Intel(R) Data Center GPUs hit an assertion failure when ZE_AFFINITY_MASK is set to use a second device.
Environment
O/S: SLES 15.5
CPU: 2x Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
GPU: 2x Intel(R) Data Center GPU Flex 170
MPI: MPICH 4.2.0 configured with --enable-debuginfo --enable-shared, no libdrm present
Reproducer
Save the following trivial MPI program, e.g. as mpitest.c:
Build it as follows:
mpicc -g -O0 -o mpitest mpitest.c
Run it with the affinity mask set to the second device:
ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 ./mpitest
Observe assertion failure: