pmodels / mpich


[4.2.0] Assert in mpl_gpu_ze.c:466 when ZE_AFFINITY_MASK set to second device #6958

Open david-edwards-linaro opened 6 months ago

david-edwards-linaro commented 6 months ago

Issue

MPI programs running on systems with Intel(R) Data Center GPUs fail an assertion when ZE_AFFINITY_MASK is set to use a second device.

Environment

O/S: SLES 15.5
CPU: 2x Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
GPU: 2x Intel(R) Data Center GPU Flex 170
MPI: MPICH 4.2.0 configured with --enable-debuginfo --enable-shared, no libdrm present

Reproducer

Save the following trivial MPI program e.g. as mpitest.c:

#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

Build it as follows: mpicc -g -O0 -o mpitest mpitest.c

Run it with the affinity mask set to the second device: ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 ./mpitest

Observe assertion failure:

mpitest: mpich-4.2.0/src/mpl/src/gpu/mpl_gpu_ze.c:466: int MPL_gpu_init_device_mappings(int, int): Assertion `local_dev_id == local_ze_device_count' failed.

hzhou commented 6 months ago

The local_ze_device_count here should be 1. However, local_dev_id includes both the root device and the subdevices, so I believe it is 3 here (1 root + 2 sub). Thus I don't understand why the assertion fires. @abrooks98, maybe you can take a look and clarify the semantics of these two counts.
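
To make the two counts concrete, the following is a small standalone Level Zero sketch (not MPICH code; error checking omitted, and the "root + subdevices" convention is my reading of what local_ze_device_count is meant to capture) that counts the root devices and subdevices visible to the process. Running it under different ZE_AFFINITY_MASK settings shows how the mask changes both numbers.

#include <stdio.h>
#include <stdlib.h>
#include <level_zero/ze_api.h>

int main(void)
{
    /* Error checks omitted for brevity. */
    zeInit(ZE_INIT_FLAG_GPU_ONLY);

    uint32_t ndrivers = 0;
    zeDriverGet(&ndrivers, NULL);
    ze_driver_handle_t *drivers = malloc(ndrivers * sizeof(*drivers));
    zeDriverGet(&ndrivers, drivers);

    uint32_t roots = 0, subs = 0;
    for (uint32_t d = 0; d < ndrivers; d++) {
        uint32_t ndev = 0;
        zeDeviceGet(drivers[d], &ndev, NULL);
        ze_device_handle_t *devs = malloc(ndev * sizeof(*devs));
        zeDeviceGet(drivers[d], &ndev, devs);
        roots += ndev;
        for (uint32_t i = 0; i < ndev; i++) {
            uint32_t nsub = 0;                      /* query the count only */
            zeDeviceGetSubDevices(devs[i], &nsub, NULL);
            subs += nsub;
        }
        free(devs);
    }
    free(drivers);

    printf("root devices: %u, subdevices: %u, root + subdevices: %u\n",
           roots, subs, roots + subs);
    return 0;
}

It can be built with, for example, cc -o ze_count ze_count.c -lze_loader, assuming the Level Zero loader and headers are installed.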

abrooks98 commented 6 months ago

It seems we missed the idea that users may choose to skip using certain devices. In that case, the assertion is incorrect. I'm checking whether it is sufficient to simply remove the assertion, or whether other considerations are needed for this case.

abrooks98 commented 6 months ago

My previous comment is incorrect. The assertion is still valid when certain devices are skipped. local_ze_device_count includes root and subdevices, so in the case of ZE_AFFINITY_MASK=1 the correct value is 3.

While setting up the global-to-local device id mapping, local_dev_id is incremented for both root and subdevices, so if the scheme is correct it should also be 3 in this case. Setting ZE_AFFINITY_MASK to a root device should be supported and has worked in the past, but it seems there is a bug or missing logic here.

As a workaround, please try setting ZE_AFFINITY_MASK=1.0,1.1 so that the root device and its subdevices are all captured and the check passes. See below:

> MPIR_CVAR_ENABLE_GPU=1 ZE_AFFINITY_MASK=1.0,1.1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 -launcher ssh ./mpitest
local_dev_id: 3 | local_ze_device_count: 3

> MPIR_CVAR_ENABLE_GPU=1 ZE_AFFINITY_MASK=1 MPIR_CVAR_CH4_IPC_ZE_SHAREABLE_HANDLE=pidfd mpirun -n 1 -launcher ssh ./mpitest
local_dev_id: 1 | local_ze_device_count: 3
mpitest: src/gpu/mpl_gpu_ze.c:468: MPL_gpu_init_device_mappings: Assertion `local_dev_id == local_ze_device_count' failed.

To resolve this issue, we either need to debug and fix the ZE_AFFINITY_MASK parsing logic (a band-aid fix) or remove it in favor of BDF/UUID discovery (a portable long-term solution).
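
For illustration only, here is a toy parser for individual ZE_AFFINITY_MASK entries (hypothetical struct and function names; not the MPICH implementation). It shows why a whole-device entry such as 1 carries no subdevice field of its own and therefore needs an explicit "expand to all subdevices" step, which is exactly where a parsing fix would have to be careful.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy representation of one mask entry: "1" selects a whole root device,
 * "1.0" selects a single subdevice.  A missing subdevice field is recorded
 * as -1 here (a hypothetical sentinel, not MPICH's data structure). */
struct mask_entry {
    int root;
    int subdevice;   /* -1 means "whole device, all subdevices" */
};

static struct mask_entry parse_entry(const char *s)
{
    struct mask_entry e;
    const char *dot = strchr(s, '.');
    e.root = atoi(s);
    e.subdevice = dot ? atoi(dot + 1) : -1;
    return e;
}

int main(void)
{
    const char *samples[] = { "1", "1.0", "1.1" };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        struct mask_entry e = parse_entry(samples[i]);
        printf("%-4s -> root %d, subdevice %d\n", samples[i], e.root, e.subdevice);
    }
    return 0;
}

With this representation, "1" parses to root 1 / subdevice -1, while "1.0" and "1.1" name the subdevices explicitly, matching the workaround shown above.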

david-edwards-linaro commented 6 months ago

Thanks for the suggestion; however, the assert still occurs on the 2x Flex 170 (single tile per card) system I am using.

The immediate use case is a test environment for which I can patch the MPI source. Simply removing the assert line allows the program to complete, though from the earlier comments this may not be a valid approach? Configuring with --with-device=ch4:ucx avoids this code path and is a further option to work around this issue.

abrooks98 commented 6 months ago

It turns out that handling whole devices in ZE_AFFINITY_MASK is relatively new and stems from #6929. In this particular case, the change ends up comparing an unsigned int against an int with value -1, which results in the subdevices not being counted properly. I should have a PR to fix this today.
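
The exact code is not quoted in this thread, but the class of bug is easy to reproduce in isolation. The standalone snippet below (hypothetical variable names) shows how a comparison against an int sentinel of -1 silently flips once the other operand is an unsigned int, which is the kind of mistake that leaves subdevices uncounted; compilers flag it only as a -Wsign-compare warning.

#include <stdio.h>

int main(void)
{
    unsigned int nsubs = 2;   /* subdevices reported for a root device (hypothetical) */
    int wanted = -1;          /* sentinel meaning "take the whole device" (hypothetical) */

    /* Mathematically 2 > -1, but the usual arithmetic conversions turn the
     * int -1 into UINT_MAX before comparing, so the test is false and the
     * branch that would count the subdevices is skipped. */
    if (nsubs > wanted)
        printf("counting %u subdevices\n", nsubs);
    else
        printf("skipped: -1 became %u after conversion\n", (unsigned int) -1);

    /* Comparing in a signed type restores the intended behaviour. */
    if ((int) nsubs > wanted)
        printf("counting %u subdevices\n", nsubs);

    return 0;
}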

david-edwards-linaro commented 6 months ago

N.B. I am using 4.2.0, which predates PR #6929.

abrooks98 commented 6 months ago

Thanks for pointing this out. I will try to find access to a Flex series GPU and continue investigating this issue.