The commit 16db7cc8902477dc4b8b5930a79903d713ec700b introduces a new issue; in fact, this patch helps to expose a bigger problem behind it.
Below is part of this commit.
index 13b59d739c..ab366615d1 100644
--- a/shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp
+++ b/shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp
@@ -167,6 +167,10 @@ MemoryOperationsStatus DrmMemoryOperationsHandlerBind::evictUnusedAllocationsImp
for (auto &allocation : allocationsForEviction) {
bool evict = true;
+ if (allocation->getRootDeviceIndex() != this->rootDeviceIndex) {
+ continue;
+ }
+
for (const auto &engine : engines) {
if (this->rootDeviceIndex == engine.commandStreamReceiver->getRootDeviceIndex() &&
engine.osContext->getDeviceBitfield().test(subdeviceIndex)) {
When I launched multiple processes (up to 100) to run cl_gemm, each exposing one card using ZE_AFFINITY_MASK, I saw many segfaults.
After some investigation, I can say that when the number of dGPU cards exposed to a workload process is smaller than the total number of cards in the system, the card index (rootDeviceIndex) and the index into localMemAllocs conflict during eviction.
The immediate cause: when we expose one card to a process, only localMemAllocs[0] is initialized, but the root device index (rootDeviceIndex) seen during eviction is the global index of the card exposed via ZE_AFFINITY_MASK=X, which may be 1, 2, or 3. Reading localMemAllocs[rootDeviceIndex] is therefore out of bounds.
segfault:
(gdb) bt full
Example: $ ZE_AFFINITY_MASK=2 ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 ./cl_gemm gpu 2048 1000
The values of a few key variables show what happened.

shared/source/memory_manager/memory_manager.cpp
rootEnvCount = 1, gfxPartitions.size() = 1

shared/source/os_interface/linux/drm_memory_manager.cpp
localMemAllocs.size() = 1 (only localMemAllocs[0] was initialized)

shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp
160 this->evictUnusedAllocationsImpl(memoryManager->getLocalMemAllocs(this->rootDeviceIndex), waitForCompletion)}) {
this->rootDeviceIndex = 2

shared/source/os_interface/linux/drm_memory_manager.cpp
return this->localMemAllocs[2]; // localMemAllocs[2] was not initialized

shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp
Calling allocation->getRootDeviceIndex() on an element of this uninitialized list then triggers the segfault.
When we expose two cards to one process, only localMemAllocs[0] and localMemAllocs[1] are initialized.
Example: ZE_AFFINITY_MASK=1,2 ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 ./cl_gemm gpu 2048 1000
This also triggers a memory read error during eviction.
I would be glad to verify your fix patch, and I will also find time to continue working on this issue.