intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

root device index and index of localMemAllocs conflict during eviction, and that will cause segfaults #661

Closed Baole21 closed 12 months ago

Baole21 commented 1 year ago

Commit 16db7cc8902477dc4b8b5930a79903d713ec700b introduces a new issue; more precisely, the patch exposes a larger problem that was already present.


Below is the relevant part of that commit.

index 13b59d739c..ab366615d1 100644
--- a/shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp
+++ b/shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp
@@ -167,6 +167,10 @@ MemoryOperationsStatus DrmMemoryOperationsHandlerBind::evictUnusedAllocationsImp
         for (auto &allocation : allocationsForEviction) {
             bool evict = true;

+            if (allocation->getRootDeviceIndex() != this->rootDeviceIndex) {
+                continue;
+            }
+
             for (const auto &engine : engines) {
                 if (this->rootDeviceIndex == engine.commandStreamReceiver->getRootDeviceIndex() &&
                     engine.osContext->getDeviceBitfield().test(subdeviceIndex)) {

When I launched multiple processes (up to 100) running cl_gemm, each process exposing a single card via ZE_AFFINITY_MASK, many of them segfaulted.

After some investigation, I can say that whenever a workload process exposes fewer dGPU cards than the system contains, the card index (rootDeviceIndex) and the index into localMemAllocs disagree during eviction.
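The multi-process launch described above can be scripted roughly as follows. This is a hypothetical reproduction sketch; the card count (4), the cl_gemm path, and its arguments are assumptions taken from the example commands in this report:

```shell
#!/bin/sh
# Hypothetical reproduction sketch: launch NPROC workloads, each pinned to
# one card via ZE_AFFINITY_MASK, then wait for all of them to finish.
NPROC="${NPROC:-100}"
i=0
while [ "$i" -lt "$NPROC" ]; do
    card=$((i % 4)) # rotate across cards 0..3; adjust to the system
    ZE_AFFINITY_MASK="$card" ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 \
        "${@:-./cl_gemm}" gpu 2048 1000 &
    i=$((i + 1))
done
wait
```

With the default arguments each child process sees only the one card named in its ZE_AFFINITY_MASK, which is the condition that triggers the index mismatch.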


Segfault backtrace:

(gdb) bt full

#0  NEO::GraphicsAllocation::getRootDeviceIndex (this=<optimized out>) at ./shared/source/memory_manager/graphics_allocation.h:74
No locals.
#1  NEO::DrmMemoryOperationsHandlerBind::evictUnusedAllocationsImpl (this=0x55d485d6d030,
    allocationsForEviction=std::vector of length -9007199793717275, capacity -8995403385361004 = {...}, waitForCompletion=false)
    at ./shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp:178
        evict = true
        allocation = <error reading variable: Cannot access memory at address 0x100000101010100>
        __for_range = std::vector of length -9007199793717275, capacity -8995403385361004 = {
          <error reading variable __for_range (Cannot access memory at address 0x100000101010100)>
        __for_begin = <optimized out>
        __for_end = <optimized out>
        subdeviceIndex = 0
        engines = std::vector of length 0, capacity 0
        evictCandidates = std::vector of length 0, capacity 0
#2  0x00007feacc467a05 in NEO::DrmMemoryOperationsHandlerBind::evictUnusedAllocations (this=0x55d485d6d030, waitForCompletion=<optimized out>,
    isLockNeeded=<optimized out>) at ./shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp:160
        status = <optimized out>
        __for_range = <optimized out>
        __for_begin = <optimized out>
        __for_end = <optimized out>
        memoryManager = 0x55d485d65cf0
        evictLock = {_M_device = 0x55d485d6d038, _M_owns = true}
        allocLock = {_M_device = 0x55d485d66158, _M_owns = true}

The immediate cause: when only one card is exposed to the process, only localMemAllocs[0] is initialized, but the root device index (rootDeviceIndex) used during eviction is the one selected via ZE_AFFINITY_MASK=X, which may be 1, 2, or 3. Reading localMemAllocs[rootDeviceIndex] is then out of bounds.

Example: $ ZE_AFFINITY_MASK=2 ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 ./cl_gemm gpu 2048 1000

The values of a few key variables show what happened.

shared/source/memory_manager/memory_manager.cpp

52     const auto rootEnvCount = executionEnvironment.rootDeviceEnvironments.size();
59     for (uint32_t rootDeviceIndex = 0; rootDeviceIndex < rootEnvCount; ++rootDeviceIndex) {
68         gfxPartitions.push_back(std::make_unique<GfxPartition>(reservedCpuAddressRange));

rootEnvCount = 1, gfxPartitions.size() = 1

./shared/source/os_interface/linux/drm_memory_manager.cpp

75 void DrmMemoryManager::initialize(gemCloseWorkerMode mode) {
78     for (uint32_t rootDeviceIndex = 0; rootDeviceIndex < gfxPartitions.size(); ++rootDeviceIndex) {
79         auto gpuAddressSpace = executionEnvironment.rootDeviceEnvironments[rootDeviceIndex]->getHardwareInfo()->capabilityTable.gpuAddressSpace;
80         if (!getGfxPartition(rootDeviceIndex)->init(gpuAddressSpace, getSizeToReserve(), rootDeviceIndex, gfxPartitions.size(), heapAssigner.apiAllowExternalHeapForSshAndDsh, DrmMemoryManager::getSystemSharedMemory(rootDeviceIndex))) {
81             initialized = false;
82             return;
83         }
84         localMemAllocs.emplace_back();

localMemAllocs.size() = 1; only localMemAllocs[0] was initialized.

shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp

160         this->evictUnusedAllocationsImpl(memoryManager->getLocalMemAllocs(this->rootDeviceIndex), waitForCompletion)}) {

this->rootDeviceIndex = 2

shared/source/os_interface/linux/drm_memory_manager.cpp

1376 std::vector<GraphicsAllocation *> &DrmMemoryManager::getLocalMemAllocs(uint32_t rootDeviceIndex) {
1377     return this->localMemAllocs[rootDeviceIndex];
1378 }

This evaluates to this->localMemAllocs[2], but localMemAllocs[2] was never initialized.

shared/source/os_interface/linux/drm_memory_operations_handler_bind.cpp

170 MemoryOperationsStatus DrmMemoryOperationsHandlerBind::evictUnusedAllocationsImpl(std::vector<GraphicsAllocation *> &allocationsForEviction, bool waitForCompletion) {
171     const auto &engines = this->rootDeviceEnvironment.executionEnvironment.memoryManager->getRegisteredEngines(this->rootDeviceIndex);
172     std::vector<GraphicsAllocation *> evictCandidates;
173
174     for (auto subdeviceIndex = 0u; subdeviceIndex < GfxCoreHelper::getSubDevicesCount(rootDeviceEnvironment.getHardwareInfo()); subdeviceIndex++) {
175         for (auto &allocation : allocationsForEviction) {
176             bool evict = true;
177
178             if (allocation->getRootDeviceIndex() != this->rootDeviceIndex) {
179                 continue;
180             }

Dereferencing allocation via allocation->getRootDeviceIndex() then triggers the segfault.

When two cards are exposed to one process, only localMemAllocs[0] and localMemAllocs[1] are initialized.

Example: ZE_AFFINITY_MASK=1,2 ZE_ENABLE_PCI_ID_DEVICE_ORDER=1 ./cl_gemm gpu 2048 1000

This also triggers an invalid memory read during eviction.

I would be glad to verify a fix patch, and I will also find time to continue working on this issue.

Baole21 commented 12 months ago

Resolved. Thanks!