iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Always requiring HOST_VISIBLE memory causes out of memory on NVIDIA GPU #5268

Closed MaheshRavishankar closed 3 years ago

MaheshRavishankar commented 3 years ago

With #5264, MobileBERT was changed to run in CI as well (a missing iree.symbol.export had prevented it). With this change, the CPU side runs fine, but the GPU/Vulkan-SPIRV side fails to run. With the old path I get this error:

$ ../build-vulkan-tracy/iree/tools/iree-run-mlir -export-all -iree-hal-target-backends=vulkan-spirv ../iree/iree/test/e2e/models/bert_encoder_unrolled_fake_weights.mlir 
I /home/ravishankarm/Development/iree/iree/iree/tools/iree-run-mlir-main.cc:205] Compiling for target backend 'vulkan-spirv*'...
I /home/ravishankarm/Development/iree/iree/iree/tools/iree-run-mlir-main.cc:318] Evaluating all functions in module for driver 'vulkan'...
I /home/ravishankarm/Development/iree/iree/iree/tools/utils/vm_util.cc:258] Creating driver and device for 'vulkan'...
EXEC @serving_default
E /home/ravishankarm/Development/iree/iree/iree/tools/iree-run-mlir-main.cc:469] Failure for split at line #1: /home/ravishankarm/Development/iree/iree/iree/hal/vulkan/status_util.c:53: RESOURCE_EXHAUSTED; VK_ERROR_OUT_OF_DEVICE_MEMORY; vmaCreateBuffer; while invoking native function hal.allocator.allocate; while calling import; Evaluating export function 2; Evaluating functions
ERROR running file (../iree/iree/test/e2e/models/bert_encoder_unrolled_fake_weights.mlir): /home/ravishankarm/Development/iree/iree/iree/hal/vulkan/status_util.c:53: RESOURCE_EXHAUSTED; VK_ERROR_OUT_OF_DEVICE_MEMORY; vmaCreateBuffer; while invoking native function hal.allocator.allocate; while calling import; Evaluating export function 2; Evaluating functions
pure virtual method called
terminate called without an active exception
Aborted

With the new path I get this error:

$ ../build-vulkan-tracy/iree/tools/iree-run-mlir -export-all -iree-hal-target-backends=vulkan-spirv -iree-flow-dispatch-linalg-on-tensors -iree-codegen-spirv-experimental-linalg-on-tensors ../iree/iree/test/e2e/models/bert_encoder_unrolled_fake_weights.mlir
I /home/ravishankarm/Development/iree/iree/iree/tools/iree-run-mlir-main.cc:205] Compiling for target backend 'vulkan-spirv*'...
I /home/ravishankarm/Development/iree/iree/iree/tools/iree-run-mlir-main.cc:318] Evaluating all functions in module for driver 'vulkan'...
I /home/ravishankarm/Development/iree/iree/iree/tools/utils/vm_util.cc:258] Creating driver and device for 'vulkan'...
EXEC @serving_default
E /home/ravishankarm/Development/iree/iree/iree/tools/iree-run-mlir-main.cc:469] Failure for split at line #1: /home/ravishankarm/Development/iree/iree/iree/hal/vulkan/status_util.c:53: RESOURCE_EXHAUSTED; VK_ERROR_OUT_OF_DEVICE_MEMORY; vmaCreateBuffer; while invoking native function hal.allocator.allocate; while calling import; Evaluating export function 2; Evaluating functions
ERROR running file (../iree/iree/test/e2e/models/bert_encoder_unrolled_fake_weights.mlir): /home/ravishankarm/Development/iree/iree/iree/hal/vulkan/status_util.c:53: RESOURCE_EXHAUSTED; VK_ERROR_OUT_OF_DEVICE_MEMORY; vmaCreateBuffer; while invoking native function hal.allocator.allocate; while calling import; Evaluating export function 2; Evaluating functions
free(): double free detected in tcache 2
pure virtual method called
terminate called without an active exception
Aborted

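These seem to be different errors: both paths report RESOURCE_EXHAUSTED from vmaCreateBuffer, but the new path additionally hits a double free before aborting.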

ThomasRaoux commented 3 years ago

This seems to pass on Android after Mahesh enabled it to run: https://buildkite.com/iree/iree-android-arm64-v8a/builds/3873#f314defd-3437-45cd-9c10-066fecd3ef51 So the failure has only been seen on NVIDIA GPUs so far.

MaheshRavishankar commented 3 years ago

In any case, it is disabled for all GPUs for now. (I am happy to turn it back on if there is a way to skip it only on NVIDIA hardware.)

hanhanW commented 3 years ago

Probably related to #5162

ThomasRaoux commented 3 years ago

> Probably related to #5162

This is related to running several tests in parallel, right? The failure happens locally for me when running just this single test.

hanhanW commented 3 years ago

oh I see. I didn't notice that it's run manually with iree-run-mlir, sorry.

antiagainst commented 3 years ago

Started to look into this.

antiagainst commented 3 years ago

Looks like it's because we are always allocating memory with both DEVICE_LOCAL_BIT and HOST_VISIBLE_BIT. Combined with DEVICE_LOCAL_BIT, HOST_VISIBLE_BIT means a special heap with a limited total size, and we are exceeding that. For example, on my local RTX 2070 Super, where I can repro this issue, vulkaninfo gives the following:

VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 3
        memoryHeaps[0]:
                size   = 8589934592 (0x200000000) (8.00 GiB)
                budget = 7674331136 (0x1c96d0000) (7.15 GiB)
                usage  = 0 (0x00000000) (0.00 B)
                flags: count = 1
                        MEMORY_HEAP_DEVICE_LOCAL_BIT
        memoryHeaps[1]:
                size   = 25235137536 (0x5e021a400) (23.50 GiB)
                budget = 25235137536 (0x5e021a400) (23.50 GiB)
                usage  = 0 (0x00000000) (0.00 B)
                flags: count = 0
                        None
        memoryHeaps[2]:
                size   = 257949696 (0x0f600000) (246.00 MiB)
                budget = 226754560 (0x0d840000) (216.25 MiB)
                usage  = 31195136 (0x01dc0000) (29.75 MiB)
                flags: count = 1
                        MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 11
        memoryTypes[0]:
                heapIndex     = 1
                propertyFlags = 0x0000: count = 0
                        None
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                None
                        IMAGE_TILING_LINEAR:
                                color images
                                (non-sparse, non-transient)
        ... ...
        memoryTypes[10]:
                heapIndex     = 2
                propertyFlags = 0x0007: count = 3
                        MEMORY_PROPERTY_DEVICE_LOCAL_BIT
                        MEMORY_PROPERTY_HOST_VISIBLE_BIT
                        MEMORY_PROPERTY_HOST_COHERENT_BIT
                usable for:
                        IMAGE_TILING_OPTIMAL:
                                None
                        IMAGE_TILING_LINEAR:
                                None

A simple export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump shows that we are always allocating memory with memoryTypeIndex = 10, which uses memoryHeaps[2]. We only have 246 MiB there. I don't know the heap status for the Tesla T4 used for our CI, but I think it might be something similar.
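To make the arithmetic concrete, here is a minimal Python sketch (not IREE or VMA code) of how requiring both bits pins every allocation to the small heap. The heap sizes and the two memory types are copied from the vulkaninfo dump above; types 1 through 9 were elided there and are left out rather than guessed. The helper names and the 64 MiB buffer size are hypothetical, chosen only for illustration.

```python
# Vulkan memory property bits (values from the Vulkan specification).
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4

MIB = 1024 * 1024

# Heap sizes in bytes, copied from the vulkaninfo dump.
HEAP_SIZES = {0: 8589934592, 1: 25235137536, 2: 257949696}

# Only the two memory types shown in the dump. Types 1-9 were elided
# there and are intentionally not modeled.
MEMORY_TYPES = {
    0: {"heap": 1, "flags": 0x0000},
    10: {"heap": 2, "flags": 0x0007},
}


def find_memory_type(required_flags):
    """Return the lowest listed type index whose flags include all required bits."""
    for index in sorted(MEMORY_TYPES):
        if MEMORY_TYPES[index]["flags"] & required_flags == required_flags:
            return index
    return None


def heap_size_for(type_index):
    """Return the total size in bytes of the heap backing a memory type."""
    return HEAP_SIZES[MEMORY_TYPES[type_index]["heap"]]


def first_failed_allocation(sizes, required_flags):
    """Simulate allocating buffers in order from the selected heap.

    Returns the position of the first buffer that no longer fits, or None
    if every buffer fits. Ignores alignment and driver overhead.
    """
    type_index = find_memory_type(required_flags)
    if type_index is None:
        raise ValueError("no listed memory type satisfies the required flags")
    capacity = heap_size_for(type_index)
    used = 0
    for position, size in enumerate(sizes):
        if used + size > capacity:
            return position
        used += size
    return None
```

With these definitions, `find_memory_type(DEVICE_LOCAL | HOST_VISIBLE)` selects type 10, whose heap holds 246 MiB in total, so a hypothetical run of four 64 MiB buffers fails on the fourth even though the 8 GiB device-local heap is nearly empty.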

This passes on Android because we have unified memory there, so DEVICE_LOCAL_BIT and HOST_VISIBLE_BIT are available for the whole memory, e.g., on Mali G77.

Looking into the HAL code, it seems we have a TODO here, which forces HOST_VISIBLE_BIT on:

https://github.com/google/iree/blob/1386d2cd284bf8babafdf90a0823afabd0e0c9f8/iree/hal/vulkan/vma_allocator.cc#L158-L169

@benvanik: looks like we can just delete the above?

antiagainst commented 3 years ago

Yeah, I can confirm that after deleting L166-L167 in the above, the test passes.

benvanik commented 3 years ago

awesome - good find! we may be able to get rid of the host visible/mapping bit soon (if not already) - part of the reason for it was hal.buffer.fill and such, but now those are all command buffer operations and don't require host-visible memory. lots of GPUs have limits on that memory type because they are limited to the PCI-E aperture size.

benvanik commented 3 years ago

Ah, never mind: with BufferLoadOp we may still need this for now. Both HAL_BufferFillOp and HAL_BufferCopyOp can be removed, though. We need the analysis to know that a particular buffer is read back on the host in order to set the right bit (which is the same analysis I need for allocation). We could at least ensure we aren't setting it for constants and such.

benvanik commented 3 years ago

(detensoring would also help - it's used now for those silly tensor loop iterators)

GMNGeoffrey commented 3 years ago

We're having the same issue with large_cifar10_tests__applications__iree_vulkan__model__ResNet50: https://source.cloud.google.com/results/invocations/ad178b26-b4a9-4ec4-9aa8-80e41e43a8f1/targets/iree%2Fgcp_ubuntu%2Fcmake-bazel%2Flinux%2Fx86-turing%2Fmain/log

antiagainst commented 3 years ago

I have my hands full with a few other bugs to fix. For now I'll disable the ResNet50 test to avoid hiding other potential failures. I'll come back to this after addressing the other issues, if Ben hasn't solved it before I can. :)

benvanik commented 3 years ago

After #5328 lands I can take a look at this. Doing the buffer tracking to know when a readback is required (and thus HOST_VISIBLE must be set) is a good next step for buffer allocation.
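As a rough illustration of that idea, the sketch below picks memory property flags from the set of operations a buffer is used with, setting HOST_VISIBLE only when a host-side load is present. This is not the IREE analysis. The operation tags ("fill", "copy", "load") are hypothetical stand-ins for the HAL ops mentioned earlier in the thread, and the function name is made up for this example.

```python
# Vulkan memory property bits (values from the Vulkan specification).
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2

# Hypothetical tag for operations that read buffer contents on the host.
# Fill and copy are assumed to run as command buffer operations on the
# device, so they do not appear here.
HOST_ACCESS_OPS = frozenset({"load"})


def memory_flags_for(buffer_ops):
    """Return memory property flags for a buffer given the ops that touch it."""
    flags = DEVICE_LOCAL
    if HOST_ACCESS_OPS & set(buffer_ops):
        # Only buffers the host actually reads pay for the limited heap.
        flags |= HOST_VISIBLE
    return flags
```

Under these assumptions, a constant that is only filled or copied on the device gets `DEVICE_LOCAL` alone, while a buffer with a host load gets `DEVICE_LOCAL | HOST_VISIBLE`.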