Closed MaheshRavishankar closed 3 years ago
This seems to pass on Android after Mahesh enabled it to run: https://buildkite.com/iree/iree-android-arm64-v8a/builds/3873#f314defd-3437-45cd-9c10-066fecd3ef51 So the failure has only been seen on NVIDIA GPUs so far.
In any case, it is disabled for all GPUs. (I am happy to turn it back on if there is a way to skip it only on NVIDIA hardware.)
Probably related to #5162
This is related to running several tests in parallel right? This happens locally for me just running this single test.
Oh I see. I didn't notice that it's run manually with iree-run-mlir, sorry.
Started to look into this.
It looks like this is because we are always allocating memory with both DEVICE_LOCAL_BIT and HOST_VISIBLE_BIT. Here HOST_VISIBLE_BIT means a special heap with limited total size, and we are exceeding that. For example, on my local RTX 2070 Super, where I can repro this issue, vulkaninfo gives the following:
VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 3
memoryHeaps[0]:
size = 8589934592 (0x200000000) (8.00 GiB)
budget = 7674331136 (0x1c96d0000) (7.15 GiB)
usage = 0 (0x00000000) (0.00 B)
flags: count = 1
MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryHeaps[1]:
size = 25235137536 (0x5e021a400) (23.50 GiB)
budget = 25235137536 (0x5e021a400) (23.50 GiB)
usage = 0 (0x00000000) (0.00 B)
flags: count = 0
None
memoryHeaps[2]:
size = 257949696 (0x0f600000) (246.00 MiB)
budget = 226754560 (0x0d840000) (216.25 MiB)
usage = 31195136 (0x01dc0000) (29.75 MiB)
flags: count = 1
MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 11
memoryTypes[0]:
heapIndex = 1
propertyFlags = 0x0000: count = 0
None
usable for:
IMAGE_TILING_OPTIMAL:
None
IMAGE_TILING_LINEAR:
color images
(non-sparse, non-transient)
... ...
memoryTypes[10]:
heapIndex = 2
propertyFlags = 0x0007: count = 3
MEMORY_PROPERTY_DEVICE_LOCAL_BIT
MEMORY_PROPERTY_HOST_VISIBLE_BIT
MEMORY_PROPERTY_HOST_COHERENT_BIT
usable for:
IMAGE_TILING_OPTIMAL:
None
IMAGE_TILING_LINEAR:
None
A simple export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump shows that we are always allocating memory with memoryTypeIndex = 10, which uses memoryHeaps[2]. We only have 246 MiB there. I don't know the heap status for the Tesla T4 used for our CI, but I think it might be something similar.
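To make the heap arithmetic concrete, here is a minimal Python sketch of the lookup described above. It is not IREE or Vulkan driver code: the `MemoryType` class and the `find_memory_type` and `fits_in_heap` helpers are hypothetical names, and the selection ignores the `memoryTypeBits` mask a real allocator would also check. Only the two memory types visible in the vulkaninfo output are modeled; the elided entries are left out rather than guessed.

```python
from dataclasses import dataclass

# VkMemoryPropertyFlagBits values as defined by the Vulkan specification.
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2
HOST_COHERENT = 0x4


@dataclass(frozen=True)
class MemoryType:
    index: int
    heap_index: int
    property_flags: int


# Heap sizes in bytes, copied from the vulkaninfo output above.
HEAP_SIZES = {0: 8589934592, 1: 25235137536, 2: 257949696}

# Only the memory types shown in the output; the entries between
# memoryTypes[0] and memoryTypes[10] were elided and are not modeled.
MEMORY_TYPES = [
    MemoryType(index=0, heap_index=1, property_flags=0x0),
    MemoryType(index=10, heap_index=2, property_flags=0x7),
]


def find_memory_type(required_flags, memory_types=MEMORY_TYPES):
    """Return the first memory type that has every required flag, or None."""
    for memory_type in memory_types:
        if (memory_type.property_flags & required_flags) == required_flags:
            return memory_type
    return None


def fits_in_heap(memory_type, total_bytes):
    """Check a total allocation size against the backing heap's size."""
    return total_bytes <= HEAP_SIZES[memory_type.heap_index]


# Requiring both bits can only be satisfied by memoryTypes[10] here.
chosen = find_memory_type(DEVICE_LOCAL | HOST_VISIBLE)
heap_mib = HEAP_SIZES[chosen.heap_index] / 2**20
```

With these inputs, `chosen` is memory type 10 on heap 2, so every allocation that demands both bits competes for the same 246 MiB.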
This is passing on Android because we have unified memory there, so DEVICE_LOCAL_BIT and HOST_VISIBLE_BIT are both available for the whole memory, e.g., on Mali G77.
Looking into the HAL code, it seems we have a TODO here, which forces HOST_VISIBLE_BIT on:
@benvanik: looks like we can just delete the above?
Yeah, I can confirm that after deleting L166-L167 in the above, the test passes.
awesome - good find! we may be able to get rid of the host visible/mapping bit soon (if not already) - part of the reason for it was hal.buffer.fill and such, but now those are all command buffer operations and don't require host-visible memory. lots of GPUs have limits on that memory type because they are limited to the PCI-E aperture size.
Ah nevermind with BufferLoadOp we may still need this for now - both HAL_BufferFillOp and HAL_BufferCopyOp can be removed though. We need the analysis to know that a particular buffer is read back on the host to set the right bit (which is the same analysis I need for allocation). We could at least ensure we aren't setting it for constants and such.
(detensoring would also help - it's used now for those silly tensor
We're having the same issue with large_cifar10_tests__applications__iree_vulkan__model__ResNet50: https://source.cloud.google.com/results/invocations/ad178b26-b4a9-4ec4-9aa8-80e41e43a8f1/targets/iree%2Fgcp_ubuntu%2Fcmake-bazel%2Flinux%2Fx86-turing%2Fmain/log
I have my hands full with a few other bugs to fix. For now I'll disable the ResNet50 test to avoid hiding other potential failures. I'll come back to this after addressing the other issues, if Ben hasn't solved it before then. :)
After #5328 lands I can take a look at this. Doing the buffer tracking to know when a readback is required (and thus HOST_VISIBLE must be set) is a good next step for buffer allocation.
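As a rough illustration of that idea (not the actual HAL change), the sketch below only requests HOST_VISIBLE when a buffer is known to be read back on the host. The function name `memory_properties_for` and its `read_back_on_host` parameter are hypothetical; in practice the flag would come from the buffer tracking analysis mentioned above.

```python
# VkMemoryPropertyFlagBits values as defined by the Vulkan specification.
DEVICE_LOCAL = 0x1
HOST_VISIBLE = 0x2


def memory_properties_for(read_back_on_host: bool) -> int:
    """Pick memory property flags for a buffer (hypothetical helper).

    Every buffer asks for device-local memory; host visibility is added
    only when the host actually needs to map and read the buffer.
    """
    flags = DEVICE_LOCAL
    if read_back_on_host:
        flags |= HOST_VISIBLE
    return flags


# Constants and intermediates stay off the small host-visible heap.
constant_flags = memory_properties_for(read_back_on_host=False)
# Result buffers that the host reads still request host visibility.
result_flags = memory_properties_for(read_back_on_host=True)
```

Under this scheme only the readback buffers would count against the limited host-visible heap.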
With #5264 Mobilebert was changed to run in CI as well (there was a missing iree.symbol.export that prevented it). With this, the CPU side runs fine, but the GPU/Vulkan-SPIRV side fails to run. With the old path I get this error; with the new path I get this error.
Seems to be different errors.