iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

ASAN issues on hip / rocm runtime #17769

Open suryajasper opened 2 days ago

suryajasper commented 2 days ago

What happened?

I am seeing memory leaks due to the ROCm & HIP runtime backends in IREE. These issues can be reproduced simply by initializing and releasing a hip / rocm HAL device:

==1685580==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 792 byte(s) in 11 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7efefb74d1ab  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x611ab)

Direct leak of 72 byte(s) in 1 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7efefb738b5a  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x4cb5a)
    #2 0x7efefb72f3c5  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x433c5)
    #3 0x7efefb744cee  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x58cee)
    #4 0x7eff0456e555  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x33c555)

Direct leak of 72 byte(s) in 1 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7efefb738b5a  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x4cb5a)
    #2 0x7efefb72f3c5  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x433c5)
    #3 0x7efefb72a76f  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x3e76f)
    #4 0x7efefb744cee  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x58cee)
    #5 0x7eff0456e555  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x33c555)

Indirect leak of 2464 byte(s) in 7 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff042ed2db  (/opt/rocm-6.1.0/lib/libamdhip64.so+0xbb2db)

Indirect leak of 2464 byte(s) in 7 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff042ed1b1  (/opt/rocm-6.1.0/lib/libamdhip64.so+0xbb1b1)

Indirect leak of 1680 byte(s) in 7 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff042c1a06  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x8fa06)

Indirect leak of 592 byte(s) in 1 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff04569be5  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x337be5)

Indirect leak of 560 byte(s) in 14 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff042f0d6b  (/opt/rocm-6.1.0/lib/libamdhip64.so+0xbed6b)

Indirect leak of 480 byte(s) in 10 object(s) allocated from:
    #0 0x7eff0965aa57 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:154
    #1 0x7efefb7fee33  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x112e33)

Indirect leak of 192 byte(s) in 8 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff04558659  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x326659)

Indirect leak of 176 byte(s) in 1 object(s) allocated from:
    #0 0x7eff0965a887 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x7eff0456aaf1  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x338af1)

Indirect leak of 104 byte(s) in 1 object(s) allocated from:
    #0 0x7eff0965c1e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7eff0455849d  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x32649d)

SUMMARY: AddressSanitizer: 9648 byte(s) leaked in 69 allocation(s).

@AWoloszyn has produced a more detailed callstack

Steps to reproduce your issue

  1. Use a build of IREE with either ROCm or HIP backend support and ASan enabled.
  2. The following MRE produces memory leaks before any buffers are dispatched or any workloads are run. Replacing the device URI with local-sync, local-task, cuda, or any other device besides rocm or hip runs without memory leaks.

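    // Assumed include; the original snippet omits its headers.
    #include "iree/runtime/api.h"

    // CHECK_IREE_STATUS is the reporter's own error-checking macro; its
    // definition is not shown in this issue.
    int main(int argc, char** argv) {
      iree_status_t status;

      // Create a runtime instance with all available drivers registered.
      iree_runtime_instance_t* instance = NULL;
      iree_runtime_instance_options_t instance_options;
      iree_runtime_instance_options_initialize(&instance_options);
      iree_runtime_instance_options_use_all_available_drivers(&instance_options);
      status = iree_runtime_instance_create(&instance_options,
                                            iree_allocator_system(), &instance);
      CHECK_IREE_STATUS(status, "Failed to create runtime instance");

      // Create the HAL device; the leaks only appear with rocm:// or hip:// URIs.
      iree_hal_device_t* device = NULL;
      status = iree_hal_create_device(
          iree_runtime_instance_driver_registry(instance),
          iree_make_cstring_view("rocm://<device_id>"),
          iree_runtime_instance_host_allocator(instance), &device);
      CHECK_IREE_STATUS(status, "Failed to create HAL device");

      // Release in reverse order of creation; no work is ever submitted.
      iree_hal_device_release(device);
      iree_runtime_instance_release(instance);
      return 0;
    }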
  3. For leaks that appear during execution, running iree-run-module with a rocm or hip device also fails under ASan:
    sudo ./build/tools/iree-run-module  --module=/home/surya/gemmaiperf/benchmarking/ireekernels/kernels/vmfb/gemm_1024_1024_1024_fp32.vmfb --input=1024x1024xf32 --input=1024x1024xf32 --device=rocm://7
    
    EXEC @main_0
    result[0]: hal.buffer_view
    1024x1024xf32=[0 0 0 0 0 ... [...][...][...][...][...][...][...][...]

=================================================================
==1536727==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 3096 byte(s) in 43 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b4d1ab  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x611ab) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)

Direct leak of 144 byte(s) in 2 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b38b5a  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x4cb5a) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #2 0x7efc69b2f3c5  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x433c5) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #3 0x7efc69b44cee  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x58cee) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #4 0x7efc7296e555  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x33c555) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Direct leak of 72 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b38b5a  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x4cb5a) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #2 0x7efc69b2f3c5  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x433c5) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #3 0x7efc69b2a76f  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x3e76f) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #4 0x7efc69b44cee  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x58cee) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #5 0x7efc7296e555  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x33c555) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b4cfe3  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x60fe3) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #2 0x60400028594f  (<unknown module>)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x558cc63f1e68 in __interceptor_calloc (/home/surya/iree/build/tools/iree-run-module+0xefe68) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x558cc642fb65 in iree_allocator_system_ctl (/home/surya/iree/build/tools/iree-run-module+0x12db65) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #2 0x558cc652c6de in iree_hal_rocm_pipeline_layout_create (/home/surya/iree/build/tools/iree-run-module+0x22a6de) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #3 0x558cc65afd57 in iree_hal_module_pipeline_layout_create module.c
    #4 0x558cc65f94c4 in iree_vm_native_module_issue_call native_module.c

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b4cfe3  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x60fe3) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #2 0x604000285a0f  (<unknown module>)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b4cfe3  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x60fe3) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #2 0x60400028598f  (<unknown module>)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69b4cfe3  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x60fe3) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)
    #2 0x6040002859cf  (<unknown module>)

Indirect leak of 2464 byte(s) in 7 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc726ed2db  (/opt/rocm-6.1.0/lib/libamdhip64.so+0xbb2db) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 2464 byte(s) in 7 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc726ed1b1  (/opt/rocm-6.1.0/lib/libamdhip64.so+0xbb1b1) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 2016 byte(s) in 42 object(s) allocated from:
    #0 0x558cc63f1e68 in __interceptor_calloc (/home/surya/iree/build/tools/iree-run-module+0xefe68) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc69bfee33  (/opt/rocm-6.1.0/lib/libhsa-runtime64.so.1+0x112e33) (BuildId: 8d959daaea86c81c390e2ab66d68dc48580adbf8)

Indirect leak of 1680 byte(s) in 7 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc726c1a06  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x8fa06) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 592 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc72969be5  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x337be5) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 560 byte(s) in 14 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc726f0d6b  (/opt/rocm-6.1.0/lib/libamdhip64.so+0xbed6b) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 192 byte(s) in 8 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc72958659  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x326659) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 176 byte(s) in 1 object(s) allocated from:
    #0 0x558cc63f1c7e in malloc (/home/surya/iree/build/tools/iree-run-module+0xefc7e) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc7296aaf1  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x338af1) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 104 byte(s) in 1 object(s) allocated from:
    #0 0x558cc642ca4d in operator new(unsigned long) (/home/surya/iree/build/tools/iree-run-module+0x12aa4d) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x7efc7295849d  (/opt/rocm-6.1.0/lib/libamdhip64.so+0x32649d) (BuildId: 88e08f2e18e348127a752bf4bb9fc922e63e32e4)

Indirect leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x558cc63f1e68 in __interceptor_calloc (/home/surya/iree/build/tools/iree-run-module+0xefe68) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #1 0x558cc642fb65 in iree_allocator_system_ctl (/home/surya/iree/build/tools/iree-run-module+0x12db65) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #2 0x558cc652c3b8 in iree_hal_rocm_descriptor_set_layout_create (/home/surya/iree/build/tools/iree-run-module+0x22a3b8) (BuildId: b8a67ead7be48d93a3641264913c9551ee6f65bc)
    #3 0x558cc65a7e10 in iree_hal_module_descriptor_set_layout_create module.c
    #4 0x558cc65f94c4 in iree_vm_native_module_issue_call native_module.c

SUMMARY: AddressSanitizer: 13872 byte(s) leaked in 140 allocation(s).

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

suryajasper commented 2 days ago

mentioning @AWoloszyn - don't have permissions to assign

ScottTodd commented 2 days ago

Could suppress leaks from the driver:

We used to have a sanitizer_suppressions.txt in the repo containing:

leak:libGLX_nvidia.so

(when we ran GPU tests with ASan on CI, now we just run CPU tests with ASan)
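
For reference, the same kind of suppression can also be compiled into a test binary instead of living in a separate file. The sketch below uses the `__lsan_default_suppressions` hook that LeakSanitizer looks up in the program; the two library names are taken from the leak report in this issue, and whether suppressing them is appropriate is an assumption that should be checked first.

```c
// Sketch only: default LeakSanitizer suppressions built into the binary.
// LeakSanitizer calls __lsan_default_suppressions() when the program defines
// it and treats the returned text like the contents of a suppressions file.
// Assumption: the library names below come from the leak report in this
// issue; each rule hides any leak whose stack has a frame in that library.
const char* __lsan_default_suppressions(void) {
  return "leak:libhsa-runtime64.so\n"
         "leak:libamdhip64.so\n";
}
```

The file-based equivalent is to put the same two `leak:` lines in a text file and point `LSAN_OPTIONS=suppressions=<path>` at it. The IREE-side stacks in the report (the ones through `iree_allocator_system_ctl`) contain no frames from these two libraries, so rules like these should leave them visible.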

benvanik commented 2 days ago

some of these may be legit - before suppressing do some due diligence (the iree_hal_rocm_descriptor_set_layout_create for example is definitely a leak on our side, and may indicate a place we are hanging on to resources we shouldn't be)
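
As an illustration of what "hanging on to resources" can look like with a ref-counted object, here is a minimal, self-contained sketch. All names (`layout_t`, `layout_create`, `simulate_teardown`, and so on) are hypothetical and are not IREE's types or functions, and this is not a diagnosis of the actual bug: it only shows how one unbalanced retain keeps an allocation alive past teardown, which LeakSanitizer would then report as a direct leak.

```c
#include <stdlib.h>

// Hypothetical ref-counted resource; not IREE's actual API.
typedef struct layout_t {
  int ref_count;
} layout_t;

// Number of layouts currently alive, tracked so an imbalance is observable.
static int live_layouts = 0;

// Creates a layout owned by the caller (ref_count starts at 1).
layout_t* layout_create(void) {
  layout_t* layout = (layout_t*)calloc(1, sizeof(*layout));
  if (!layout) return NULL;
  layout->ref_count = 1;
  ++live_layouts;
  return layout;
}

// Adds one reference on behalf of a new holder.
void layout_retain(layout_t* layout) { ++layout->ref_count; }

// Drops one reference and frees the object when the last one goes away.
void layout_release(layout_t* layout) {
  if (layout && --layout->ref_count == 0) {
    free(layout);
    --live_layouts;
  }
}

// Simulates create -> second holder retains -> creator releases.
// Returns NULL if the object was fully released, or the still-live object
// if the second holder never dropped its reference.
layout_t* simulate_teardown(int holder_releases) {
  layout_t* layout = layout_create();
  if (!layout) return NULL;
  layout_retain(layout);   // a second holder (e.g. a cache) takes a reference
  layout_release(layout);  // the creator drops its own reference
  if (holder_releases) {
    layout_release(layout);  // balanced: the object is freed here
    return NULL;
  }
  return layout;  // unbalanced: ref_count == 1 and nothing will free it
}
```

Calling `simulate_teardown(0)` and discarding the result reproduces the pattern: the allocation from `calloc` is never freed, so a leak checker would attribute it to the `layout_create` call site, much like the `..._layout_create` frames in the report above.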