[SYCL][HIP][E2E] Pre-commit workflow test-suite can't find HIP device

intel / llvm

[SYCL][HIP][E2E] Pre-commit workflow test-suite can't find HIP device #10615

Open Bensuo opened 1 year ago

Bensuo commented 1 year ago

Pre-commit workflows for PRs are failing when running the E2E test-suite for the HIP backend: the suite fails to detect a HIP device when the tests start. This was first seen affecting #10216 but has also appeared in workflow runs on other PRs.

Some links to failed runs:

aelovikov-intel commented 1 year ago

We are seeing that our AMDGPU runners become unusable sometimes. When it happens I see the following:

# /opt/rocm-4.5.1/bin/rocminfo
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

@intel/llvm-reviewers-cuda, @npmiller, any ideas what the reason might be, and whether there are any preventive measures we can take (like GPU reset for Intel GPUs maybe)?
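One way to at least detect this state before the E2E tests start is a pre-flight probe on the runner. The following is a minimal Python sketch, not something taken from the intel/llvm CI: it assumes the rocminfo path shown in the output above, and the names `classify_rocminfo` and `runner_is_healthy` are hypothetical.

```python
import subprocess

# Error marker copied from the rocminfo output quoted in this thread.
OUT_OF_RESOURCES = "HSA_STATUS_ERROR_OUT_OF_RESOURCES"


def classify_rocminfo(returncode: int, output: str) -> str:
    """Classify a rocminfo run as 'ok', 'out_of_resources', or 'failed'."""
    if OUT_OF_RESOURCES in output:
        return "out_of_resources"
    if returncode != 0:
        return "failed"
    return "ok"


def runner_is_healthy(rocminfo: str = "/opt/rocm-4.5.1/bin/rocminfo") -> bool:
    """Run rocminfo (path is an assumption) and report whether the runner looks usable."""
    try:
        proc = subprocess.run(
            [rocminfo], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung probe both count as unhealthy.
        return False
    return classify_rocminfo(proc.returncode, proc.stdout + proc.stderr) == "ok"
```

A workflow step could call `runner_is_healthy()` and skip or fail fast, so the failure is reported as a runner problem rather than as a test failure. This only detects the condition; it does not prevent it.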

jinz2014 commented 1 year ago

I wonder if ROCm may be upgraded to 5.x and there are two AMD GPUs for testing.

aelovikov-intel commented 1 year ago

> I wonder if ROCm may be upgraded to 5.x

+ @bader

> there are two AMD GPUs for testing

What do you mean by that?

bader commented 1 year ago

> > I wonder if ROCm may be upgraded to 5.x
>
> + @bader

@aelovikov-intel, I think @AerialMantis or @npmiller can answer this question.

> GPU reset for Intel GPUs maybe

GPU reset is not a preventive measure for issues like this. It helps to recover the GPU state after something bad has happened, but it doesn't prevent the GPU driver from running out of resources.

npmiller commented 1 year ago

We've had the "ROCk module is loaded" issue happen as well, but we haven't found a good preventive measure for it either. When it happens, it usually requires either reloading the kernel module or a reboot.

And yes, it's fine to bump to ROCm 5.x, but I believe this was done already. Sorry about the delayed reply.
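The recovery order described here (try a kernel module reload first, fall back to a reboot) could be sketched roughly as follows. This is only an illustration: `recover_runner`, `probe`, and `reload_module` are hypothetical names, and the actual probe and reload commands are left to the caller. On typical ROCm installs the reload would probably be a `modprobe -r` / `modprobe` of the `amdgpu` module, but that module name is an assumption and is not stated in this thread.

```python
from typing import Callable


def recover_runner(
    probe: Callable[[], bool],
    reload_module: Callable[[], None],
    max_reloads: int = 1,
) -> str:
    """Return 'healthy', 'recovered', or 'needs_reboot'.

    probe: hypothetical callable returning True when the GPU is usable.
    reload_module: hypothetical callable that reloads the kernel module.
    """
    if probe():
        return "healthy"
    for _ in range(max_reloads):
        reload_module()
        if probe():
            return "recovered"
    # Reloading did not help; the caller should schedule a reboot.
    return "needs_reboot"
```

Keeping the probe and the reload as injected callables means the escalation logic can be exercised without touching a real GPU or needing root on the runner.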

github-actions[bot] commented 3 days ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.