Open Bensuo opened 1 year ago
We are seeing that our AMDGPU runners become unusable sometimes. When it happens I see the following:
# /opt/rocm-4.5.1/bin/rocminfo
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1143
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
@intel/llvm-reviewers-cuda, @npmiller, any ideas what might be the reason, and are there any preventive measures we can take (like GPU reset for Intel GPUs maybe)?
I wonder if ROCm may be upgraded to 5.x and there are two AMD GPUs for testing.
I wonder if ROCm may be upgraded to 5.x
+ @bader
there are two AMD GPUs for testing
What do you mean by that?
I wonder if ROCm may be upgraded to 5.x
- @bader
@aelovikov-intel, I think @AerialMantis or @npmiller can answer this question.
GPU reset for Intel GPUs maybe
GPU reset is not a preventive measure for issues like this. It helps to recover the GPU state after something bad has happened, but it doesn't prevent the GPU driver from running out of resources.
We've had the "ROCk module is loaded" issue happen as well, but we haven't found a good preventative measure for it either. When it happens, recovery usually requires either reloading the kernel module or rebooting.
And yes it's fine to bump to ROCm 5.x, but I believe this was done already, sorry about the delayed reply.
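Since no preventive measure is known, one option is to at least detect the bad state before a job starts. Below is a minimal sketch of such a check, not something the runners are known to use. It assumes the rocminfo path from the report above; the `ROCMINFO` variable and the `gpu_healthy` function are hypothetical names introduced for illustration. Because it is not confirmed here whether rocminfo exits non-zero on this failure, the sketch checks both the exit status and the output text.

```shell
#!/bin/sh
# Hypothetical pre-job health check for an AMD GPU runner (illustrative only).
# ROCMINFO defaults to the path shown in the report; override it for other installs.
ROCMINFO="${ROCMINFO:-/opt/rocm-4.5.1/bin/rocminfo}"

# Returns 0 if the probe command succeeds and prints no HSA error, 1 otherwise.
gpu_healthy() {
    # A non-zero exit (including "command not found") counts as unhealthy.
    out=$("$ROCMINFO" 2>&1) || return 1
    # Also scan the output, in case the tool reports the error but exits 0.
    case "$out" in
        *HSA_STATUS_ERROR*) return 1 ;;
    esac
    return 0
}

if gpu_healthy; then
    echo "GPU OK"
else
    echo "GPU unhealthy: kernel module reload or reboot needed" >&2
    # Assumption: recovery is site-specific and needs root, for example
    #   modprobe -r amdgpu && modprobe amdgpu
    # which can fail if the module is still in use; a reboot is the fallback.
fi
```

The check only reports the state. Whether to reload the module automatically or take the runner offline is a separate policy decision.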
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.
Pre-commit workflows for PRs are failing when running the E2E test-suite for the HIP backend as it fails to detect a HIP device when starting the tests. First seen affecting #10216 but also seen in other workflow runs on other PRs.
Some links to failed runs: