usegalaxy-au / infrastructure

Galaxy Australia's Ansible scripts
MIT License

set pqgpu3 offline #2229

Closed by cat-bro 1 month ago

cat-bro commented 1 month ago

Every job on pulsar-qld-gpu3 has failed since about this time last month. Some jobs are failing with the error: "INTERNAL: Failed to launch CUDA kernel". This error has not been seen on any of the other pulsars while they have been in production.

cat-bro commented 1 month ago

Why don't they tell us? Why don't the users send tickets?

cat-bro commented 1 month ago

The last job to run there has this (repeated many times) in stderr:

 2024-10-11 12:04:12.865450: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:695] could not allocate CUDA stream for context 0x6afb460: CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered
 2024-10-11 12:04:12.865503: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
 2024-10-11 12:04:12.865599: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
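
Since nobody reported these failures, one option might be to scan job stderr for CUDA driver error names like the ones above and flag any that suggest a hardware fault. Below is a minimal sketch of that idea, not something that exists in this repo. The function names and the `HARDWARE_FAULT_CODES` set are hypothetical, and which codes count as a hardware fault is an assumption (only `CUDA_ERROR_ECC_UNCORRECTABLE` is listed, since that is what appeared here).

```python
import re
from collections import Counter

# Matches CUDA driver error names such as CUDA_ERROR_ECC_UNCORRECTABLE.
CUDA_ERROR_RE = re.compile(r"\bCUDA_ERROR_[A-Z0-9_]+\b")

# Assumption: codes treated as a sign of a GPU hardware fault rather than
# a problem with the job itself. Extend as needed.
HARDWARE_FAULT_CODES = {"CUDA_ERROR_ECC_UNCORRECTABLE"}


def count_cuda_errors(stderr_text):
    """Return a Counter of the CUDA_ERROR_* names found in the text."""
    return Counter(CUDA_ERROR_RE.findall(stderr_text))


def looks_like_hardware_fault(stderr_text):
    """Return True if any code in HARDWARE_FAULT_CODES appears in the text."""
    found = count_cuda_errors(stderr_text)
    return any(code in HARDWARE_FAULT_CODES for code in found)


if __name__ == "__main__":
    # Shortened versions of the stderr lines quoted in this issue.
    sample = (
        "cuda_driver.cc:695] could not allocate CUDA stream for context "
        "0x6afb460: CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered\n"
        "stream.cc:297] failed to allocate stream during initialization\n"
        "cuda_driver.cc:614] unable to add host callback: "
        "CUDA_ERROR_INVALID_HANDLE: invalid resource handle\n"
    )
    print(count_cuda_errors(sample))
    print(looks_like_hardware_fault(sample))
```

How the stderr text would be collected (from the Galaxy database, the Pulsar staging directory, or elsewhere) is left open here. The sketch only covers the matching step.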