Closed (cat-bro closed this 1 month ago)
Why don't they tell us? Why don't the users send tickets?
The last job to run there has this (many times) in stderr:
2024-10-11 12:04:12.865450: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:695] could not allocate CUDA stream for context 0x6afb460: CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered
2024-10-11 12:04:12.865503: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2024-10-11 12:04:12.865599: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
Every job on pulsar-qld-gpu3 has failed since about this time last month. Some jobs are failing with the error: "INTERNAL: Failed to launch CUDA kernel". This error has not been seen on any of the other pulsars while they have been in production.
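Since users are not reporting these failures, one option would be to scan job stderr for CUDA driver error codes ourselves. The following is only a rough sketch, not something that exists in our setup: the function names and the `HARDWARE_ERRORS` set are hypothetical, and which codes count as a hardware fault is an assumption based on the ECC error seen above.

```python
import re
from collections import Counter

# Matches CUDA driver error codes such as CUDA_ERROR_ECC_UNCORRECTABLE.
CUDA_ERROR_RE = re.compile(r"\bCUDA_ERROR_[A-Z0-9_]+\b")

# Assumption: codes we choose to treat as a sign of a faulty GPU
# rather than a problem with the job itself.
HARDWARE_ERRORS = {"CUDA_ERROR_ECC_UNCORRECTABLE"}


def count_cuda_errors(stderr_text):
    """Count how often each CUDA_ERROR_* code appears in a job's stderr."""
    return Counter(CUDA_ERROR_RE.findall(stderr_text))


def looks_like_hardware_fault(stderr_text):
    """Return True if stderr contains any code listed in HARDWARE_ERRORS."""
    counts = count_cuda_errors(stderr_text)
    return any(code in HARDWARE_ERRORS for code in counts)


# Two messages taken from the stderr quoted above (timestamps and paths trimmed).
sample = (
    "could not allocate CUDA stream for context 0x6afb460: "
    "CUDA_ERROR_ECC_UNCORRECTABLE: uncorrectable ECC error encountered\n"
    "unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle\n"
)

if __name__ == "__main__":
    print(count_cuda_errors(sample))
    print(looks_like_hardware_fault(sample))
```

A check like this could run over the stderr of each failed job and flag the destination when a hardware-type code shows up, so a node like pulsar-qld-gpu3 gets noticed without waiting for a ticket.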