jdchn opened this issue 2 years ago
What happened + What you expected to happen
1. Training a PyTorch-based policy with Tune inside a container results in an error:
   CUDA error: the provided PTX was compiled with an unsupported toolchain
2. The expected behavior is no error.
3. This is on a DGX A100 using NGC PyTorch as a base image (see the attached Dockerfile.txt).
However:
- Trainer.train() inside the container → no error.
- The error is not encountered if the base image is reverted to 21.10-py3.

Relevant configuration parameters:
Inside the container, PyTorch and PyTorch extensions are built with TORCH_CUDA_ARCH_LIST='5.2;6.0;6.1;7.0;7.5;8.0;8.6+PTX'. This was confirmed with cuobjdump. Binaries for CUDA 11 should be minor-version compatible per the CUDA Compatibility Guide, but the PTX is not expected to be compatible.
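For reference, a rough way to inspect which SASS and PTX targets are embedded in the PyTorch CUDA library (a sketch, not the exact commands used for this report):

```bash
# Locate the torch CUDA library inside the container.
TORCH_LIB=$(python -c "import os, torch; print(os.path.join(os.path.dirname(torch.__file__), 'lib', 'libtorch_cuda.so'))")

# Embedded SASS (cubin) architectures: expect sm_52 ... sm_86 per TORCH_CUDA_ARCH_LIST above.
cuobjdump --list-elf "$TORCH_LIB" | grep -o 'sm_[0-9]*' | sort -u

# Embedded PTX entries: only the 8.6+PTX target should contribute PTX.
cuobjdump --list-ptx "$TORCH_LIB"
```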
One difference between training with Tune versus Trainer.train() seems to be whether the trainer is run in the driver process or in a worker process. One hypothesis is that something about a Ray worker process causes the CUDA runtime to select PTX over SASS.
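A minimal way to probe that hypothesis, independent of RLlib (a sketch, not code from this report), is to run the same CUDA op once in the driver and once in a Ray worker:

```python
import ray
import torch

@ray.remote(num_gpus=1)
def cuda_op_in_worker():
    # Same operation as below, but executed inside a Ray worker process.
    x = torch.randn(256, 256, device="cuda")
    return float((x @ x).sum())

ray.init()

# Driver process: expected to behave like Trainer.train() (no error).
x = torch.randn(256, 256, device="cuda")
print("driver:", float((x @ x).sum()))

# Worker process: the hypothesis predicts the
# "provided PTX was compiled with an unsupported toolchain" error here.
print("worker:", ray.get(cuda_op_in_worker.remote()))
```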
Attempting to force SASS selection with CUDA environment variables (i.e., CUDA_DISABLE_PTX_JIT=1) results in a different CUDA error:
CUDA error: PTX JIT compilation was disabled
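For completeness, one way (not necessarily the one used here) to make sure such a variable reaches Ray worker processes in addition to the driver is Ray's runtime_env:

```python
import ray

# Illustrative only: export CUDA_DISABLE_PTX_JIT=1 into every Ray worker process.
ray.init(runtime_env={"env_vars": {"CUDA_DISABLE_PTX_JIT": "1"}})
```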
This second error occurs with both Tune and Trainer.train(). I have not attempted to rebuild PyTorch.
This may not be a Ray issue, but the most apparent symptom is different behavior between Tune and Trainer.train().

Traceback:
Versions / Dependencies
Reproduction script
To start the container:
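The exact invocation is not shown here; a sketch of the kind of commands implied by the setup above (image tag, shm size, and options are assumptions):

```bash
# Illustrative only; the real image tag, mounts, and options may differ.
docker build -f Dockerfile.txt -t rllib-ngc .
docker run --rm -it --gpus all --shm-size=16g rllib-ngc bash
```

Inside the container, a minimal script of the kind described above (again a sketch, not the original reproduction script; PPO and CartPole-v1 are stand-ins) contrasts the two code paths:

```python
# Illustrative sketch only; not the original reproduction script.
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer  # Ray 1.x-era API, matching Trainer.train() above

config = {
    "env": "CartPole-v1",
    "framework": "torch",   # PyTorch-based policy
    "num_gpus": 1,          # GPU for the trainer
    "num_workers": 2,       # CPU rollout workers
}

ray.init()

# Case 1: Trainer.train() in the driver process -- reported to work.
trainer = PPOTrainer(config=config)
print(trainer.train()["episode_reward_mean"])
trainer.stop()

# Case 2: the same config through Tune, where the trainer runs in a Ray
# worker process -- reported to fail with the PTX toolchain error.
tune.run("PPO", config=config, stop={"training_iteration": 1})
```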
Issue Severity
Medium: It is a significant difficulty but I can work around it.