[Core][RLlib][Tune] CUDA PTX error when training with Tune

What happened + What you expected to happen

1. Training a PyTorch-based policy with Tune inside a container results in an error:

   CUDA error: the provided PTX was compiled with an unsupported toolchain

2. The expected behavior is no error.

3. This is on a DGX A100 using NGC PyTorch as a base image (see the attached Dockerfile.txt).

However...

Training with Tune outside the container → No error
Training with Trainer.train() inside the container → No error
Training with Tune inside the container with the default model → No error

The error is not encountered if the base image is reverted to 21.10-py3.

Relevant configuration parameters:

config['framework'] = "torch"
config['num_workers'] = 0
config['num_gpus'] = 1

Inside the container, PyTorch and the PyTorch extensions are built with TORCH_CUDA_ARCH_LIST='5.2;6.0;6.1;7.0;7.5;8.0;8.6+PTX'. This was confirmed with cuobjdump.
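To make the arch list above easier to read, here is a minimal sketch that splits a TORCH_CUDA_ARCH_LIST string into the architectures that get native code (SASS) and those that also get PTX. The helper name split_arch_list is hypothetical and not part of PyTorch. It handles only numeric entries such as 8.6 or 8.6+PTX, not named architectures.

```python
def split_arch_list(arch_list):
    """Split a TORCH_CUDA_ARCH_LIST string into (sass_archs, ptx_archs).

    Hypothetical helper for illustration; handles numeric entries only.
    """
    sass, ptx = [], []
    # Entries may be separated by semicolons or spaces.
    for entry in arch_list.replace(";", " ").split():
        if entry.endswith("+PTX"):
            arch = entry[: -len("+PTX")]
            sass.append(arch)  # a "+PTX" entry gets SASS as well as PTX
            ptx.append(arch)
        else:
            sass.append(entry)
    return sass, ptx


sass, ptx = split_arch_list("5.2;6.0;6.1;7.0;7.5;8.0;8.6+PTX")
print("SASS:", sass)
print("PTX: ", ptx)
```

With the list used here, only compute capability 8.6 carries PTX, while the A100 (compute capability 8.0) has its own SASS entry.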

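Under CUDA 11 minor-version compatibility, precompiled binaries (SASS) should run on the host's older driver per the CUDA Compatibility Guide, but PTX is not expected to be compatible: PTX generated by a newer toolkit (CUDA 11.6 in the 22.02 image) cannot be JIT-compiled by an older driver (470.82.01, CUDA 11.4).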
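The compatibility rule can be sketched as a simple version comparison. This is a simplification of the CUDA Compatibility Guide, assuming PTX JIT needs a driver whose CUDA version is at least that of the toolkit that produced the PTX. The function name ptx_jit_expected_to_work is hypothetical. The version numbers are the ones listed under Versions / Dependencies.

```python
def ptx_jit_expected_to_work(driver_cuda, toolkit_cuda):
    """Return True if the driver is at least as new as the toolkit.

    Simplified rule of thumb, not an official API: versions are
    (major, minor) tuples compared lexicographically.
    """
    return tuple(driver_cuda) >= tuple(toolkit_cuda)


host_driver = (11, 4)  # driver 470.82.01
images = {"22.02-py3": (11, 6), "21.10-py3": (11, 4)}
for image, toolkit in images.items():
    print(image, ptx_jit_expected_to_work(host_driver, toolkit))
```

Under this assumption the 22.02 image fails the check and the 21.10 image passes, which is consistent with the error disappearing when the base image is reverted.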

One difference between training with Tune versus Trainer.train() seems to be whether the trainer is run in the driver process or a worker process. One hypothesis is that something about a Ray worker process causes the CUDA runtime to select PTX over SASS.
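One way to probe this hypothesis is to compare the CUDA-related environment of the driver process with that of a Ray worker. The sketch below is a diagnostic aid, not a fix. The helper names are hypothetical, and the choice of which variables to compare (CUDA_*, NVIDIA_*, LD_LIBRARY_PATH) is an assumption about what might matter.

```python
import os


def cuda_env_snapshot(environ=None):
    """Collect environment variables that may influence CUDA behavior."""
    environ = os.environ if environ is None else environ
    return {
        key: value
        for key, value in environ.items()
        if key.startswith(("CUDA_", "NVIDIA_")) or key == "LD_LIBRARY_PATH"
    }


def diff_snapshots(driver, worker):
    """Map each differing key to its (driver_value, worker_value) pair."""
    keys = sorted(set(driver) | set(worker))
    return {
        key: (driver.get(key), worker.get(key))
        for key in keys
        if driver.get(key) != worker.get(key)
    }


def snapshot_in_ray_worker():
    """Take the same snapshot inside a Ray task that reserves one GPU."""
    import ray  # imported lazily so the helpers above work without Ray

    @ray.remote(num_gpus=1)
    def _snapshot():
        return cuda_env_snapshot()

    return ray.get(_snapshot.remote())


# Usage (requires Ray and a GPU):
# print(diff_snapshots(cuda_env_snapshot(), snapshot_in_ray_worker()))
```

An empty diff would suggest that the difference lies elsewhere, for example in import order or in which extension libraries are loaded first.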

Attempting to force SASS selection with CUDA environment variables (i.e., CUDA_DISABLE_PTX_JIT=1) results in a different CUDA error:

CUDA error: PTX JIT compilation was disabled

This error occurs with both Tune and Trainer.train().
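For context, the following sketch checks whether a given arch list contains native code that a device can run without PTX JIT. It assumes the usual binary compatibility rule: SASS built for compute capability X.Y runs on devices with the same major version X and a minor version of at least Y. The function name has_native_sass is hypothetical.

```python
def has_native_sass(device_cc, sass_archs):
    """Return True if some SASS arch can run natively on the device.

    device_cc is a (major, minor) tuple; sass_archs are strings like "8.0".
    """
    major, minor = device_cc
    for arch in sass_archs:
        arch_major, arch_minor = (int(part) for part in arch.split("."))
        if arch_major == major and arch_minor <= minor:
            return True
    return False


archs = ["5.2", "6.0", "6.1", "7.0", "7.5", "8.0", "8.6"]
print(has_native_sass((8, 0), archs))  # A100 is compute capability 8.0
```

Since this check passes for the arch list reported above, the "PTX JIT compilation was disabled" error may indicate that at least one loaded binary was built with a different arch list. That is a guess and has not been verified.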

I have not attempted to rebuild PyTorch.

This may not be a Ray issue, but the most apparent symptom is different behavior between Tune and Trainer.train().

Traceback:

ray::MyPPOTrainer.__init__()
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 830, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 149, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 911, in setup
    self.workers = WorkerSet(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 162, in __init__
    self._local_worker = self._make_worker(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 567, in _make_worker
    worker = cls(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 626, in __init__
    self._build_policy_map(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1722, in _build_policy_map
    self.policy_map.create_policy(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 152, in create_policy
    self[policy_id] = class_(observation_space, action_space, merged_config)
  File "./bin/learnconn_demo.py", line 133, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 40, in __init__
    TorchPolicy.__init__(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 232, in __init__
    self.model_gpu_towers.append(model_copy.to(self.devices[i]))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Versions / Dependencies

Host:
- Ubuntu 20.04.3
- NVIDIA Driver 470.82.01
- CUDA 11.4
- Python 3.8.10
- PyTorch 1.11 (CUDA 11.3)
Containers:
- NGC PyTorch nvcr.io/nvidia/pytorch:22.02-py3 (CUDA 11.6, PyTorch 1.11)
- NGC PyTorch nvcr.io/nvidia/pytorch:21.10-py3 (CUDA 11.4, PyTorch 1.10)
Packages:
- requirements.txt
- PyTorch Geometric 2.0.4
- PyTorch Scatter 2.0.9
- PyTorch Cluster 1.6.0
- PyTorch Sparse 0.6.13
- PyTorch Spline-Based Conv 1.2.1

Reproduction script

To start the container:

  docker run -it --rm --name=${container_name} \
    -u $(id -u):$(id -g) -e USER=$USER -e HOME=$HOME -v $HOME:$HOME \
    -v "${container_work}":/workspace/vol \
    --workdir="/workspace/vol" \
    --gpus '"device='${gpudevice}'"' \
    --shm-size=32g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    ${image_name}

To reproduce the error, run the following script inside the container:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import ray.rllib.agents.ppo
import ray.rllib.models
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
import ray.tune
import numpy as np
import torch
from torch_geometric.nn import MLP

class DebugModel(TorchModelV2, torch.nn.Module):
    def __init__(self, obs_space, act_space, num_outputs, config, name):
        TorchModelV2.__init__(self, obs_space, act_space, num_outputs, config, name)
        torch.nn.Module.__init__(self)

        in_ch = int(np.prod(obs_space.shape))

        self._policy_model = MLP(in_channels=in_ch, hidden_channels=1,
                                 out_channels=num_outputs, num_layers=1,
                                 batch_norm=False)
        self._value_model = MLP(in_channels=in_ch, hidden_channels=1,
                                out_channels=1, num_layers=1,
                                batch_norm=False)

        self._values = None

    @override(TorchModelV2)
    def forward(self, input_dict, state, seq_lens):
        obs = input_dict["obs_flat"].float()
        obs = obs.reshape(obs.shape[0], -1)

        logits = self._policy_model(obs)
        self._values = self._value_model(obs).squeeze(1)

        return logits, state

    @override(TorchModelV2)
    def value_function(self):
        return self._values

ray.rllib.models.ModelCatalog.register_custom_model("DebugModel", DebugModel)

config = ray.rllib.agents.ppo.DEFAULT_CONFIG.copy()
config['framework'] = "torch"
config['env'] = "CartPole-v0"
config['num_workers'] = 0
config['num_gpus'] = 1
config['model'] = ray.rllib.models.MODEL_DEFAULTS.copy()
config['model']['custom_model'] = 'DebugModel'

# → CUDA error: the provided PTX was compiled with an unsupported toolchain
ray.tune.run("PPO", config=config, stop={'training_iteration': 1})

# → No error
# trainer = ray.rllib.agents.ppo.PPOTrainer(config)
# trainer.train()

Issue Severity

Medium: It is a significant difficulty but I can work around it.
