ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34k stars 5.78k forks

[Core][RLlib][Tune] CUDA PTX error when training with Tune #25001

Open jdchn opened 2 years ago

jdchn commented 2 years ago

What happened + What you expected to happen

Training a PyTorch-based policy with Tune inside a container results in an error:

CUDA error: the provided PTX was compiled with an unsupported toolchain

The expected behavior is no error.

This is on a DGX A100 using NGC PyTorch as a base image. Dockerfile.txt

However, the error is not encountered if the base image is reverted to 21.10-py3.

Relevant configuration parameters:

config['framework'] = "torch"
config['num_workers'] = 0
config['num_gpus'] = 1

Inside the container, PyTorch and PyTorch extensions are built with TORCH_CUDA_ARCH_LIST='5.2;6.0;6.1;7.0;7.5;8.0;8.6+PTX'. This was confirmed with cuobjdump.

Binaries for CUDA 11 should be minor-version compatible per the CUDA Compatibility Guide, but the PTX is not expected to be compatible.
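A minimal sketch of the architecture-matching rule may help frame the SASS-versus-PTX question. It assumes the usual CUDA rules: a cubin (SASS) runs on a device with the same major compute capability and an equal or higher minor, and PTX can be JIT-compiled for any device with an equal or higher compute capability. The helper names parse_arch_list and expected_load_path are hypothetical. The sketch does not model the separate toolkit/driver version check behind the "unsupported toolchain" error.

```python
def parse_arch_list(arch_list):
    """Split a TORCH_CUDA_ARCH_LIST-style string into SASS and PTX targets.

    Assumption: an entry such as '8.6+PTX' produces both SASS for 8.6
    and PTX for 8.6; entries without '+PTX' produce SASS only.
    """
    sass, ptx = [], []
    for entry in arch_list.replace(" ", ";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        has_ptx = entry.endswith("+PTX")
        version = entry[:-len("+PTX")] if has_ptx else entry
        major, minor = version.split(".")
        cc = (int(major), int(minor))
        sass.append(cc)
        if has_ptx:
            ptx.append(cc)
    return sass, ptx


def expected_load_path(device_cc, sass, ptx):
    """Return which kind of kernel image a device is expected to load."""
    # SASS is usable within the same major version when built for an
    # equal or lower minor version.
    if any(c[0] == device_cc[0] and c[1] <= device_cc[1] for c in sass):
        return "sass"
    # PTX can be JIT-compiled for any equal or newer compute capability.
    if any(c <= device_cc for c in ptx):
        return "ptx-jit"
    return "none"


if __name__ == "__main__":
    sass, ptx = parse_arch_list("5.2;6.0;6.1;7.0;7.5;8.0;8.6+PTX")
    # An A100 reports compute capability 8.0.
    print(expected_load_path((8, 0), sass, ptx))
```

Under these assumptions an A100 (compute capability 8.0) should find matching SASS in a build with this arch list, so a PTX-related error suggests that some loaded module was built differently or that a different code path is involved.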

One difference between training with Tune versus Trainer.train() seems to be whether the trainer is run in the driver process or a worker process. One hypothesis is that something about a Ray worker process causes the CUDA runtime to select PTX over SASS.
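One way to probe this hypothesis is to compare CUDA-related environment variables between the driver process and a worker process. The following is a diagnostic sketch only: the helper names and the choice of variable prefixes are assumptions, and it has not been run against this setup.

```python
import os

# Assumption: these prefixes cover the variables most likely to differ
# between the driver and a Ray worker (e.g. CUDA_VISIBLE_DEVICES).
PREFIXES = ("CUDA_", "NVIDIA_", "LD_")


def cuda_env_snapshot(environ=None):
    """Return the CUDA-related subset of an environment mapping."""
    env = os.environ if environ is None else environ
    return {k: v for k, v in env.items() if k.startswith(PREFIXES)}


def diff_snapshots(driver, worker):
    """Map each differing variable to its (driver, worker) values."""
    keys = sorted(set(driver) | set(worker))
    return {
        k: (driver.get(k), worker.get(k))
        for k in keys
        if driver.get(k) != worker.get(k)
    }


if __name__ == "__main__":
    # Print this process's snapshot; run once in the driver and once
    # inside the worker, then compare the two outputs.
    print(cuda_env_snapshot())
```

A snapshot could be taken in the driver script and again inside the custom policy's __init__, which runs in a worker process under Tune, and the two passed to diff_snapshots. Ray sets CUDA_VISIBLE_DEVICES for workers that are assigned GPUs, so that variable is expected to differ; any other difference would be worth a closer look.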

Attempting to force SASS selection with CUDA environment variables (i.e., CUDA_DISABLE_PTX_JIT=1) results in a different CUDA error:

CUDA error: PTX JIT compilation was disabled

This error occurs with both Tune and Trainer.train().

I have not attempted to rebuild PyTorch.

This may not be a Ray issue, but the most apparent symptom is different behavior between Tune and Trainer.train().

Traceback:

ray::MyPPOTrainer.__init__()
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 830, in __init__
    super().__init__(
  File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 149, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 911, in setup
    self.workers = WorkerSet(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 162, in __init__
    self._local_worker = self._make_worker(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 567, in _make_worker
    worker = cls(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 626, in __init__
    self._build_policy_map(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1722, in _build_policy_map
    self.policy_map.create_policy(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 152, in create_policy
    self[policy_id] = class_(observation_space, action_space, merged_config)
  File "./bin/learnconn_demo.py", line 133, in __init__
    super().__init__(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 40, in __init__
    TorchPolicy.__init__(
  File "/opt/conda/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 232, in __init__
    self.model_gpu_towers.append(model_copy.to(self.devices[i]))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: the provided PTX was compiled with an unsupported toolchain.

Versions / Dependencies

Reproduction script

To start the container:

  docker run -it --rm --name=${container_name} \
    -u $(id -u):$(id -g) -e USER=$USER -e HOME=$HOME -v $HOME:$HOME \
    -v "${container_work}":/workspace/vol \
    --workdir="/workspace/vol" \
    --gpus '"device='${gpudevice}'"' \
    --shm-size=32g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    ${image_name}

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import ray.rllib.agents.ppo
import ray.rllib.models
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.annotations import override
import ray.tune
import numpy as np
import torch
from torch_geometric.nn import MLP

class DebugModel(TorchModelV2, torch.nn.Module):
    def __init__(self, obs_space, act_space, num_outputs, config, name):
        TorchModelV2.__init__(self, obs_space, act_space, num_outputs, config, name)
        torch.nn.Module.__init__(self)

        # np.prod is the supported spelling; np.product is a deprecated alias.
        in_ch = int(np.prod(obs_space.shape))

        self._policy_model = MLP(in_channels=in_ch, hidden_channels=1,
                                 out_channels=num_outputs, num_layers=1,
                                 batch_norm=False)
        self._value_model = MLP(in_channels=in_ch, hidden_channels=1,
                                out_channels=1, num_layers=1,
                                batch_norm=False)

        self._values = None

    @override(TorchModelV2)
    def forward(self, input_dict, state, seq_lens):
        obs = input_dict["obs_flat"].float()
        obs = obs.reshape(obs.shape[0], -1)

        logits = self._policy_model(obs)
        self._values = self._value_model(obs).squeeze(1)

        return logits, state

    @override(TorchModelV2)
    def value_function(self):
        return self._values

ray.rllib.models.ModelCatalog.register_custom_model("DebugModel", DebugModel)

config = ray.rllib.agents.ppo.DEFAULT_CONFIG.copy()
config['framework'] = "torch"
config['env'] = "CartPole-v0"
config['num_workers'] = 0
config['num_gpus'] = 1
config['model'] = ray.rllib.models.MODEL_DEFAULTS.copy()
config['model']['custom_model'] = 'DebugModel'

# → CUDA error: the provided PTX was compiled with an unsupported toolchain
ray.tune.run("PPO", config=config, stop={'training_iteration': 1})

# → No error
# trainer = ray.rllib.agents.ppo.PPOTrainer(config)
# trainer.train()

Issue Severity

Medium: It is a significant difficulty but I can work around it.

stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.