ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

`ray_ddp` gpu issue #179

Open JiahaoYao opened 2 years ago

JiahaoYao commented 2 years ago
ray::ImplicitFunc.train() (pid=27359, ip=172.31.59.24, repr=_inner_train)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trainable.py", line 360, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 404, in step
    self._report_thread_runner_error(block=True)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 574, in _report_thread_runner_error
    raise e
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 277, in run
    self._entrypoint()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 349, in entrypoint
    return self._trainable_func(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/function_runner.py", line 645, in _trainable_func
    output = fn()
  File "test_tune.py", line 37, in _inner_train
    trainer.fit(model)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 62, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 224, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ray/default/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27475, ip=172.31.59.24, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7f2c3c105610>)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 356, in execute
    return fn(*args, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 256, in _wrapping_function
    self._strategy.set_cuda_device_if_used()
  File "/home/ray/default/ray_lightning/ray_lightning/ray_ddp.py", line 233, in set_cuda_device_if_used
    torch.cuda.set_device(self.root_device)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Running the test fails with `CUDA error: invalid device ordinal` (full traceback above).

JiahaoYao commented 2 years ago

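from ray import tune
from ray_lightning import RayStrategy
from ray_lightning.tune import TuneReportCallback, get_tune_resources

# train_func is not shown in this snippet; the traceback suggests it lives in
# test_tune.py and wraps the `_inner_train` function that calls trainer.fit().


def tune_test(dir, strategy):
    # Report metrics back to Tune after every validation run.
    callbacks = [TuneReportCallback(on="validation_end")]
    analysis = tune.run(
        train_func(dir, strategy, callbacks=callbacks),
        config={"max_epochs": tune.choice([1, 2, 3])},
        # Reserve resources for the trial and its GPU workers.
        resources_per_trial=get_tune_resources(
            num_workers=strategy.num_workers, use_gpu=strategy.use_gpu),
        num_samples=2)
    assert all(analysis.results_df["training_iteration"] ==
               analysis.results_df["config.max_epochs"])


def test_tune_iteration_ddp():
    """Tests if each RayStrategy runs the correct number of iterations."""
    tmpdir = './'
    # Two workers with use_gpu=True is the configuration that triggers the error.
    strategy = RayStrategy(num_workers=2, use_gpu=True)
    tune_test(tmpdir, strategy)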

This is the code to reproduce the error.

JiahaoYao commented 2 years ago

https://github.com/Lightning-AI/lightning/issues/2407

JiahaoYao commented 2 years ago

It seems to be a GPU ID issue: the worker fails when it calls `torch.cuda.set_device(self.root_device)`.
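
One way to test this hypothesis (a sketch, not a confirmed diagnosis): Ray normally sets `CUDA_VISIBLE_DEVICES` for each worker that requests GPUs, and CUDA renumbers the visible devices starting from 0. If a worker that can only see one GPU tries to select index 1, CUDA reports `invalid device ordinal`. The helpers below are hypothetical names, not part of ray_lightning; they only inspect the environment variable and do not call CUDA.

```python
import os


def visible_gpu_ids(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def is_valid_ordinal(index, env=None):
    """Check whether a process-local device index fits the visible devices.

    Returns None when CUDA_VISIBLE_DEVICES is unset, because the device
    count is then only known to the CUDA runtime.
    """
    ids = visible_gpu_ids(env)
    if ids is None:
        return None
    # Visible devices are always renumbered 0..len(ids)-1 inside the process.
    return 0 <= index < len(ids)


if __name__ == "__main__":
    # Example: a worker that was handed physical GPU 1 only.
    worker_env = {"CUDA_VISIBLE_DEVICES": "1"}
    print(visible_gpu_ids(worker_env))      # physical IDs the worker can see
    print(is_valid_ordinal(0, worker_env))  # local index 0 maps to GPU 1
    print(is_valid_ordinal(1, worker_env))  # local index 1 does not exist
```

Calling `is_valid_ordinal(self.root_device.index)` inside the worker just before `torch.cuda.set_device` would show whether the strategy is passing a global GPU index where a process-local one is expected.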