ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Train] `test_lightning_deepspeed` is failing #37374

Closed. bveeramani closed this issue 1 year ago

bveeramani commented 1 year ago

`python/ray/train/tests/test_lightning_deepspeed.py::test_deepspeed_stages[True-3]` has been consistently failing on the release branch (releases/2.6.0) since https://github.com/ray-project/ray/pull/37132 was cherry-picked.

https://buildkite.com/ray-project/oss-ci-build-branch/builds/4892#0189410c-4cd8-499f-80f8-01050df7e1bf
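To reproduce locally, the single failing parametrization can be invoked directly; a minimal sketch, assuming a multi-GPU machine with deepspeed and pytorch-lightning installed alongside a release-branch Ray build:

```python
# Minimal local repro sketch: run only the failing parametrization.
# Equivalent to:
#   pytest -v "python/ray/train/tests/test_lightning_deepspeed.py::test_deepspeed_stages[True-3]"
import sys

import pytest

sys.exit(
    pytest.main(
        ["-v", "python/ray/train/tests/test_lightning_deepspeed.py::test_deepspeed_stages[True-3]"]
    )
)
```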

ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=11725, ip=172.16.16.3, actor_id=b5fcf169524939be7345a18101000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f7dfa009c10>)
  File "/ray/python/ray/train/_internal/worker_group.py", line 32, in __execute
    raise skipped from exception_cause(skipped)
  File "/ray/python/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/ray/python/ray/train/lightning/lightning_trainer.py", line 620, in _lightning_train_loop_per_worker
    trainer.fit(lightning_module, **trainer_fit_params)
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 529, in fit
    call._call_and_handle_interrupt(
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 41, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 91, in launch
    return function(*args, **kwargs)
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 568, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 931, in _run
    self.strategy.setup_environment()
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 143, in setup_environment
    self.setup_distributed()
  File "/opt/miniconda/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 329, in setup_distributed
    _validate_device_index_selection(self.parallel_devices)
  File "/opt/miniconda/lib/python3.8/site-packages/lightning_fabric/strategies/deepspeed.py", line 813, in _validate_device_index_selection
    raise RuntimeError(
RuntimeError: The selected device indices [1] don't match the local rank values of processes. If you need to select GPUs at a specific index, set the `CUDA_VISIBLE_DEVICES` environment variable instead. For example: `CUDA_VISIBLE_DEVICES=1`.
2023-07-10 19:39:08,358 ERROR tune.py:1144 -- Trials did not complete: [LightningTrainer_49976_00000]
2023-07-10 19:39:08,359 WARNING experiment_analysis.py:916 -- Failed to read the results for 1 trials:
- /tmp/pytest-of-root/pytest-2/test_deepspeed_stages_True_3_0/test_deepspeed_stage_3/LightningTrainer_49976_00000_0_2023-07-10_19-38-04
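For context on what the error means: `_validate_device_index_selection` appears to require that the i-th device passed to the strategy has CUDA index i, i.e. the local rank. Each Ray Train worker seems to configure the strategy with only its own device, so the worker assigned GPU 1 passes `[cuda:1]` while the expected indices are `[0]`, and the check raises. A simplified sketch of that check (not the verbatim Lightning code):

```python
from typing import List

import torch


def _validate_device_index_selection(parallel_devices: List[torch.device]) -> None:
    # Simplified sketch of the check in lightning_fabric/strategies/deepspeed.py;
    # not the verbatim upstream code.
    selected = [device.index for device in parallel_devices]
    expected = list(range(len(parallel_devices)))
    if selected != expected:
        raise RuntimeError(
            f"The selected device indices {selected!r} don't match the local rank "
            "values of processes. Set `CUDA_VISIBLE_DEVICES` instead."
        )


# The worker that was assigned GPU 1 builds the strategy with just that one device,
# which reproduces the failure from the traceback above:
_validate_device_index_selection([torch.device("cuda", 1)])  # raises RuntimeError
```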
rickyyx commented 1 year ago

It's failing in master as well - likely a dependency issue?
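If it helps narrow that down, here is a quick sketch to dump the likely suspects' versions; run it in both the failing and the last passing CI environment and diff the output (the package list is my guess at the relevant candidates):

```python
# Print installed versions of the packages most likely involved in the regression.
from importlib.metadata import version

for pkg in ("pytorch-lightning", "lightning-fabric", "torch", "deepspeed", "ray"):
    try:
        print(pkg, version(pkg))
    except Exception:
        print(pkg, "not installed")
```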

zhe-thoughts commented 1 year ago

@bveeramani @rickyyx can we try the same test now without https://github.com/ray-project/ray/pull/37132, to verify if it is indeed a dependency issue?

rickyyx commented 1 year ago

Yes - I had a PR that reverts #37132 on the release branch, but the test is still failing: https://buildkite.com/ray-project/oss-ci-build-pr/builds/28487#01894c4a-3099-4138-86fa-055eac5a3a3f

@woshiyyya Let me know if you need me for anything.

rickyyx commented 1 year ago

So failed job:

Previous passing job:

This is probably the one ^

zhe-thoughts commented 1 year ago

Good find @rickyyx

@matthewdeng @woshiyyya do you think we should pin pytorch-lightning?

woshiyyya commented 1 year ago

Thank you @rickyyx and @zhe-thoughts. I've checked the changelog of pytorch-lightning 2.0.5; they indeed added some checks on the DeepSpeed configuration, which caused this issue: https://lightning.ai/docs/pytorch/stable/generated/CHANGELOG.html

The quick fix would be pinning the library version, but I am still thinking about a long-term solution so that we can fix the root cause.
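For reference, a minimal sketch of what the pin could look like in the test requirements (the exact bound and file are assumptions on my part; 2.0.4 is the last release before the new validation):

```
# Keep pytorch-lightning below the release that added the DeepSpeed
# device-index validation.
pytorch-lightning<2.0.5
```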

woshiyyya commented 1 year ago

I drafted a fix PR. Waiting for the CI to pass. https://github.com/ray-project/ray/pull/37387

bveeramani commented 1 year ago

CI is passing on the PR. Shall we merge?