ray-project / ray_lightning

PyTorch Lightning Distributed Accelerators using Ray

`ray_ddp` global and local rank #175

Closed: JiahaoYao closed this issue 2 years ago

JiahaoYao commented 2 years ago
        function.__self__.strategy.global_rank = self._strategy.global_rank
        function.__self__.strategy.local_rank = self._strategy.local_rank
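
For context, these assignments would sit in the launcher's per-worker wrapping function (the `_wrapping_function` visible in the traceback below). A minimal sketch of the pattern; the surrounding signature is illustrative, not the actual ray_lightning code:

    # Illustrative sketch only. `function` is a bound Trainer method
    # (e.g. Trainer._fit_impl), so function.__self__ is the Trainer and
    # function.__self__.strategy is the strategy this worker will use.
    def _wrapping_function(self, function, args, kwargs):
        # Re-apply the ranks this Ray worker was assigned before the
        # trainer entry point runs in the worker process.
        function.__self__.strategy.global_rank = self._strategy.global_rank
        function.__self__.strategy.local_rank = self._strategy.local_rank
        return function(*args, **kwargs)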

Adding the rank (global and local) as above, but I get the following error:

(plt) ubuntu@ip-10-0-2-100:~/ray_lightning/ray_lightning/examples$ python ray_ddp_example.py 
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=1944) /home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=1944)   new_rank_zero_deprecation(
(RayExecutor pid=1944) /home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=1944)   return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=1944) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=1944) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=1944) distributed_backend=nccl
(RayExecutor pid=1944) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=1944) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=1944) 
(RayExecutor pid=1945) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Traceback (most recent call last):
  File "ray_ddp_example.py", line 173, in <module>
    train_mnist(
  File "ray_ddp_example.py", line 78, in train_mnist
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 61, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 214, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ubuntu/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/worker.py", line 2193, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::RayExecutor.execute() (pid=1945, ip=10.0.2.100, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7fb2e8789340>)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 333, in execute
    return fn(*args, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 239, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1406, in _run_sanity_check
    val_loop._reload_evaluation_dataloaders()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 242, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1965, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 372, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode, model=model)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 459, in _request_dataloader
    dataloader = source.dataloader()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 532, in dataloader
    return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "ray_ddp_example.py", line 46, in val_dataloader
    dataset = self.dataset
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'MNISTClassifier' object has no attribute 'dataset'
(RayExecutor pid=1944) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
(RayExecutor pid=1944) 
(RayExecutor pid=1944)   | Name     | Type     | Params
(RayExecutor pid=1944) --------------------------------------
(RayExecutor pid=1944) 0 | layer_1  | Linear   | 25.1 K
(RayExecutor pid=1944) 1 | layer_2  | Linear   | 2.1 K 
(RayExecutor pid=1944) 2 | layer_3  | Linear   | 650   
(RayExecutor pid=1944) 3 | accuracy | Accuracy | 0     
(RayExecutor pid=1944) --------------------------------------
(RayExecutor pid=1944) 27.9 K    Trainable params
(RayExecutor pid=1944) 0         Non-trainable params
(RayExecutor pid=1944) 27.9 K    Total params
(RayExecutor pid=1944) 0.112     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
(RayExecutor pid=1945) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
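
Note that the AttributeError itself is not about the ranks: torch.nn.Module.__getattr__ raises it for any name that was never assigned, so self.dataset was apparently never set on the worker-side copy of the model (for example, if it is created in prepare_data()/setup() and that hook did not run in the Ray worker process). A hedged sketch of a guard that keeps val_dataloader() working either way; the dataset construction here is illustrative, not the example's actual code:

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader
    from torchvision import transforms
    from torchvision.datasets import MNIST

    class MNISTClassifierSketch(pl.LightningModule):
        def setup(self, stage=None):
            # Plain Python attributes on an nn.Module must be assigned
            # before first use; __getattr__ only resolves registered
            # parameters, buffers, and submodules.
            self.dataset = MNIST(
                "./data", train=False, download=True,
                transform=transforms.ToTensor())

        def val_dataloader(self):
            # Guard: build the dataset lazily if setup() never ran in
            # this (Ray worker) process.
            if not hasattr(self, "dataset"):
                self.setup()
            return DataLoader(self.dataset, batch_size=32)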
JiahaoYao commented 2 years ago

Fixed.