Closed: JiahaoYao closed this issue 2 years ago.
```python
function.__self__.strategy.global_rank = self._strategy.global_rank
function.__self__.strategy.local_rank = self._strategy.local_rank
```
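For context, the effect of that patch can be sketched with plain stand-in objects. The `Launcher` and `Trainer` classes below are hypothetical stand-ins, not the real ray_lightning or pytorch_lightning classes; only the `function.__self__.strategy` attribute access mirrors the snippet:

```python
from types import SimpleNamespace

class Launcher:
    """Hypothetical stand-in for the Ray launcher that knows the real ranks."""
    def __init__(self, global_rank, local_rank):
        self._strategy = SimpleNamespace(global_rank=global_rank,
                                         local_rank=local_rank)

    def patch_ranks(self, function):
        # Mirrors the snippet above: `function` is a bound method, so
        # `function.__self__` is the object it is bound to (the trainer),
        # and the launcher's ranks are copied onto that object's strategy.
        function.__self__.strategy.global_rank = self._strategy.global_rank
        function.__self__.strategy.local_rank = self._strategy.local_rank

class Trainer:
    """Hypothetical stand-in whose strategy starts with no ranks set."""
    def __init__(self):
        self.strategy = SimpleNamespace(global_rank=None, local_rank=None)

    def fit(self):
        pass

trainer = Trainer()
Launcher(global_rank=1, local_rank=0).patch_ranks(trainer.fit)
print(trainer.strategy.global_rank, trainer.strategy.local_rank)  # 1 0
```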
I tried adding the ranks (global and local) as above, but got the following error:
```
(plt) ubuntu@ip-10-0-2-100:~/ray_lightning/ray_lightning/examples$ python ray_ddp_example.py
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
(RayExecutor pid=1944) /home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:53: LightningDeprecationWarning: pytorch_lightning.utilities.warnings.rank_zero_deprecation has been deprecated in v1.6 and will be removed in v1.8. Use the equivalent function from the pytorch_lightning.utilities.rank_zero module instead.
(RayExecutor pid=1944)   new_rank_zero_deprecation(
(RayExecutor pid=1944) /home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/utilities/warnings.py:58: LightningDeprecationWarning: ParallelStrategy.torch_distributed_backend was deprecated in v1.6 and will be removed in v1.8.
(RayExecutor pid=1944)   return new_rank_zero_deprecation(*args, **kwargs)
(RayExecutor pid=1944) Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
(RayExecutor pid=1944) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=1944) distributed_backend=nccl
(RayExecutor pid=1944) All distributed processes registered. Starting with 2 processes
(RayExecutor pid=1944) ----------------------------------------------------------------------------------------------------
(RayExecutor pid=1944)
(RayExecutor pid=1945) Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Traceback (most recent call last):
  File "ray_ddp_example.py", line 173, in <module>
    train_mnist(
  File "ray_ddp_example.py", line 78, in train_mnist
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 61, in launch
    ray_output = self.run_function_on_workers(
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 214, in run_function_on_workers
    results = process_results(self._futures, self.tune_queue)
  File "/home/ubuntu/ray_lightning/ray_lightning/util.py", line 62, in process_results
    ray.get(ready)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/ray/_private/worker.py", line 2193, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::RayExecutor.execute() (pid=1945, ip=10.0.2.100, repr=<ray_lightning.launchers.ray_launcher.RayExecutor object at 0x7fb2e8789340>)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 333, in execute
    return fn(*args, **kwargs)
  File "/home/ubuntu/ray_lightning/ray_lightning/launchers/ray_launcher.py", line 239, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1406, in _run_sanity_check
    val_loop._reload_evaluation_dataloaders()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 242, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1965, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 372, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode, model=model)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 459, in _request_dataloader
    dataloader = source.dataloader()
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 532, in dataloader
    return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "ray_ddp_example.py", line 46, in val_dataloader
    dataset = self.dataset
  File "/home/ubuntu/anaconda3/envs/plt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'MNISTClassifier' object has no attribute 'dataset'
(RayExecutor pid=1944) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
(RayExecutor pid=1944)
(RayExecutor pid=1944)   | Name     | Type     | Params
(RayExecutor pid=1944) --------------------------------------
(RayExecutor pid=1944) 0 | layer_1  | Linear   | 25.1 K
(RayExecutor pid=1944) 1 | layer_2  | Linear   | 2.1 K
(RayExecutor pid=1944) 2 | layer_3  | Linear   | 650
(RayExecutor pid=1944) 3 | accuracy | Accuracy | 0
(RayExecutor pid=1944) --------------------------------------
(RayExecutor pid=1944) 27.9 K    Trainable params
(RayExecutor pid=1944) 0         Non-trainable params
(RayExecutor pid=1944) 27.9 K    Total params
(RayExecutor pid=1944) 0.112     Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]
(RayExecutor pid=1945) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
```
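The root cause is in the last frame: `torch.nn.Module.__getattr__` raises for any attribute that was never assigned, so `self.dataset` must be set before the sanity check calls `val_dataloader`. A minimal sketch of the failure mode and the fix, using a plain-Python mimic of `nn.Module.__getattr__` (the `MNISTClassifier` body here is an assumption for illustration, not the example's real code):

```python
class ModuleLike:
    # Mimics torch.nn.Module.__getattr__: looking up an attribute that
    # was never assigned raises AttributeError with this message format.
    def __getattr__(self, name):
        raise AttributeError("'{}' object has no attribute '{}'".format(
            type(self).__name__, name))

class MNISTClassifier(ModuleLike):
    def __init__(self, prepare=False):
        if prepare:
            # The fix: assign the dataset before any dataloader hook runs,
            # e.g. in __init__ or in LightningModule.setup().
            self.dataset = ["fake-mnist-sample"]

    def val_dataloader(self):
        return self.dataset  # raises if `dataset` was never assigned

try:
    MNISTClassifier().val_dataloader()
    msg = None
except AttributeError as e:
    msg = str(e)
print(msg)  # 'MNISTClassifier' object has no attribute 'dataset'

fixed = MNISTClassifier(prepare=True)
print(fixed.val_dataloader())  # ['fake-mnist-sample']
```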
Fixed.