ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

`ray_horovod` multi pid process in the `run` #182

Open JiahaoYao opened 2 years ago

JiahaoYao commented 2 years ago

Suspicion: this is probably an optimizer issue. Optimizers like Adam store first- and second-order momentum; could that state be getting messed up across the processes?
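
For context, a minimal sketch (a toy parameter, not code from this repo) of the per-parameter state that Adam carries, which is the state suspected of getting tangled between processes:

    import torch

    # toy example: Adam keeps per-parameter running averages (exp_avg, exp_avg_sq)
    # in optimizer.state; SGD without momentum keeps no such buffers
    param = torch.nn.Parameter(torch.randn(4))
    opt = torch.optim.Adam([param], lr=1e-3)

    (param ** 2).sum().backward()
    opt.step()

    print(sorted(opt.state[param].keys()))  # ['exp_avg', 'exp_avg_sq', 'step']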

Also, if we print a message in the `run` function (`pytorch_lightning/loops/base.py`):

        print(f"run entry")
        #import traceback
        #traceback.print_stack()
        if self.skip:
            return self.on_skip()

        self.reset()

        self.on_run_start(*args, **kwargs)

        import os
        print(f'{os.getpid()}')
        count = 0
        while not self.done:
            try:
                self.on_advance_start(*args, **kwargs)
                self.advance(*args, **kwargs)
                self.on_advance_end()
                self._restarting = False
                import os
                print(f'i am in the {count} round, pid: {os.getpid()}')
                from time import sleep
                if count == 3:
                    sleep(100)
                count += 1
            except StopIteration:
                break
        self._restarting = False

        output = self.on_run_end()

we will see three concurrent threads going through this function; the output looks like this:

    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 266, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 211, in run
    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 196, in run
    traceback.print_stack()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/workers/default_worker.py", line 238, in <module>
    ray.worker.global_worker.main_loop()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 451, in main_loop
    self.core_worker.run_task_loop()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 675, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 462, in _resume_span
    return method(self, *_args, **_kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/horovod/ray/worker.py", line 61, in execute
    return func(self.executable)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/horovod/ray/runner.py", line 622, in <lambda>
    f = lambda w: fn(*args, **kwargs)
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_horovod_launcher.py", line 111, in _func
    return self._wrapping_function(function, model_ref, new_args, kwargs,
  File "/home/ray/default/ray_lightning/ray_lightning/launchers/ray_horovod_launcher.py", line 174, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1358, in _run_train
    self.fit_loop.run()

The high-level takeaway: some of these stacks come from `self.fit_loop.run()` (this is expected), and some come from `self.optimizer_loop.run(split_batch, optimizers, batch_idx)` (this is not expected).
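
One way to tell whether these are separate OS processes or just concurrent threads inside the same worker process is to print both the pid and the thread id at the entry of `run` (a small debugging sketch; the `tag` helper is just illustrative):

    import os
    import threading

    def tag():
        # same pid but different thread ids -> concurrent threads in one worker
        # different pids                    -> extra processes were spawned
        return f"pid={os.getpid()} tid={threading.get_ident()}"

    # e.g. at the top of Loop.run:
    #     print(f"run entry {tag()}")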

JiahaoYao commented 2 years ago

If we print out here:

    def _run_train(self) -> None:
        self._pre_training_routine()

        # with isolate_rng():
        #     self._run_sanity_check()

        # enable train mode
        self.model.train()
        torch.set_grad_enabled(True)

        self.fit_loop.trainer = self
        import ray
        import os
        # debug: report the pid and the Ray node this trainer runs on
        print('123', os.getpid(), ray.runtime_context.get_runtime_context().node_id)
        # import time
        # time.sleep(1000)
        with torch.autograd.set_detect_anomaly(self._detect_anomaly):
            self.fit_loop.run()

we can confirm that there is only one pid on each worker.
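
To tie each of these prints back to a specific Horovod worker, the rank can also be logged alongside the pid and node id (a sketch, assuming `hvd.init()` was already called by the Horovod launcher and the print runs inside the Ray actor):

    import os
    import ray
    import horovod.torch as hvd

    # assumes hvd.init() has already been run by the launcher
    print(f"rank={hvd.rank()}/{hvd.size()} "
          f"pid={os.getpid()} "
          f"node={ray.get_runtime_context().node_id}")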

JiahaoYao commented 2 years ago

Even after switching to SGD, the issue is not fixed.
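
For reference, that swap amounts to something like the following (a minimal, hypothetical `configure_optimizers`, not the actual model used here): plain SGD stores no first/second-order momentum, yet the extra `optimizer_loop.run` callers still show up:

    import torch

    def configure_optimizers(self):
        # plain SGD (no momentum buffers) instead of Adam;
        # the duplicate run() callers are still observed with this optimizer
        return torch.optim.SGD(self.parameters(), lr=0.1)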