If we add a print here, in `Trainer._run_train`:
```python
def _run_train(self) -> None:
    self._pre_training_routine()

    # with isolate_rng():
    #     self._run_sanity_check()

    # enable train mode
    self.model.train()
    torch.set_grad_enabled(True)

    self.fit_loop.trainer = self

    # debug: print which process and Ray node this training run lives on
    import ray
    import os
    print('123', os.getpid(), ray.runtime_context.get_runtime_context().node_id)
    import time
    # time.sleep(1000)

    with torch.autograd.set_detect_anomaly(self._detect_anomaly):
        self.fit_loop.run()
```
then from the printed `(pid, node_id)` pairs we can confirm that there is only one PID on each worker.
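For completeness, a minimal sketch of how that claim can be checked from the debug prints above; the log file name `train.log` and the exact `'123 <pid> <node_id>'` line format are assumptions here, not part of the original setup.

```python
from collections import defaultdict

# Group the PIDs seen in the debug print above by Ray node id.
# Assumes the prints were captured to "train.log" and kept the
# '123 <pid> <node_id>' format used in _run_train above.
pids_per_node = defaultdict(set)
with open("train.log") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "123":
            _, pid, node_id = parts
            pids_per_node[node_id].add(pid)

# Each node should report exactly one training PID.
for node_id, pids in pids_per_node.items():
    print(node_id, "->", sorted(pids))
```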
We even tried plain SGD, and the issue was not fixed.
Suspicion: this is probably an optimizer issue. Optimizers like Adam store first- and second-order momentum estimates per parameter; could that state be getting messed up by the duplicated runs?
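As background for that suspicion (this snippet is an illustration, not from the original report): Adam keeps per-parameter `exp_avg` and `exp_avg_sq` buffers in `optimizer.state`, which is the state that would be at risk if the same optimizer were stepped from more than one place.

```python
import torch

# Tiny model just to show where Adam keeps its momentum buffers.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

# After one step, each parameter has 'step', 'exp_avg', 'exp_avg_sq' entries.
for param, state in optimizer.state.items():
    print(tuple(param.shape), sorted(state.keys()))
```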
Also, if we print a message in the `run` function (in `loops/`), the output shows three concurrent threads going through this function. The high-level bits are that some calls come from `self.fit_loop.run()` (this is expected) and some come from `self.optimizer_loop.run(split_batch, optimizers, batch_idx)` (this is not expected).
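A hedged sketch of how such a print can be added without editing the installed package, by wrapping the base loop class; the import path `pytorch_lightning.loops.base.Loop` is an assumption and differs between Lightning versions.

```python
import functools
import os
import threading

from pytorch_lightning.loops.base import Loop  # path assumed; check your Lightning version

_original_run = Loop.run

@functools.wraps(_original_run)
def _traced_run(self, *args, **kwargs):
    # Log which loop subclass is running, and in which process/thread,
    # to see whether the extra calls come from fit_loop or optimizer_loop.
    print(
        f"{type(self).__name__}.run pid={os.getpid()} "
        f"thread={threading.current_thread().name}"
    )
    return _original_run(self, *args, **kwargs)

# Apply before trainer.fit(); loop subclasses that override run() will not pick this up.
Loop.run = _traced_run
```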