arunppsg closed this issue 6 months ago with the following comment:
My code had the following lines in the train_loop to report metrics:

if i % 20 == 0:
    ray.train.report(metrics={"loss": loss.detach().cpu().item()})

But as the docs say here, train.report() has to be called on each worker. If I am not wrong, the conditional prevents train.report from being called on each worker, since it is only reached when i % 20 == 0. Removing the conditional fixes the issue.
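For clarity, here is a minimal sketch of the adjusted worker loop. The model, data, and config values are placeholders rather than my actual script; the only point is that the `if i % 20 == 0` guard is gone, so train.report() is called on every worker the same number of times.

```python
import torch
import torch.nn as nn
import ray.train
from ray.train.torch import prepare_model, prepare_data_loader

def train_loop_per_worker(config):
    # Toy model and synthetic data standing in for the real setup.
    model = prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    dataset = torch.utils.data.TensorDataset(
        torch.randn(256, 10), torch.randn(256, 1)
    )
    dataloader = prepare_data_loader(
        torch.utils.data.DataLoader(dataset, batch_size=32)
    )

    for epoch in range(config["num_epochs"]):
        for i, (inputs, targets) in enumerate(dataloader):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()

            # Unconditional report: every worker calls train.report()
            # on every iteration, as the docs require.
            ray.train.report(metrics={"loss": loss.detach().cpu().item()})
```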
What happened + What you expected to happen
Training on a single node with a single GPU works, but when I scale the training to multi-node multi-GPU, the training hangs (probably in the loss.backward() call). Log details which I observed:
Versions / Dependencies
Ray: 2.9.2
OS: Ubuntu 20.04
PyTorch: 2.2.0+cu121
Reproduction script
The puzzling part for me is that when I tried other, similar training scripts in the same environment, they worked.
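My original script is long, so below is a rough stand-in that approximates how I launch the job (worker count, GPU settings, and config values are placeholders, not my actual cluster layout, and this is not a verified reproduction). It runs the train_loop_per_worker sketched above across multiple GPU workers with TorchTrainer; adding back the `if i % 20 == 0` guard around train.report() gives the pattern where I saw the hang.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Placeholder scaling: e.g. 4 workers spread over 2 nodes with 2 GPUs each.
trainer = TorchTrainer(
    train_loop_per_worker,                  # the loop sketched above
    train_loop_config={"num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
print(result.metrics)
```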
Issue Severity
High: It blocks me from completing my task.