After some digging, I found that moving results to CPU on rank zero leads to this bottleneck. Is there actually anything in `_RayOutput` here that still lives on GPU such that it needs to be moved to CPU?
FYI: When I replace

```python
if trainer.strategy.local_rank == 0:
    return move_data_to_device(results, "cpu")
```

with

```python
if trainer.strategy.local_rank == 0:
    return results
```
it seems to work fine. Is there a case where the first version is actually needed?
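For reference, here is a minimal sketch of what I mean: only pay the device-transfer cost if something in the results actually lives on a CUDA device. The helper names (`_any_cuda_tensor`, `maybe_move_to_cpu`) are made up for illustration, and I'm assuming the results are plain nested dicts/lists/tuples of tensors.

```python
import torch


def _any_cuda_tensor(obj) -> bool:
    # Recursively check a (possibly nested) result structure for CUDA tensors.
    if torch.is_tensor(obj):
        return obj.is_cuda
    if isinstance(obj, dict):
        return any(_any_cuda_tensor(v) for v in obj.values())
    if isinstance(obj, (list, tuple)):
        return any(_any_cuda_tensor(v) for v in obj)
    return False


def maybe_move_to_cpu(results):
    # Only call move_data_to_device when something actually lives on GPU;
    # otherwise return the results untouched.
    if _any_cuda_tensor(results):
        from pytorch_lightning.utilities import move_data_to_device
        return move_data_to_device(results, "cpu")
    return results
```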
I noticed that teardown of the Ray workers takes exceptionally long when I use `RayStrategy` to train slightly larger models such as `torchvision.models.resnet18`, `torch.nn.LSTM`, or `torch.nn.Transformer`. I used the example from `ray_ddp_example.py` and replaced the model with a ResNet and the data with CIFAR10 to reproduce the issue. When I run vanilla PTL (set `run="ptl"`) the model finishes as expected. However, with `run="tune"` or `run="ptl_ray"` the teardown takes over a minute. I also noticed that memory usage increases during teardown when using `RayStrategy`. Is there something wrong in my setup, or is this the expected behavior? If you need any additional information, please let me know.
Thanks in advance!
Below is the conda environment that I use: