When running with default settings I get a warning:
/scratch/users/sfarr/miniconda3/envs/tmd/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:448: The combination of `DataLoader(`pin_memory=True`, `persistent_workers=True`) and `Trainer(reload_dataloaders_every_n_epochs > 0)` can lead to instability due to limitations in PyTorch (https://github.com/pytorch/pytorch/issues/91252). We recommend setting `pin_memory=False` in this case.
Training runs for a long time but eventually crashes with something like the following (using 4 GPUs):
File "/home/steve/miniconda3/envs/tmd/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
[rank: 1] Child process with PID 224779 terminated with code 1. Forcefully terminating all other processes to avoid zombies :zombie:
I can make it stable by setting the number of workers to zero and applying #322.
But I can see that performance on 4 GPUs is slightly lower (4.8 it/s vs 5.0 it/s) than with num_workers=4, which eventually crashes.
These options should be settable in the config YAML. I do not know the performance impact of either.
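As a sketch of what exposing these options could look like (the `dataloader` key and the defaults below are hypothetical, not this project's actual schema; `num_workers`, `pin_memory`, and `persistent_workers` are the standard `torch.utils.data.DataLoader` arguments):

```python
# Hypothetical config handling: merge user-supplied dataloader options (as
# parsed from the config YAML) with defaults, so num_workers and pin_memory
# can be changed without editing code.

DEFAULTS = {
    "num_workers": 4,          # faster, but crashes as described above
    "pin_memory": True,
    "persistent_workers": True,
}

def dataloader_kwargs(config: dict) -> dict:
    """Return DataLoader keyword arguments, letting the config override defaults."""
    kwargs = {**DEFAULTS, **config.get("dataloader", {})}
    # persistent_workers requires num_workers > 0; drop it when workers are disabled
    if kwargs["num_workers"] == 0:
        kwargs["persistent_workers"] = False
    return kwargs

# The stable-but-slower setup described above: no worker processes, no pinning.
stable = dataloader_kwargs({"dataloader": {"num_workers": 0, "pin_memory": False}})
print(stable)
```

The resulting dict would then be passed through to `DataLoader(**kwargs)` wherever the dataloaders are built.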