torchmd / torchmd-net

Training neural network potentials
MIT License

persistent_workers=True and pin_memory=True is unstable #324

Closed sef43 closed 4 months ago

sef43 commented 4 months ago

When running with default settings I get a warning:

/scratch/users/sfarr/miniconda3/envs/tmd/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:448: The combination of `DataLoader(`pin_memory=True`, `persistent_workers=True`) and `Trainer(reload_dataloaders_every_n_epochs > 0)` can lead to instability due to limitations in PyTorch (https://github.com/pytorch/pytorch/issues/91252). We recommend setting `pin_memory=False` in this case.
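For context, the combination the warning refers to looks roughly like this (a toy sketch, not the actual torchmd-net data loading code):

```python
import torch
import lightning.pytorch as pl
from torch.utils.data import DataLoader, TensorDataset


class ToyDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        dataset = TensorDataset(torch.randn(64, 3))
        return DataLoader(
            dataset,
            batch_size=8,
            num_workers=4,            # worker processes, needed for persistent_workers
            pin_memory=True,          # flagged as unstable together with...
            persistent_workers=True,  # ...persistent workers and the reloading below
        )


# The third ingredient from the warning: reloading the dataloaders every epoch.
# trainer = pl.Trainer(reload_dataloaders_every_n_epochs=1)
```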

The training will run for a long time but eventually crash with something like this (using 4 GPUs):

File "/home/steve/miniconda3/envs/tmd/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1289, in _get_data
    raise RuntimeError('Pin memory thread exited unexpectedly')
RuntimeError: Pin memory thread exited unexpectedly
[rank: 1] Child process with PID 224779 terminated with code 1. Forcefully terminating all other processes to avoid zombies :zombie:

I can make it stable by setting the number of workers to zero together with #322, but the performance on 4 GPUs is slightly lower (4.8 it/s vs 5.0 it/s) than with num_workers=4, which eventually crashes.

It should be possible to set these options in the config YAML. I do not know the performance impact of either.
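For example, something along these lines (a hypothetical sketch; the key names are illustrative, not the actual torchmd-net config schema):

```python
from torch.utils.data import DataLoader


def make_train_dataloader(dataset, config):
    # Read the dataloader options from the (hypothetical) YAML-derived config dict
    # instead of hard-coding them.
    num_workers = config.get("num_workers", 0)
    return DataLoader(
        dataset,
        batch_size=config.get("batch_size", 32),
        num_workers=num_workers,
        pin_memory=config.get("pin_memory", False),
        # persistent workers only apply when there is at least one worker process
        persistent_workers=num_workers > 0,
    )
```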

sef43 commented 4 months ago

Note that I am using the ACE dataset type; the problem might be specific to it, I am not sure.

RaulPPelaez commented 4 months ago

pin_memory is a little borked in Lightning; it will crash if you have "enough" workers. We should probably just turn it off.
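A minimal sketch of that idea (illustrative only, not the actual torchmd-net dataloader code): keep persistent workers but leave memory pinning off.

```python
from torch.utils.data import DataLoader


def safe_dataloader(dataset, batch_size, num_workers):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        persistent_workers=num_workers > 0,
        # Work around https://github.com/pytorch/pytorch/issues/91252: pinned memory
        # combined with persistent workers (and dataloader reloading) can make the
        # pin-memory thread exit unexpectedly, so just leave it off.
        pin_memory=False,
    )
```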