torchmd / torchmd-net

Training neural network potentials
MIT License

[WIP] fix stride warning #234

Closed AntonioMirarchi closed 1 year ago

AntonioMirarchi commented 1 year ago

When you train on a single GPU, the following warning is raised:

UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according 
to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an 
error, but may impair performance.
grad.sizes() = [1, 64], strides() = [1, 1]
bucket_view.sizes() = [1, 64], strides() = [64, 1]
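
For context, the two stride tuples in the warning describe the same memory layout: the first dimension has size 1, so its stride is never used to step through memory. The sketch below illustrates this in plain Python. The helper names `contiguous_strides` and `same_layout` are hypothetical and are not part of PyTorch or of this repository; the warning itself suggests DDP compares the stride tuples literally, which is why it still fires.

```python
def contiguous_strides(sizes):
    """Row-major (C-contiguous) strides, in elements, for the given shape."""
    strides = []
    step = 1
    for size in reversed(sizes):
        strides.append(step)
        step *= max(size, 1)  # treat empty dims like size 1 when accumulating
    return tuple(reversed(strides))


def same_layout(sizes, strides_a, strides_b):
    """True if both stride tuples address the same elements.

    The stride of a size-1 dimension is never multiplied by a nonzero
    index, so it is ignored in the comparison (hypothetical helper).
    """
    return all(
        a == b
        for size, a, b in zip(sizes, strides_a, strides_b)
        if size > 1
    )


sizes = [1, 64]
grad_strides = (1, 1)                       # reported for grad
bucket_strides = contiguous_strides(sizes)  # (64, 1), as reported for bucket_view

print(bucket_strides)
print(same_layout(sizes, grad_strides, bucket_strides))
```

Under this reading, the mismatch for a `[1, 64]` parameter is cosmetic, which is consistent with the warning's own statement that it is not an error.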

To remove this warning, this PR changes the strategy passed to pl.Trainer() to "auto". An alternative would be to set the strategy based on the number of GPUs involved in the training.
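
A minimal sketch of the second option, selecting the strategy from the device count. The function name `choose_strategy` and the choice of `"ddp"` for the multi-device case are assumptions for illustration, not the code in this PR; `strategy="auto"` is assumed to be accepted by the installed Lightning version (it is in Lightning 2.x).

```python
def choose_strategy(num_devices: int, num_nodes: int = 1) -> str:
    """Pick a Lightning strategy string from the number of processes.

    Hypothetical helper: a single process does not need DDP, so let
    Lightning decide; anything larger uses DDP explicitly.
    """
    if num_devices * num_nodes > 1:
        return "ddp"
    return "auto"


# Hypothetical usage (requires pytorch_lightning, not imported here):
# trainer = pl.Trainer(
#     devices=num_devices,
#     num_nodes=num_nodes,
#     strategy=choose_strategy(num_devices, num_nodes),
# )

print(choose_strategy(1))      # single GPU
print(choose_strategy(4))      # four GPUs on one node
print(choose_strategy(1, 2))   # one GPU on each of two nodes
```

Keeping the decision in one small function makes it easy to verify that multi-GPU runs still get the same strategy as before, so only the single-GPU path changes.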

RaulPPelaez commented 1 year ago

If you see no performance regression when training with several GPUs, then it's great!

RaulPPelaez commented 1 year ago

@AntonioMirarchi is this ready to merge?

AntonioMirarchi commented 1 year ago

This fixes the warning only for single-GPU training; with multi-GPU training it is still there.

RaulPPelaez commented 1 year ago

I merged this one, thanks Antonio!