vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by PyTorch Lightning
MIT License

Grad strides do not match bucket view strides warning #306

Closed: lavoiems closed this issue 11 months ago

lavoiems commented 1 year ago

Describe the bug
I get the following warning when using the default imagenet byol.yaml config.

UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [2048, 512, 1, 1], strides() = [512, 1, 1, 1]
bucket_view.sizes() = [2048, 512, 1, 1], strides() = [512, 1, 512, 512] (Triggered internally at  ../torch/csrc/distributed/c10d/reducer.cpp:326.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

To Reproduce
Run: python main_pretrain.py --config-path scripts/pretrain/imagenet --config-name byol.yaml

Versions

>>> pytorch_lightning.__version__
'1.6.4'
>>> torch.__version__
'1.12.1+cu102'
>>> torchvision.__version__
'0.13.1+cu102'

I am using today's main branch of solo-learn.

vturrisi commented 1 year ago

Hi,

Thanks for reporting this. I believe it's related to the channels_last memory format and some inconsistency in our code. If you can, check whether disabling it removes the warning. In any case, it's a harmless warning (the code will run just fine, maybe a bit slower), but I'll look into it more thoroughly in the next few days.
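For context, the channels_last path in plain PyTorch converts both the model and the inputs, roughly like this (a generic sketch, not exactly what we do in the repo):

import torch
import torchvision

# Generic channels_last usage (sketch): convert both the model's parameters
# and the input batches to the channels_last memory format.
model = torchvision.models.resnet50()
model = model.to(memory_format=torch.channels_last)

images = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)
output = model(images)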

lavoiems commented 1 year ago

Hi Victor,

Forcing self.no_channel_last=True in base.py did not fix the issue.

Best,

vturrisi commented 1 year ago

Hey @lavoiems. I can confirm that the warning comes from the channels_last memory format. From my understanding of channels_last, we should convert both the model and the inputs. However, if I comment out:

if not cfg.performance.disable_channel_last:
    model = model.to(memory_format=torch.channels_last)

from main_pretrain.py, the warning is gone and the code still works (I would have assumed that a mismatch between the model's and the inputs' memory formats would throw an error). I've seen people attribute this to the data not being contiguous, but calling .contiguous() on either the model's parameters or the inputs did not remove the warning. Also, it seems like PyTorch 1.11 doesn't have this issue (can you check this as well, just in case? I might have mixed up versions when checking, so this might not be true).
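As a side note, the strides in the warning correspond to a 1x1 conv weight of shape [2048, 512, 1, 1]: [512, 1, 1, 1] is the contiguous layout and [512, 1, 512, 512] is the channels_last one, and DDP's reducer warns when the grad layout doesn't match the layout it captured at construction time. A standalone sketch (plain PyTorch, no DDP, shapes taken from the warning) to inspect the two layouts:

import torch
import torch.nn as nn

# Standalone sketch (not solo-learn code): a 1x1 conv with the same weight
# shape as in the warning, converted to channels_last like the model is.
conv = nn.Conv2d(512, 2048, kernel_size=1, bias=False)
conv = conv.to(memory_format=torch.channels_last)

x = torch.randn(2, 512, 7, 7).to(memory_format=torch.channels_last)
conv(x).sum().backward()

# If these two stride tuples differ, that is the same kind of layout
# mismatch DDP's reducer is warning about.
print("weight strides:", conv.weight.stride())
print("grad strides:  ", conv.weight.grad.stride())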

I will investigate this further in the next few days, but in the meantime you can disable the conversion by passing +performance.disable_channel_last=True on the command line.
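For the BYOL config above, that would be:

python main_pretrain.py --config-path scripts/pretrain/imagenet --config-name byol.yaml +performance.disable_channel_last=True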