huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0

[BUG] `test_optim` fail with pytorch 2.1.0 #1986

Closed GaetanLepage closed 1 year ago

GaetanLepage commented 1 year ago

Describe the bug When using the latest version of PyTorch (v2.1.0), many (all?) of the tests in test_optim.py fail.

To Reproduce Steps to reproduce the behavior:

  1. pytest tests/test_optim.py

Expected behavior The tests pass.

Screenshots End of the logs:

        # Run both optimizations in parallel
        for _i in range(20):
            optimizer.step(fn)
            optimizer_c.step(fn_c)
            #assert torch.equal(weight, weight_c)
            #assert torch.equal(bias, bias_c)
            torch_tc.assertEqual(weight, weight_c)
            torch_tc.assertEqual(bias, bias_c)
        # Make sure state dict wasn't modified
>       torch_tc.assertEqual(state_dict, state_dict_c)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 50 / 50 (100.0%)
E       Greatest absolute difference: 69.72129821777344 at index (3, 2) (up to 1e-05 allowed)
E       Greatest relative difference: 9.549407005310059 at index (9, 4) (up to 1.3e-06 allowed)
E
E       The failure occurred for item ['state'][0]['momentum']

tests/test_optim.py:94: AssertionError
=========================== short test summary info ============================
FAILED tests/test_optim.py::test_sgd[sgd] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[adamw] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[adam] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[nadam] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[adamax] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adabelief[adabelief] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rectified[radam] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rectified[radabelief] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adaother[adadelta] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adaother[adagrad] - AssertionError: Scalars are not close!
FAILED tests/test_optim.py::test_adafactor[adafactor] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lamb[lamb] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lamb[lambc] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[lars] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[larc] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[nlars] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[nlarc] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_madgrad[madgrad] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_madgrad[madgradw] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_novograd[novograd] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rmsprop[rmsprop] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rmsprop[rmsproptf] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adamp[adamp] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_sgdp[sgdp] - AssertionError: Tensor-likes are not close!
================== 24 failed, 6 passed, 5 deselected in 4.93s ==================


rwightman commented 1 year ago

hrmm, have not run the full tests on 2.1 but have been training with it, this is a bit concerning... will look

rwightman commented 1 year ago

It passes if I comment out the CPU part of the test and leave CUDA only (running on local machine w/ GPU). This is a pickle

EDIT: disregard that, it's failing on a different part of the test.

rwightman commented 1 year ago

So torch_tc.assertEqual(state_dict, state_dict_c) is the only part failing, and it seems the behaviour of

    state_dict_c = deepcopy(optimizer.state_dict())
    optimizer_c.load_state_dict(state_dict_c)

changed, such that load_state_dict possibly cloned the tensors in the past and no longer does? The PyTorch optim tests removed that check line...
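To illustrate the suspected mechanism without depending on any particular PyTorch version, here is a minimal pure-Python analogy. `ToyOptimizer` and its `deep` flag are hypothetical stand-ins, not the PyTorch API: `deep=True` mimics a `load_state_dict` that clones what it is given, `deep=False` mimics one that keeps references to the caller's buffers.

```python
from copy import deepcopy


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer; this is not the PyTorch API."""

    def __init__(self):
        self.state = {0: {"momentum": [0.0, 0.0]}}

    def state_dict(self):
        return {"state": self.state}

    def load_state_dict(self, state_dict, deep):
        # deep=True mimics cloning on load; deep=False copies only the outer
        # dict, so the inner buffers stay shared with the caller's dict.
        state_dict = deepcopy(state_dict) if deep else dict(state_dict)
        self.state = state_dict["state"]

    def step(self):
        # In-place update, like an optimizer mutating its momentum buffer.
        buf = self.state[0]["momentum"]
        for i in range(len(buf)):
            buf[i] += 1.0


def run(deep):
    source = ToyOptimizer()
    reference = deepcopy(source.state_dict())  # what the test compares against
    loaded = deepcopy(source.state_dict())     # what is handed to the optimizer
    clone = ToyOptimizer()
    clone.load_state_dict(loaded, deep=deep)
    clone.step()
    return loaded == reference  # True if `loaded` was left untouched


print(run(deep=True))   # True
print(run(deep=False))  # False
```

With the reference-keeping variant, every `step()` also mutates the dict that was passed in, so a later "state dict wasn't modified" comparison fails even though the optimizer itself is behaving correctly.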

rwightman commented 1 year ago

Sorted. The params are no longer deepcopied on the optimizer load_state_dict call, so the tests needed to be changed slightly.
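The exact change made to the tests is not shown in this thread. As a hedged sketch of one way to write such a check so that it does not depend on whether load_state_dict copies its input, the snippet below keeps a private deepcopy as the reference and hands a second deepcopy to the optimizer. It uses `torch.optim.SGD` rather than timm's optimizers, and the helper names `make_optimizer` and `train_step` are made up for illustration.

```python
from copy import deepcopy

import torch

torch.manual_seed(0)


def make_optimizer(param):
    # Plain torch SGD with momentum, so there is per-parameter state to compare.
    return torch.optim.SGD([param], lr=0.1, momentum=0.9)


def train_step(optimizer, param):
    optimizer.zero_grad()
    (param ** 2).sum().backward()
    optimizer.step()


weight = torch.randn(10, 5, requires_grad=True)
optimizer = make_optimizer(weight)
train_step(optimizer, weight)  # populate the momentum buffer

# Second parameter/optimizer pair that should track the first one exactly.
weight_c = weight.detach().clone().requires_grad_(True)
optimizer_c = make_optimizer(weight_c)

# `reference` is never handed to an optimizer, so nothing can alias it.
reference = deepcopy(optimizer.state_dict())
optimizer_c.load_state_dict(deepcopy(reference))

# Run both optimizations in parallel and check the weights stay in sync.
for _ in range(20):
    train_step(optimizer, weight)
    train_step(optimizer_c, weight_c)
    torch.testing.assert_close(weight, weight_c)

# The snapshot still holds the old buffer, while the live state has moved on.
saved = reference["state"][0]["momentum_buffer"]
live = optimizer.state_dict()["state"][0]["momentum_buffer"]
print(torch.equal(saved, live))
```

Because the dict passed to load_state_dict is a throwaway copy, it does not matter whether the optimizer clones it or keeps references to its tensors.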

GaetanLepage commented 1 year ago

Thank you very much for this very quick fix :)