huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0

[BUG] `test_optim` fails with PyTorch 2.1.0 #1986

Closed GaetanLepage closed 11 months ago

GaetanLepage commented 11 months ago

Describe the bug When using the latest version of PyTorch (v2.1.0), many (all?) tests from test_optim.py fail.

To Reproduce Steps to reproduce the behavior:

  1. pytest tests/test_optim.py

Expected behavior The tests pass.

Screenshots End of the logs:

        # Run both optimizations in parallel
        for _i in range(20):
            optimizer.step(fn)
            optimizer_c.step(fn_c)
            #assert torch.equal(weight, weight_c)
            #assert torch.equal(bias, bias_c)
            torch_tc.assertEqual(weight, weight_c)
            torch_tc.assertEqual(bias, bias_c)
        # Make sure state dict wasn't modified
>       torch_tc.assertEqual(state_dict, state_dict_c)
E       AssertionError: Tensor-likes are not close!
E
E       Mismatched elements: 50 / 50 (100.0%)
E       Greatest absolute difference: 69.72129821777344 at index (3, 2) (up to 1e-05 allowed)
E       Greatest relative difference: 9.549407005310059 at index (9, 4) (up to 1.3e-06 allowed)
E
E       The failure occurred for item ['state'][0]['momentum']

tests/test_optim.py:94: AssertionError
=========================== short test summary info ============================
FAILED tests/test_optim.py::test_sgd[sgd] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[adamw] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[adam] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[nadam] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adam[adamax] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adabelief[adabelief] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rectified[radam] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rectified[radabelief] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adaother[adadelta] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adaother[adagrad] - AssertionError: Scalars are not close!
FAILED tests/test_optim.py::test_adafactor[adafactor] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lamb[lamb] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lamb[lambc] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[lars] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[larc] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[nlars] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_lars[nlarc] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_madgrad[madgrad] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_madgrad[madgradw] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_novograd[novograd] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rmsprop[rmsprop] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_rmsprop[rmsproptf] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_adamp[adamp] - AssertionError: Tensor-likes are not close!
FAILED tests/test_optim.py::test_sgdp[sgdp] - AssertionError: Tensor-likes are not close!
================== 24 failed, 6 passed, 5 deselected in 4.93s ==================

rwightman commented 11 months ago

hrmm, have not run the full tests on 2.1 but have been training with it, this is a bit concerning... will look

rwightman commented 11 months ago

It passes if I comment out the CPU part of the test and leave CUDA only (running on local machine w/ GPU). This is a pickle

EDIT: disregard that, failing on a diff part of the test.

rwightman commented 11 months ago

So torch_tc.assertEqual(state_dict, state_dict_c) is the only part failing, and it seems the behaviour of

    state_dict_c = deepcopy(optimizer.state_dict())
    optimizer_c.load_state_dict(state_dict_c)

changed such that load_state_dict possibly cloned the tensors in the past and doesn't anymore? PyTorch optim tests removed that check line....

rwightman commented 11 months ago

Sorted, the params are no longer deep-copied on the optimizer load_state_dict call, so the tests needed to be changed slightly.
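
To illustrate the aliasing described above, here is a minimal pure-Python sketch. It does not use PyTorch or the timm tests: `ToyMomentumSGD`, its `copy_state` flag, and `snapshot_survives` are hypothetical names made up for this example, and plain lists stand in for tensors. It only shows why a "state dict wasn't modified" check can start failing once a load call stops copying what it is given.

```python
from copy import deepcopy


class ToyMomentumSGD:
    """Hypothetical stand-in for an optimizer; NOT the PyTorch or timm API."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        # A plain list plays the role of a momentum tensor.
        self.state = {"momentum": [0.0, 0.0]}

    def state_dict(self):
        # Hands out references to the live buffers, not copies.
        return {"state": self.state}

    def load_state_dict(self, state_dict, copy_state):
        # copy_state=True: assumed "old" behaviour (state is deep-copied on load).
        # copy_state=False: assumed "new" behaviour (loaded buffers are reused).
        loaded = deepcopy(state_dict) if copy_state else state_dict
        self.state = loaded["state"]

    def step(self, grads):
        buf = self.state["momentum"]
        for i, g in enumerate(grads):
            # In-place update, so anything aliasing `buf` sees the change.
            buf[i] = self.momentum * buf[i] + g


def snapshot_survives(copy_state):
    """Return True if the dict passed to load_state_dict is unchanged after a step."""
    source = ToyMomentumSGD()
    source.step([1.0, -2.0])

    reference = deepcopy(source.state_dict())  # never handed to an optimizer
    loaded = deepcopy(source.state_dict())     # handed to the clone below

    clone = ToyMomentumSGD()
    clone.load_state_dict(loaded, copy_state=copy_state)
    clone.step([0.5, 0.5])

    return loaded == reference


print(snapshot_survives(copy_state=True))   # True
print(snapshot_survives(copy_state=False))  # False
```

With copying on load, the dict that was passed in stays equal to the untouched reference. Without it, the optimizer's in-place step also rewrites the loaded dict, which matches the mismatch reported for `['state'][0]['momentum']` in the log. A test that wants an unmodified snapshot can pass a deep copy into the load call, or compare the two optimizers' current state dicts instead. Which of these the actual fix used is not stated in this thread.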

GaetanLepage commented 11 months ago

Thank you very much for this very quick fix :)