huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Multiple tests failing on v0.13.1 #2501

Open · antoche opened this issue 1 year ago

antoche commented 1 year ago

Describe the bug

The following tests are failing on my system due to pipeline output differences:

Example output:

self = <tests.models.test_models_unet_2d_condition.UNet2DConditionModelTests testMethod=test_lora_xformers_on_off>

    @unittest.skipIf(
        torch_device != "cuda" or not is_xformers_available(),
        reason="XFormers attention is only available with CUDA and `xformers` installed",
    )
    def test_lora_xformers_on_off(self):
        # enable deterministic behavior for gradient checkpointing
        init_dict, inputs_dict = self.prepare_init_args_and_inputs_for_common()

        init_dict["attention_head_dim"] = (8, 16)

        torch.manual_seed(0)
        model = self.model_class(**init_dict)
        model.to(torch_device)
        lora_attn_procs = create_lora_layers(model)
        model.set_attn_processor(lora_attn_procs)

        # default
        with torch.no_grad():
            sample = model(**inputs_dict).sample

            model.enable_xformers_memory_efficient_attention()
            on_sample = model(**inputs_dict).sample

            model.disable_xformers_memory_efficient_attention()
            off_sample = model(**inputs_dict).sample

>       assert (sample - on_sample).abs().max() < 1e-4
E       AssertionError: assert tensor(0.0001, device='cuda:0') < 0.0001
E        +  where tensor(0.0001, device='cuda:0') = <built-in method max of Tensor object at 0x7f471d854c20>()
E        +    where <built-in method max of Tensor object at 0x7f471d854c20> = tensor([[[[6.0350e-06, 1.4305e-06, 7.6145e-06,  ..., 8.1956e-07,\n           3.8743e-07, 3.9041e-06],\n          [3.9041...         [1.9222e-06, 4.9919e-06, 2.6152e-06,  ..., 8.6427e-07,\n           1.7229e-06, 1.0431e-06]]]], device='cuda:0').max
E        +      where tensor([[[[6.0350e-06, 1.4305e-06, 7.6145e-06,  ..., 8.1956e-07,\n           3.8743e-07, 3.9041e-06],\n          [3.9041...         [1.9222e-06, 4.9919e-06, 2.6152e-06,  ..., 8.6427e-07,\n           1.7229e-06, 1.0431e-06]]]], device='cuda:0') = <built-in method abs of Tensor object at 0x7f471cd75950>()
E        +        where <built-in method abs of Tensor object at 0x7f471cd75950> = (tensor([[[[-0.0620,  0.1676,  0.1127,  ..., -0.1332, -0.2842, -0.1374],\n          [ 0.1619, -0.0101, -0.3427,  ...,  0..., -0.1122,  0.3035],\n          [-0.1103,  0.0668,  0.0039,  ...,  0.1633, -0.0184,  0.1598]]]],\n       device='cuda:0') - tensor([[[[-0.0620,  0.1676,  0.1127,  ..., -0.1332, -0.2842, -0.1374],\n          [ 0.1619, -0.0101, -0.3427,  ...,  0..., -0.1122,  0.3035],\n          [-0.1103,  0.0668,  0.0039,  ...,  0.1633, -0.0184,  0.1598]]]],\n       device='cuda:0')).abs

tests/models/test_models_unet_2d_condition.py:441: AssertionError

(here the error is right on the threshold)

Another example:

self = <tests.pipelines.stable_diffusion_2.test_stable_diffusion_depth.StableDiffusionDepth2ImgPipelineFastTests testMethod=test_stable_diffusion_depth2img_default_case>

    def test_stable_diffusion_depth2img_default_case(self):
        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
        components = self.get_dummy_components()
        pipe = StableDiffusionDepth2ImgPipeline(**components)
        pipe = pipe.to(device)
        pipe.set_progress_bar_config(disable=None)

        inputs = self.get_dummy_inputs(device)
        image = pipe(**inputs).images
        image_slice = image[0, -3:, -3:, -1]

        assert image.shape == (1, 32, 32, 3)
        if torch_device == "mps":
            expected_slice = np.array([0.6071, 0.5035, 0.4378, 0.5776, 0.5753, 0.4316, 0.4513, 0.5263, 0.4546])
        else:
            expected_slice = np.array([0.6312, 0.4984, 0.4154, 0.4788, 0.5535, 0.4599, 0.4017, 0.5359, 0.4716])

>       assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3
E       AssertionError: assert 0.23728442068099975 < 0.001
E        +  where 0.23728442068099975 = <built-in method max of numpy.ndarray object at 0x7f46c743a150>()
E        +    where <built-in method max of numpy.ndarray object at 0x7f46c743a150> = array([0.05664153, 0.12669238, 0.07194659, 0.23728442, 0.18821609,\n       0.09381225, 0.08221953, 0.08237534, 0.03171295]).max
E        +      where array([0.05664153, 0.12669238, 0.07194659, 0.23728442, 0.18821609,\n       0.09381225, 0.08221953, 0.08237534, 0.03171295]) = <ufunc 'absolute'>((array([0.68784153, 0.37170762, 0.4873466 , 0.7160844 , 0.7417161 ,\n       0.55371225, 0.48391953, 0.61827534, 0.50331295], dtype=float32) - array([0.6312, 0.4984, 0.4154, 0.4788, 0.5535, 0.4599, 0.4017, 0.5359,\n       0.4716])))
E        +        where <ufunc 'absolute'> = np.abs
E        +        and   array([0.68784153, 0.37170762, 0.4873466 , 0.7160844 , 0.7417161 ,\n       0.55371225, 0.48391953, 0.61827534, 0.50331295], dtype=float32) = <built-in method flatten of numpy.ndarray object at 0x7f46c743acf0>()
E        +          where <built-in method flatten of numpy.ndarray object at 0x7f46c743acf0> = array([[0.68784153, 0.37170762, 0.4873466 ],\n       [0.7160844 , 0.7417161 , 0.55371225],\n       [0.48391953, 0.61827534, 0.50331295]], dtype=float32).flatten

tests/pipelines/stable_diffusion_2/test_stable_diffusion_depth.py:301: AssertionError

(here the error is way above the threshold)

These failures are hard to troubleshoot because there is no clear indication of what might be causing them. I have tried running the tests on different systems to find out whether they might be coming from hardware differences, but this is time-consuming and doesn't necessarily solve anything.

Some things I am noticing which I think could be improved:

Reproduction

Run tests as normal.
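
For example, the two failing tests shown above can be targeted directly with pytest (a sketch of the invocation; exact flags may differ on your setup):

    python -m pytest tests/models/test_models_unet_2d_condition.py::UNet2DConditionModelTests::test_lora_xformers_on_off
    python -m pytest tests/pipelines/stable_diffusion_2/test_stable_diffusion_depth.py::StableDiffusionDepth2ImgPipelineFastTests::test_stable_diffusion_depth2img_default_case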

Logs

See above.

System Info

Tried on multiple linux machines with various Nvidia GPUs.

Running from branch v0.13.1

diffusers version: 0.13.1
Platform: Linux-4.14.240-weta-20210804-x86_64-with-glibc2.27
Python version: 3.9.10
PyTorch version (GPU?): 1.12.0a0+git664058f (True)
Huggingface_hub version: 0.11.1
Transformers version: 4.26.0
Accelerate version: 0.13.1
xFormers version: 0.0.14.dev

patrickvonplaten commented 1 year ago

Hey @antoche,

Thanks for spotting this; most of these should be fixed by now. I think all of them were due to precision and OOM problems.

antoche commented 1 year ago

I've just tried on the 0.14.0 tag and I am still getting the same failures. Note they're not OOM errors.

patrickvonplaten commented 1 year ago

Yes, with PyTorch 2.0 being released we got a couple of new failures. I'll need to spend a day fixing all of those soon! But it seems like they are all minor precision errors.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

antoche commented 1 year ago

Just keeping this alive, as this issue is still relevant and the suggestions are still valid IMO. Having a view into, and access to, a CI system would already be a great help.

patrickvonplaten commented 1 year ago

Hey @antoche,

Could you open a PR with your suggested improvements?

Note:

1.) All tests run on GPU, but the latents are created on CPU because it improves precision (see the short sketch at the end of this comment).
2.) We can only do so much about small precision errors; see: https://huggingface.co/docs/diffusers/main/en/using-diffusers/reproducibility#create-reproducible-pipelines
3.) Regarding:

These tests' output is hard to read. I would recommend moving to torch.testing.assert_allclose or np.testing.assert_allclose, which are designed precisely for these types of tests.

TBH I wouldn't necessarily agree here; I like it when error messages are very detailed.
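
For illustration, a minimal sketch of point 1.) above (the shape and exact calls here are illustrative, not the test code itself): the noise is drawn from a CPU generator for reproducibility and only then moved to the GPU where the model runs.

    import torch

    # The generator lives on the CPU so the initial noise is identical across machines;
    # the tensor is then moved to the GPU where the model actually runs.
    generator = torch.Generator(device="cpu").manual_seed(0)
    latents = torch.randn((1, 4, 64, 64), generator=generator)
    latents = latents.to("cuda")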

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

pcuenca commented 1 year ago

We are going to try to review the tests soon, so removing the stale label.

antoche commented 1 year ago

Hi, just wanted to update this ticket to mention that the test_stable_diffusion_depth tests are now passing, but test_models_unet_2d_condition.py::UNet2DConditionModelTests::test_lora_xformers_on_off is still failing, and I'm now also seeing a similar failure with test_models_unet_3d_condition.py::UNet3DConditionModelTests::test_lora_xformers_on_off, with a very large error:

>       assert (sample - on_sample).abs().max() < 1e-4
E       AssertionError: assert tensor(0.5566, device='cuda:0') < 0.0001
E        +  where tensor(0.5566, device='cuda:0') = <built-in method max of Tensor object at 0x7f619cf12720>()
E        +    where <built-in method max of Tensor object at 0x7f619cf12720> = tensor([[[[[2.3253e-04, 1.5353e-03, 6.9697e-04,  ..., 1.9233e-04,\n            6.2592e-05, 1.4339e-04],\n           [1.0...       [1.1253e-04, 1.3334e-04, 4.8172e-04,  ..., 1.6327e-03,\n            4.1591e-04, 1.6979e-03]]]]], device='cuda:0').max
E        +      where tensor([[[[[2.3253e-04, 1.5353e-03, 6.9697e-04,  ..., 1.9233e-04,\n            6.2592e-05, 1.4339e-04],\n           [1.0...       [1.1253e-04, 1.3334e-04, 4.8172e-04,  ..., 1.6327e-03,\n            4.1591e-04, 1.6979e-03]]]]], device='cuda:0') = <built-in method abs of Tensor object at 0x7f619f7b5d60>()
E        +        where <built-in method abs of Tensor object at 0x7f619f7b5d60> = (tensor([[[[[ 1.1231e-01,  4.4056e-02, -1.6904e-02,  ...,  5.2370e-02,\n             2.0205e-02,  1.8632e-01],\n         ... [-3.4701e-01,  2.9261e-01, -5.1616e-01,  ...,  6.4207e-02,\n            -8.3352e-02, -3.3661e-01]]]]], device='cuda:0') - tensor([[[[[ 1.1208e-01,  4.5591e-02, -1.7601e-02,  ...,  5.2562e-02,\n             2.0268e-02,  1.8646e-01],\n         ... [-3.4712e-01,  2.9247e-01, -5.1568e-01,  ...,  6.2574e-02,\n            -8.3768e-02, -3.3491e-01]]]]], device='cuda:0')).abs

This was with diffusers-0.16.1 and xformers-0.0.20.

Regarding the third point, my comment is not about removing detail, but about increasing readability (and therefore maintainability and ease of contribution).

For example, replacing the assertion above with torch.testing.assert_close produces the following failure message:

>       torch.testing.assert_close(sample, on_sample, rtol=0, atol=1e-4)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 58082 / 65536 (88.6%)
E       Greatest absolute difference: 0.2881329655647278 at index (1, 2, 3, 17, 28) (up to 0.0001 allowed)
E       Greatest relative difference: 789.9036144578313 at index (2, 3, 3, 12, 31) (up to 0 allowed)

It is not only much more readable, but actually gives more information than the original check.
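
A similar rewrite of the NumPy-based slice check from the depth2img test above would look like this (a sketch only, reusing image_slice and expected_slice from that test and keeping the same 1e-3 tolerance):

    import numpy as np

    np.testing.assert_allclose(image_slice.flatten(), expected_slice, rtol=0, atol=1e-3)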

patrickvonplaten commented 1 year ago

torch.testing.assert_close indeed looks nice!