Xformers attention backwards pass not working in diffusers | RuntimeError: p.gQ_strideM() == grad_q.stride(1) INTERNAL ASSERT FAILED at "*/mem_eff_attention/attention_backward_generic.cu":180, #1261
Describe the bug
I've been trying to use xformers for training with DreamBooth, as well as for training Waifu Diffusion. In both cases it crashes before finishing even one iteration, with the error shown in the logs below.
Xformers appears to work perfectly fine for inference:
100%|███████████████████████████████████████████| 51/51 [00:03<00:00, 13.33it/s]
pipe.enable_xformers_memory_efficient_attention()
image = pipe(prompt).images[0]
100%|███████████████████████████████████████████| 51/51 [00:02<00:00, 19.49it/s]
and running the xformers memory-efficient attention benchmark in the same environment also completes with no issue.
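For reference, here is that inference check written out as a self-contained snippet (a minimal sketch; the model id and prompt are taken from the training command below, while the fp16/CUDA choices are assumptions):

import torch
from diffusers import StableDiffusionPipeline

# Same model as in the training command below; fp16 on CUDA is assumed.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of sks dog"

# Plain attention: ~13 it/s on the 3090.
image = pipe(prompt).images[0]

# With xformers memory-efficient attention: ~19 it/s, no errors (forward pass only).
pipe.enable_xformers_memory_efficient_attention()
image = pipe(prompt).images[0]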
I've tried many different Python/PyTorch/CUDA/xformers combinations, but nothing makes training work.
I have received another report of the same error occurring on a rented server with completely different hardware except for the 3090, which leads me to believe this might be a general issue with this specific GPU.
Sorry if this bug report is a bit of a mess; this issue has been haunting me for a few days.
Reproduction
Run the DreamBooth training script with
unet.set_use_memory_efficient_attention_xformers(True)
enabled, on an RTX 3090.
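In context, that call looks roughly like this in the training setup (a sketch, assuming the UNet is loaded the way train_dreambooth.py loads it):

from diffusers import UNet2DConditionModel

# Load the UNet the same way the training script does.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Forward passes run fine with this enabled; the crash happens later, in accelerator.backward(loss).
unet.set_use_memory_efficient_attention_xformers(True)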
Logs
Steps: 0%| | 0/800 [00:00<?, ?it/s]Traceback (most recent call last):
File "train_dreambooth.py", line 670, in <module>
main(args)
File "train_dreambooth.py", line 626, in main
accelerator.backward(loss)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/accelerate/accelerator.py", line 1007, in backward
loss.backward(**kwargs)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
return user_fn(self, *args)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/xformers/ops.py", line 369, in backward
) = torch.ops.xformers.efficient_attention_backward_cutlass(
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/torch/_ops.py", line 442, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: p.gQ_strideM() == grad_q.stride(1) INTERNAL ASSERT FAILED at "/home/runner/work/xfromers_builds/xfromers_builds/xformers/xformers/components/attention/csrc/cuda/mem_eff_attention/attention_backward_generic.cu":180, please report a bug to PyTorch.
Steps: 0%| | 0/800 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/bunny/miniconda3/envs/xformers/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/accelerate/commands/launch.py", line 910, in launch_command
simple_launcher(args)
File "/home/bunny/miniconda3/envs/xformers/lib/python3.8/site-packages/accelerate/commands/launch.py", line 400, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/bunny/miniconda3/envs/xformers/bin/python', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=./dog', '--output_dir=./dbout', '--instance_prompt=a photo of sks dog', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_train_steps=800']' returned non-zero exit status 1.
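For what it's worth, the failing op can also be exercised directly through xformers, outside of the training script. Something along these lines should hit the same cutlass backward kernel (a sketch with made-up shapes, so it may or may not trip the exact same stride assert):

import torch
import xformers.ops as xops

# fp16 tensors shaped like flattened UNet attention inputs (batch*heads, seq_len, head_dim); shapes are illustrative.
q = torch.randn(16, 4096, 40, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn(16, 4096, 40, device="cuda", dtype=torch.float16, requires_grad=True)
v = torch.randn(16, 4096, 40, device="cuda", dtype=torch.float16, requires_grad=True)

# Forward works, consistent with inference being fine ...
out = xops.memory_efficient_attention(q, k, v)

# ... while the backward pass is where efficient_attention_backward_cutlass asserts during training.
out.sum().backward()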
System Info
diffusers version: 0.7.2