huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

ControlNet pipeline failed on mac M1 with "Assertion error: torch not compiled with cuda enabled" #2645

Closed RELNO closed 1 year ago

RELNO commented 1 year ago

Describe the bug

ControlNet pipeline failed on mac M1 with "Assertion error: torch not compiled with cuda enabled"

I've managed to follow the M1/M2 instructions to run baseline SD diffusers as described here: https://huggingface.co/docs/diffusers/optimization/mps

However, other pipelines failed with AssertionError: Torch not compiled with CUDA enabled, despite moving the pipeline with .to("mps").

Reproduction

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth")

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", 
    controlnet=controlnet) 

pipe = pipe.to("mps")
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config)

pipe.enable_model_cpu_offload()
controlnet = controlnet.to("mps")
# can't run on mps
# pipe.enable_xformers_memory_efficient_attention()

prompt = "Space station, pro photography, RAW photo, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3"
n_prompt = "text, watermark, blurry, not sharp, not accurate"
user_seed = -1

# CPU generator seeded with user_seed
generator = torch.Generator(device="cpu").manual_seed(user_seed)

# open a local image file to use as the conditioning image
from PIL import Image
init_img = Image.open("test.jpg")

_ = pipe(
    prompt,
    init_img,
    num_inference_steps=1
)

generator = torch.Generator(device="cpu").manual_seed(user_seed)
pipe = pipe.to("mps")
pipe.enable_attention_slicing()

image = pipe(
    prompt=prompt,
    negative_prompt=n_prompt,
    width=800,
    height=800,
    image=init_img,
    generator=generator,
    num_inference_steps=30
    ).images[0]

Logs

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 _ = pipe(
      2     prompt,
      3     init_img,
      4     num_inference_steps=1
      5 )

File ~/git/venv/mlr/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__..decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/git/venv/mlr/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_controlnet.py:697, in StableDiffusionControlNetPipeline.__call__(self, prompt, image, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, cross_attention_kwargs, controlnet_conditioning_scale)
    694 do_classifier_free_guidance = guidance_scale > 1.0
    696 # 3. Encode input prompt
--> 697 prompt_embeds = self._encode_prompt(
    698     prompt,
    699     device,
    700     num_images_per_prompt,
    701     do_classifier_free_guidance,
    702     negative_prompt,
    703     prompt_embeds=prompt_embeds,
...
    222 if _cudart is None:
    223     raise AssertionError(
    224         "libcudart functions unavailable. It looks like you have a broken build?")

AssertionError: Torch not compiled with CUDA enabled

System Info

sayakpaul commented 1 year ago

Cc: @pcuenca

patrickvonplaten commented 1 year ago

Hey @RELNO,

Could you try replacing:

pipe.enable_model_cpu_offload()

with:

pipe.to("mps")

?

patrickvonplaten commented 1 year ago

I think model CPU offload doesn't work on MPS yet.
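
Something along these lines should work; this is an untested sketch that keeps the same models as your reproduction, with the only change being that enable_model_cpu_offload() is dropped and the whole pipeline is moved to mps:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth")

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet)

# instead of pipe.enable_model_cpu_offload(), which seems to assume a CUDA device here,
# keep the whole pipeline on the MPS device
pipe = pipe.to("mps")
pipe.enable_attention_slicing()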

RELNO commented 1 year ago

@patrickvonplaten thanks for looking into this. Trying .to("cuda") resulted in the same error (which I guess makes sense, given that MacBook Pros are not CUDA machines):


AssertionError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 pipe.to("cuda")

...
    222 if _cudart is None:
    223     raise AssertionError(
    224         "libcudart functions unavailable. It looks like you have a broken build?")

AssertionError: Torch not compiled with CUDA enabled
RELNO commented 1 year ago

I made a bit of progress with the following setup (it still fails at full inference, though):

pipe = pipe.to("mps")
pipe.enable_attention_slicing()

pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config)
generator = torch.Generator(device="cpu").manual_seed(-1)

# pipe.enable_model_cpu_offload() # no cpu offload to avoid above issue

With this I could run the one-step warm-up inference (to precook the MPS device) and also a single inference step to create an image:

image = pipe(
    prompt=prompt,
    negative_prompt=n_prompt,
    width=128,
    height=128,
    image=init_img,
    generator=generator,
    num_inference_steps=1
)

However, anything above 1 or 2 inference steps gets this error:

Cell In[6], line 1
----> 1 image = pipe(
      2     prompt=prompt,
      3     negative_prompt=n_prompt,
      4     width=128,
      5     height=128,
      6     image=init_img,
      7     generator=generator,
      8     num_inference_steps=5
      9     ).images[0]

File ~/GIT/venv/mlr/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__..decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_controlnet.py:779, in StableDiffusionControlNetPipeline.__call__(self, prompt, image, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, cross_attention_kwargs, controlnet_conditioning_scale)
    776     noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    778 # compute the previous noisy sample x_t -> x_t-1
--> 779 latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
    781 # call the callback, if provided
    782 if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
...
--> 465     rhos_c = torch.linalg.solve(R, b)
    467 if self.predict_x0:
    468     x_t_ = sigma_t / sigma_s0 * x - alpha_t * h_phi_1 * m0

NotImplementedError: The operator 'aten::_linalg_solve_ex.result' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
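
For reference, the temporary workaround the error message points at would look roughly like this (an untested sketch; the variable has to be set before torch initializes the MPS backend, so it goes at the very top of the script or is exported in the shell):

import os

# Route ops missing on MPS (such as aten::_linalg_solve_ex) to the CPU.
# Set this before importing torch so the MPS backend picks it up; expect those ops to run slower.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch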
takuma104 commented 1 year ago

I also have an M1 environment, so I tried it out (it was my first time using mps because I hadn't updated my OS), and it seems there may be issues with using UniPCMultistepScheduler in the mps environment as well. When I used the default scheduler (i.e., didn't specify anything), it seemed to work well, with a speed of around 3s/it (initially measured at 13.75s/it).

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
import PIL.Image as Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", 
    torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    controlnet=controlnet, 
    safety_checker=None).to("mps") 

pipe.enable_attention_slicing()

prompt = "Space station, pro photography, RAW photo, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3"

image = pipe(
    prompt,
    image=Image.open("some_depth_image.png"),
    num_inference_steps=10,
).images[0]

image.save('output.png')
RELNO commented 1 year ago

@takuma104 that's great to know, but I'm afraid it failed on my system with the following:

  image = pipe(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_controlnet.py", line 749, in __call__
    down_block_res_samples, mid_block_res_sample = self.controlnet(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/controlnet.py", line 461, in forward
    sample, res_samples = downsample_block(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/unet_2d_blocks.py", line 837, in forward
    hidden_states = attn(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/transformer_2d.py", line 265, in forward
    hidden_states = block(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/attention.py", line 291, in forward
    attn_output = self.attn1(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/cross_attention.py", line 205, in forward
    return self.processor(
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/cross_attention.py", line 593, in __call__
    attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
  File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/cross_attention.py", line 234, in get_attention_scores
    baddbmm_input = torch.empty(
RuntimeError: Invalid buffer size: 18.94 GB

I saw this issue in PyTorch as well: https://github.com/pytorch/pytorch/issues/78042. I wonder if that's related.

Can you share your specs?

System Info
- diffusers version: 0.14.0
- Platform: macOS-13.2.1-arm64-arm-64bit
- Python version: 3.9.6
- PyTorch version (GPU?): 1.13.1 (False)
- Huggingface_hub version: 0.13.1
- Transformers version: 4.26.1
- Accelerate version: 0.17.0
- xFormers version: not installed
- Using GPU in script?: NO
- Using distributed or parallel set-up in script?: NO

takuma104 commented 1 year ago

@RELNO Hmm, really? The only version difference between your environment and mine was Python (3.10.9). My Mac is a first-generation M1 MacBook Pro with 16GB of memory, and I am using a virtual environment with conda.

If the error is related to torch.empty(), it may be related to PR #2643. (However, since I'm not an expert in this field, this is just speculation.) Looking at https://github.com/pytorch/pytorch/issues/78042, it seems that torch.empty() is not very related to that issue, which appears to have been closed without a clear understanding of the root cause. That's tough.

pcuenca commented 1 year ago

Thanks @takuma104, I was about to make the same comment. The other schedulers work fine, so I'd recommend using DPMSolverMultistepScheduler, which is about as fast as the one that doesn't work.
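
The swap is a one-liner on top of the reproduction above (a minimal sketch, assuming pipe is the already-created ControlNet pipeline):

from diffusers import DPMSolverMultistepScheduler

# replace UniPCMultistepScheduler, whose linear-algebra solve op isn't implemented on MPS
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)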

And thanks a lot @RELNO for raising these issues. We'll look into how to improve the problems surfaced here (the UniPC scheduler failing on mps and model CPU offload not working on mps).

These reports help improve the quality of diffusers; we appreciate them a lot. Sorry for the trouble!

RELNO commented 1 year ago

@pcuenca thank you, happy to help. @takuma104 I've set up a venv with Python 3.10.9 to test whether that's the issue (and freshly installed all other packages), but I'm afraid it's still failing with:

    746 latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
    747 latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
--> 749 down_block_res_samples, mid_block_res_sample = self.controlnet(
    750     latent_model_input,
    751     t,
    752     encoder_hidden_states=prompt_embeds,
...
    236     )
    237     beta = 0
    238 else:

RuntimeError: Invalid buffer size: 18.94 GB
RELNO commented 1 year ago

I believe it is now solved (@pcuenca it might be worth adding this to the docs as well; happy to open a PR if the docs are on GitHub):

Adding generator = torch.Generator(device="cpu").manual_seed(-1) and pipe(..., generator=generator) to @takuma104's code resolved the issue. Got it to render 1200x600 at 3.84/it on an MBP M1 16GB.

Thank you all for your help!

Full code for reference:


from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
import PIL.Image as Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", 
    torch_dtype=torch.float16)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    controlnet=controlnet, 
    safety_checker=None).to("mps") 

pipe.enable_attention_slicing()
generator = torch.Generator(device="cpu").manual_seed(-1)
prompt = "Space station, pro photography, RAW photo, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3"

image = pipe(
    prompt,
    width=1200,
    height=600,
    generator=generator,
    image=Image.open("test.jpg"),
    num_inference_steps=30,
).images[0]

image.save('output.png')
sayakpaul commented 1 year ago

I think it makes sense to keep this open until https://github.com/huggingface/diffusers/issues/2645#issuecomment-1466818348 is addressed.

Also, thanks so much @RELNO for your investigations and @takuma104 for your help :)

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ohmygenie commented 1 year ago

Anything discussed here no longer works if you're trying to run SD in your own customised program (rather than AUTOMATIC1111 or the native app versions). The blocker is this error whenever you use MPS on a MacBook M1/M2 device:

NotImplementedError: The operator 'aten::index.Tensor' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

The only option is to use the CPU as the device (which is very slow) or to build your own native SD web app like InvokeAI or AUTOMATIC1111, which is not feasible for small projects.

Please let us know if there's any other workaround. Thanks!