Cc: @pcuenca
Hey @RELNO,
Could you try replacing:
pipe.enable_model_cpu_offload()
with:
pipe.to("mps")
?
I think model CPU offload doesn't work yet on MPS.
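A minimal sketch of the suggestion above, assuming the rule stated in this thread (model CPU offload only on CUDA, a plain `.to(device)` everywhere else). The helper names `choose_device`, `supports_model_cpu_offload`, and `place_pipeline` are hypothetical and not part of diffusers; only `pipe.to(...)` and `pipe.enable_model_cpu_offload()` come from the thread.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def supports_model_cpu_offload(device: str) -> bool:
    # Assumption taken from this thread: offload is only appropriate for CUDA.
    return device == "cuda"


def place_pipeline(pipe, device: str):
    """Enable offload where it is supported, otherwise move the pipeline."""
    if supports_model_cpu_offload(device):
        pipe.enable_model_cpu_offload()
        return pipe
    return pipe.to(device)
```

With PyTorch installed, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, so on an M1 machine `choose_device(...)` would be expected to return `"mps"`.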
@patrickvonplaten thanks for looking into this. device.to('cuda') resulted in the same error (which I guess makes sense, given that MBPs are not CUDA machines):
AssertionError Traceback (most recent call last)
Cell In[3], line 1
----> 1 pipe.to("cuda")
...
222 if _cudart is None:
223 raise AssertionError(
224 "libcudart functions unavailable. It looks like you have a broken build?")
AssertionError: Torch not compiled with CUDA enabled
I made a bit of progress with the following setup (it still fails at full inference, though):
pipe = pipe.to("mps")
pipe.enable_attention_slicing()
pipe.scheduler = UniPCMultistepScheduler.from_config(
pipe.scheduler.config)
generator = torch.Generator(device="cpu").manual_seed(-1)
# pipe.enable_model_cpu_offload() # no cpu offload to avoid above issue
With this I could run the one-step warm-up pass (to prime the MPS device) and also create an image with 1 inference step:
image = pipe(
prompt=prompt,
negative_prompt=n_prompt,
width=128,
height=128,
image=init_img,
generator=generator,
num_inference_steps=1
)
However, anything above 1 or 2 inference steps raises this error:
Cell In[6], line 1
----> 1 image = pipe(
2 prompt=prompt,
3 negative_prompt=n_prompt,
4 width=128,
5 height=128,
6 image=init_img,
7 generator=generator,
8 num_inference_steps=5
9 ).images[0]
File ~/GIT/venv/mlr/lib/python3.9/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_controlnet.py:779, in StableDiffusionControlNetPipeline.__call__(self, prompt, image, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, callback, callback_steps, cross_attention_kwargs, controlnet_conditioning_scale)
776 noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
778 # compute the previous noisy sample x_t -> x_t-1
--> 779 latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
781 # call the callback, if provided
782 if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
...
--> 465 rhos_c = torch.linalg.solve(R, b)
467 if self.predict_x0:
468 x_t_ = sigma_t / sigma_s0 * x - alpha_t * h_phi_1 * m0
NotImplementedError: The operator 'aten::_linalg_solve_ex.result' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
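The temporary fix named in the error message can be applied from Python rather than the shell. This is a minimal sketch, assuming the variable needs to be set before `torch` is first imported in the process (the usual advice, but not confirmed in this thread); `mps_fallback_enabled` is a hypothetical helper for checking the setting.

```python
import os

# Assumption: this must run before the first `import torch` in the process.
# setdefault keeps any value the user already exported in the shell.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def mps_fallback_enabled() -> bool:
    """Report whether the CPU fallback for unsupported MPS ops is requested."""
    return os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1"

# import torch  # import torch only after the variable is set
```

As the error message warns, ops that fall back to the CPU will run slower than native MPS ops.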
I also have an M1 environment, so I tried it out (it was my first time using mps because I hadn't updated my OS), and it seems that there may be issues with using UniPCMultistepScheduler in the mps environment as well. When I used the default scheduler (i.e., didn't specify anything), it seemed to work well, with a speed of around 13.75s/it 3s/it.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
import PIL.Image as Image
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-depth",
torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
controlnet=controlnet,
safety_checker=None).to("mps")
pipe.enable_attention_slicing()
prompt = "Space station, pro photography, RAW photo, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3"
image = pipe(
prompt,
image=Image.open("some_depth_image.png"),
num_inference_steps=10,
).images[0]
image.save('output.png')
@takuma104 that's great to know, but I'm afraid it failed on my system with the following:
image = pipe(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_controlnet.py", line 749, in __call__
down_block_res_samples, mid_block_res_sample = self.controlnet(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/controlnet.py", line 461, in forward
sample, res_samples = downsample_block(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/unet_2d_blocks.py", line 837, in forward
hidden_states = attn(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/transformer_2d.py", line 265, in forward
hidden_states = block(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/attention.py", line 291, in forward
attn_output = self.attn1(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/cross_attention.py", line 205, in forward
return self.processor(
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/cross_attention.py", line 593, in __call__
attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
File "/Users/noyman/GIT/venv/mlr/lib/python3.9/site-packages/diffusers/models/cross_attention.py", line 234, in get_attention_scores
baddbmm_input = torch.empty(
RuntimeError: Invalid buffer size: 18.94 GB
I saw this issue in PyTorch as well: https://github.com/pytorch/pytorch/issues/78042. I wonder if that's related.
Can you share your specs?
System Info
diffusers version: 0.14.0
Platform: macOS-13.2.1-arm64-arm-64bit
Python version: 3.9.6
PyTorch version (GPU?): 1.13.1 (False)
Huggingface_hub version: 0.13.1
Transformers version: 4.26.1
Accelerate version: 0.17.0
xFormers version: not installed
Using GPU in script?: NO
Using distributed or parallel set-up in script?: NO
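For comparing environments like this, a rough sketch of collecting similar fields with only the standard library. `collect_env` and the package list are hypothetical choices for illustration, and the output is not guaranteed to match the format of the report above.

```python
import platform
from importlib import metadata

# Distribution names assumed for illustration; adjust to your environment.
PACKAGES = ("diffusers", "torch", "huggingface-hub",
            "transformers", "accelerate", "xformers")


def collect_env(packages=PACKAGES) -> dict:
    """Return platform, Python, and installed package versions."""
    info = {
        "Platform": platform.platform(),
        "Python version": platform.python_version(),
    }
    for name in packages:
        try:
            info[f"{name} version"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[f"{name} version"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into a comment makes it easier to spot version differences between two machines.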
@RELNO Hmm, really? The only version difference between your environment and mine is Python (3.10.9). My Mac is a first-generation M1 MacBook Pro with 16GB of memory, and I am using a virtual environment with conda.
If the error is related to torch.empty(), it may be related to PR #2643. (However, since I'm not an expert in this field, this is just speculation.)
Looking at issue https://github.com/pytorch/pytorch/issues/78042, it seems that torch.empty() is not very related to the issue. The issue seems to have been closed without a clear understanding of the root cause. That's tough.
Thanks, @takuma104, I was going to comment the same. The other schedulers work fine, so I'd recommend the use of DPMSolverMultistepScheduler, which is about as fast as the one that doesn't work.
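A minimal sketch of that recommendation, assuming the only scheduler known to fail on mps is the one reported in this thread. `MPS_INCOMPATIBLE`, `scheduler_name_for`, and `apply_scheduler` are hypothetical names; the `from_config(pipe.scheduler.config)` pattern is the one already used earlier in the thread.

```python
# Schedulers reported as failing on MPS in this thread (not an exhaustive list).
MPS_INCOMPATIBLE = {"UniPCMultistepScheduler"}
FALLBACK = "DPMSolverMultistepScheduler"


def scheduler_name_for(device: str, preferred: str) -> str:
    """Swap in the fallback scheduler when the preferred one is known to fail on MPS."""
    if device == "mps" and preferred in MPS_INCOMPATIBLE:
        return FALLBACK
    return preferred


def apply_scheduler(pipe, device: str, preferred: str = "UniPCMultistepScheduler"):
    """Replace pipe.scheduler, reusing the existing scheduler config."""
    import diffusers  # imported lazily so the name helper works without diffusers

    scheduler_cls = getattr(diffusers, scheduler_name_for(device, preferred))
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    return pipe
```

Calling `apply_scheduler(pipe, "mps")` would then be expected to leave the pipeline on DPMSolverMultistepScheduler, while the same call with `"cuda"` keeps UniPCMultistepScheduler.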
And thanks a lot @RELNO for raising these issues. We'll see how to improve the following:
pipe.enable_model_cpu_offload is only appropriate for cuda devices, not mps.
These reports help improve the quality of diffusers, and we appreciate them a lot. Sorry for the trouble!
@pcuenca thank you, happy to help. @takuma104 I've set up a venv with Python 3.10.9 to test whether that's the issue (plus a fresh install of all other packages), but I'm afraid it's still failing with:
746 latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
747 latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
--> 749 down_block_res_samples, mid_block_res_sample = self.controlnet(
750 latent_model_input,
751 t,
752 encoder_hidden_states=prompt_embeds,
...
236 )
237 beta = 0
238 else:
RuntimeError: Invalid buffer size: 18.94 GB
I believe it is now solved (@pcuenca it might be worth adding this to the docs as well; happy to open a PR if the docs are on GitHub):
Adding generator = torch.Generator(device="cpu").manual_seed(-1) and pipe(..., generator=generator) to @takuma104's code resolved the issue. Got it to render 1200*600 @ 3.84/it on an MBP M1 16GB.
Thank you all for your help!
Full code for reference:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
import PIL.Image as Image
controlnet = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-depth",
torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5",
torch_dtype=torch.float16,
controlnet=controlnet,
safety_checker=None).to("mps")
pipe.enable_attention_slicing()
generator = torch.Generator(device="cpu").manual_seed(-1)
prompt = "Space station, pro photography, RAW photo, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3"
image = pipe(
prompt,
width=1200,
height=600,
generator=generator,
image=Image.open("test.jpg"),
num_inference_steps=30,
).images[0]
image.save('output.png')
I think it makes sense to keep this open until https://github.com/huggingface/diffusers/issues/2645#issuecomment-1466818348 is addressed.
Also, thanks so much @RELNO for your investigations and @takuma104 for your help :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Nothing discussed here works anymore if you're trying to run SD in your own customised program (as opposed to AUTOMATIC1111 or the native app versions). The blocker is this error, raised whenever you use MPS on an M1/M2 MacBook:
NotImplementedError: The operator 'aten::index.Tensor' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
The only options are to use the CPU as the device (which is very slow) or to build your own native SD web app like InvokeAI or AUTOMATIC1111, which is not feasible for small projects.
Please let us know if there's any other workaround. Thanks
Describe the bug
The ControlNet pipeline failed on a Mac M1 with "Assertion error: torch not compiled with cuda enabled"
I've managed to follow the M1/M2 instructions to run baseline SD diffusers as described here: https://huggingface.co/docs/diffusers/optimization/mps
However, other pipelines failed with Assertion error: torch not compiled with cuda enabled. This is despite using device.to(mps)
Reproduction
Logs
System Info
diffusers version: 0.14.0