invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

[bug]: Crash doing `.swap` when near VRAM limit #1362

Closed · JPPhoto closed this issue 1 year ago

JPPhoto commented 1 year ago

Is there an existing issue for this?

OS

Linux

GPU

cuda

VRAM

12GB

What happened?

Using the command line, I get an unexpected crash when generating from a prompt that uses the new `.swap` operator.

Screenshots

invoke> "headshot portrait of a baker wearing (a shirt).swap(an apron), insane quality, intricate, detailed, micro details, three-point warm volumetric lighting, hyperrealism photograph, vibrant color [border, frame, watermark, signature, text, border, framed, drawing, painting, sketch, rendering, bad teeth, fake eye, mutated, deformed, abnormal, asymmetrical, Pixar, collage]" -s 75 -S 2895816 -C 9.0 -I ../images_out/facex14.png -A ddim -f 0.8 -n 10
>> Parsed prompt to FlattenedPrompt:[Fragment:'headshot portrait of a baker wearing'@1.0, CrossAttentionControlSubstitute:([Fragment:'a shirt'@1.0]->[Fragment:'an apron'@1.0] ({'s_start': 0.0, 's_end': 0.2062994740159002, 't_start': 0.0, 't_end': 1.0}), Fragment:', insane quality, intricate, detailed, micro details, three-point warm volumetric lighting, hyperrealism photograph, vibrant color'@1.0]
>> loaded input image of size 512x704 from ../images_out/facex14.png
Generating:   0%|          | 0/10 [00:00<?, ?it/s]
>> Running DDIMSampler sampling starting at step 15 of 75 (60 new sampling steps)
Decoding image:   0%|          | 0/60 [00:00<?, ?it/s]
Generating:   0%|          | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/jovyan/work/InvokeAI/ldm/generate.py", line 459, in prompt2image
    results = generator.generate(
  File "/home/jovyan/work/InvokeAI/ldm/invoke/generator/base.py", line 90, in generate
    image = make_image(x_T)
  File "/home/jovyan/work/InvokeAI/ldm/invoke/generator/img2img.py", line 52, in make_image
    samples = sampler.decode(
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/sampler.py", line 365, in decode
    outs = self.p_sample(
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/ddim.py", line 58, in p_sample
    e_t = self.invokeai_diffuser.do_diffusion_step(
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/shared_invokeai_diffusion.py", line 86, in do_diffusion_step
    unconditioned_next_x, conditioned_next_x = self.apply_cross_attention_controlled_conditioning(x, sigma, unconditioning, conditioning, cross_attention_control_types_to_do)
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/shared_invokeai_diffusion.py", line 151, in apply_cross_attention_controlled_conditioning
    conditioned_next_x = self.model_forward_callback(x, sigma, edited_conditioning)
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/ddim.py", line 13, in <lambda>
    model_forward_callback = lambda x, sigma, cond: self.model.apply_model(x, sigma, cond))
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/ddpm.py", line 1441, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/models/diffusion/ddpm.py", line 2167, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/modules/diffusionmodules/openaimodel.py", line 806, in forward
    h = module(h, emb, context)
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/modules/diffusionmodules/openaimodel.py", line 88, in forward
    x = layer(x, context)
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 347, in forward
    x = block(x, context=context)
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 297, in forward
    return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)
  File "/home/jovyan/work/InvokeAI/ldm/modules/diffusionmodules/util.py", line 159, in checkpoint
    return func(*inputs)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 301, in _forward
    x += self.attn1(self.norm1(x.clone()))
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 276, in forward
    r = self.get_attention_mem_efficient(q, k, v)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 254, in get_attention_mem_efficient
    return self.einsum_op_cuda(q, k, v)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 250, in einsum_op_cuda
    return self.einsum_op_tensor_mem(q, k, v, mem_free_total / 3.3 / (1 << 20))
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 239, in einsum_op_tensor_mem
    return self.einsum_op_slice_dim0(q, k, v, q.shape[0] // div)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 210, in einsum_op_slice_dim0
    r[i:end] = self.einsum_lowest_level(q[i:end], k[i:end], v[i:end], dim=0, offset=i, slice_size=slice_size)
  File "/home/jovyan/work/InvokeAI/ldm/modules/attention.py", line 204, in einsum_lowest_level
    return einsum('b i j, b j d -> b i d', attention_slice, v)
  File "/home/jovyan/.conda/envs/invokeai/lib/python3.9/site-packages/torch/functional.py", line 378, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: einsum(): the number of subscripts in the equation (3) does not match the number of dimensions (2) for operand 0 and no ellipsis was given

>> Could not generate image.
>> Usage stats:
>>   0 image(s) generated in 1.27s
>>   Max VRAM used for this generation: 11.33G. Current VRAM utilization: 5.26G
>>   Max VRAM used since script start:  11.33G
Outputs:
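
For reference, the final RuntimeError comes from torch.einsum receiving a 2-D tensor where the equation `'b i j, b j d -> b i d'` expects a 3-D attention slice as its first operand. A minimal, self-contained repro of just that error (shapes chosen for illustration, not taken verbatim from the run above):

```python
import torch

# If the attention slice handed to einsum has lost its batch/head dimension
# (2-D instead of 3-D), torch.einsum raises exactly the error in the traceback.
attention_slice = torch.randn(5632, 77)  # 2-D: one dimension short (illustrative shape)
v = torch.randn(8, 77, 40)               # 3-D value tensor (illustrative shape)

try:
    torch.einsum('b i j, b j d -> b i d', attention_slice, v)
except RuntimeError as e:
    print(e)  # einsum(): the number of subscripts in the equation (3) does not match
              # the number of dimensions (2) for operand 0 and no ellipsis was given
```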

Additional context

I'm running the latest (11/3/2022) InvokeAI under WSL2 using a 3060 with 12GB VRAM and the complete v1.5 checkpoint.

Contact Details

No response

damian0815 commented 1 year ago

Thanks @JPPhoto for helping debug this. Uncommenting line 144 in cross_attention_control.py gives the following output:

in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 5632]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 88, 88]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 88, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 5632]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 5632]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 5632]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 5632]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 5632]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 88, 88]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 88, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 352]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 352, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 1408]) dim None
in wrangler with suggested_attention_slice shape torch.Size([8, 1408, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([4, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([4, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([4, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([4, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([2, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([2, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([2, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([2, 5632, 5632]) dim 0
in wrangler with suggested_attention_slice shape torch.Size([8, 5632, 77]) dim None
in wrangler with suggested_attention_slice shape torch.Size([2, 5632, 5632]) dim 0

before crashing. Note None vs dim 0: the bug happens because the cross-attention control code assumes a fixed slicing strategy (a slice saved with dim None will be requested again with dim None, and one saved with dim 0 again with dim 0), but the slicing strategy actually changes dynamically between calls to attention_slice_wrangler as free VRAM fluctuates. The original attention can therefore be saved unsliced (dim None), yet by the time the edited attention is applied the strategy has switched to slicing along dim 0, leaving the stored attention slices mismatched against the shapes the wrangler is expected to return.
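
To make the failure mode concrete, here is a minimal sketch (hypothetical names and small shapes, not the actual InvokeAI implementation) of a save pass that stores the attention map unsliced and an apply pass that later asks for dim-0 slices that were never stored in that form:

```python
import torch

# Sketch only: the save pass and the apply pass assume the same slicing
# strategy, but the memory-dependent slicing can change between them.
saved_attention = {}

def save_attention_slice(attn_slice, dim, offset):
    # Pass 1 (save): the attention op happened to run unsliced, so dim is None
    # and the whole (heads, tokens, tokens) map is stored under the key (None, 0).
    saved_attention[(dim, offset)] = attn_slice
    return attn_slice

def apply_saved_attention_slice(attn_slice, dim, offset):
    # Pass 2 (apply): free VRAM has dropped, so the op now slices along dim 0
    # and asks for (0, offset) -- a key/shape that was never stored.
    stored = saved_attention.get((dim, offset))
    if stored is None or stored.shape != attn_slice.shape:
        raise RuntimeError(
            f"stored slice {None if stored is None else tuple(stored.shape)} "
            f"does not match requested slice {tuple(attn_slice.shape)}"
        )
    return stored

save_attention_slice(torch.randn(8, 352, 352), dim=None, offset=0)   # saved unsliced
try:
    apply_saved_attention_slice(torch.randn(2, 352, 352), dim=0, offset=0)  # sliced request
except RuntimeError as e:
    print(e)
```

Presumably any fix needs the stored attention to be re-sliced on demand (or the slicing strategy pinned for the duration of a generation) so that both passes agree on how the map is partitioned.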