[Bug]: Error with SD15 Attention Injection when batch size = 2; When max pixels > 2**21

What happened?

I am using SD15. When the batch size on "Empty Latent Image" is set to 2, I get a CUDA error with torch.nn.functional.scaled_dot_product_attentionfrom attention_sharing.py and attention_pytorch.

When the batch size is 1 with SD15, there is no issue.

It's fine for SDXL models--- for both "SDXL Conv Injection" as well as "SDXL Attention Injection", there is no error with larger batch sizes.

Steps to reproduce the problem

Load the workflow.json
Run the workflow with SD15, and change the batch size to 2. You should get an error. If you reduce the batch size to 1, the error should go away.

What should have happened?

SD15 with transparency should have run with batch size 2, and produced 2 transparent images.

Commit where the problem happens

ComfyUI: 7718ada4eddf101d088b69e159011e4108286b5b ComfyUI-layerdiffuse: 6e4aeb2da78ba48c519367608a61bf47ea6249b4

Sysinfo

Linux, NVIDIA L4 from google cloud console:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4           Off  | 00000000:00:03.0 Off |                    0 |
| N/A   70C    P0    32W /  72W |   3598MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Console logs

got prompt
model_type EPS
Using xformers attention in VAE
Using xformers attention in VAE
loaded straight to GPU
Requested to load BaseModel
Loading 1 new model
Requested to load SD1ClipModel
Loading 1 new model
Requested to load BaseModel
Loading 1 new model
  0%|                                                              | 0/20 [00:00<?, ?it/s]
!!! Exception during processing!!! CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/home/ComfyUI/execution.py", line 186, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "/home/ComfyUI/execution.py", line 86, in get_output_data
    return_values = map_node_over_list(
  File "/home/ComfyUI/execution.py", line 78, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "/home/ComfyUI/nodes.py", line 2016, in sample
    return common_ksampler(
  File "/home/ComfyUI/nodes.py", line 1868, in common_ksampler
    samples = comfy.sample.sample(
  File "/home/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 22, in informative_sample
    raise e
  File "/home/ComfyUI/custom_nodes/ComfyUI-Impact-Pack/modules/impact/sample_error_enhancer.py", line 9, in informative_sample
    return original_sample(*args, **kwargs)  # This code helps interpret error messages that occur within exceptions but does not have any impact on other operations.
  File "/home/ComfyUI/comfy/sample.py", line 85, in sample
    samples = sampler.sample(
  File "/home/ComfyUI/comfy/samplers.py", line 1118, in sample
    return sample(
  File "/home/ComfyUI/comfy/samplers.py", line 972, in sample
    return cfg_guider.sample(
  File "/home/ComfyUI/comfy/samplers.py", line 934, in sample
    output = self.inner_sample(
  File "/home/ComfyUI/comfy/samplers.py", line 888, in inner_sample
    samples = sampler.sample(
  File "/home/ComfyUI/comfy/samplers.py", line 703, in sample
    samples = self.sampler_function(
  File "/home/.local/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ComfyUI/comfy/k_diffusion/sampling.py", line 175, in sample_euler
    denoised = model(x, sigma_hat * s_in, **extra_args)
  File "/home/ComfyUI/comfy/samplers.py", line 378, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
  File "/home/ComfyUI/comfy/samplers.py", line 845, in __call__
    return self.predict_noise(*args, **kwargs)
  File "/home/ComfyUI/comfy/samplers.py", line 848, in predict_noise
    return sampling_function(
  File "/home/ComfyUI/comfy/samplers.py", line 341, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
  File "/home/ComfyUI/comfy/samplers.py", line 248, in calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
  File "/home/ComfyUI/comfy/model_base.py", line 120, in apply_model
    model_output = self.diffusion_model(
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 1058, in forward
    h = forward_timestep_embed(
  File "/home/ComfyUI/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 64, in forward_timestep_embed
    x = layer(x, context, transformer_options)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ComfyUI/comfy/ldm/modules/attention.py", line 854, in forward
    x = block(x, context=context[i], transformer_options=transformer_options)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ComfyUI/custom_nodes/ComfyUI-layerdiffuse/lib_layerdiffusion/attention_sharing.py", line 253, in forward
    return func(self, x, context, transformer_options)
  File "/home/ComfyUI/comfy/ldm/modules/attention.py", line 691, in forward
    n = self.attn1(n, context=context_attn1, value=value_attn1)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ComfyUI/custom_nodes/ComfyUI-layerdiffuse/lib_layerdiffusion/attention_sharing.py", line 239, in forward
    x = optimized_attention(q, k, v, self.heads)
  File "/home/ComfyUI/comfy/ldm/modules/attention.py", line 406, in attention_xformers
    return attention_pytorch(q, k, v, heads, mask)
  File "/home/ComfyUI/comfy/ldm/modules/attention.py", line 435, in attention_pytorch
    out = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Prompt executed in 13.60 seconds

Workflow json file

workflow (2).json

Additional information

No response

huchenlei / ComfyUI-layerdiffuse