djkakadu opened this issue 3 days ago
That doesn't look fun.
The fatal error is:
CUDA out of memory. Tried to allocate 7.75 GiB. GPU
Are you using an older or workstation GPU?
I use a 2060 Super (8 GB).
Hmm, did you disable sysmem fallback at some point? Otherwise, make sure your NVIDIA driver is up to date. It should happily fall back to shared memory and just slow down if you exceed 8 GB of VRAM.
Oh, actually, I think bfloat16 is only supported natively from NVIDIA Ampere onwards. That's worth checking into.
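A quick way to check what your card reports (just a diagnostic sketch using plain PyTorch calls, nothing from this repo): Ampere and newer report compute capability 8.0+, while a 2060 Super (Turing) reports 7.5.

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()  # (8, 6) on e.g. a 3060; (7, 5) on a 2060 Super
    free, total = torch.cuda.mem_get_info()             # free / total VRAM in bytes
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Compute capability: {major}.{minor}")
    # Note: is_bf16_supported() can report True even where bf16 is only emulated,
    # so the compute capability above is the more telling number.
    print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")
    print(f"VRAM: {free / 2**30:.1f} GiB free / {total / 2**30:.1f} GiB total")
else:
    print("No CUDA device visible")
```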
It was changed from float16 to bfloat16 so it would work better on (phat) Macs, but that's not helpful if it breaks things for 1000- and 2000-series NVIDIA cards!
It was changed in two places in this commit: https://github.com/pinokiofactory/clarity-refiners-ui/commit/80a3efdcdc2dd87ddd635518df5b69fad23e1920

Change: "bfloat16"
to: "float16"
Hello, every image I try, I get this error:

```
❌ Error during processing:
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 858, in forward
    super().forward(*inputs)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 922, in forward
    return super().forward(*inputs) + inputs[0]
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 922, in forward
    return super().forward(*inputs) + inputs[0]
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\chain.py", line 249, in forward
    result = self._call_layer(layer, name, *intermediate_args)
OutOfMemoryError:
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\attentions.py", line 129, in forward
    return self._process_attention(
  File "C:\pinokio\api\clarity-refiners-ui.git\app\env\lib\site-packages\refiners\fluxion\layers\attentions.py", line 29, in scaled_dot_product_attention
    return _scaled_dot_product_attention(
CUDA out of memory. Tried to allocate 7.75 GiB. GPU

(CHAIN) SelfAttention(embedding_dim=320, num_heads=8, inner_dim=320, use_bias=False)
    ├── (PAR)
    │   └── Identity() (x3)
    ├── (DISTR)
    │   └── Linear(in_features=320, out_features=320, device=cuda:0, dtype=bfloat16) (x3)
    ├── >>> ScaledDotProductAttention(num_heads=8) | SD1UNet.Controlnet.DownBlocks.Chain_2.CLIPLCrossAttention.Chain_2.CrossAttentionBlock.Residual_1.SelfAttention.ScaledDotProductAttention
    └── Linear(in_features=320, out_features=320, device=cuda:0, dtype=bfloat16)

    0: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-6.97, max=7.16, mean=-0.05, std=1.51, norm=4841.88, grad=False)
    1: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-6.69, max=7.00, mean=-0.01, std=1.58, norm=5069.41, grad=False)
    2: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-2.86, max=2.66, mean=0.01, std=0.46, norm=1474.05, grad=False)

(RES) Residual()
    ├── LayerNorm(normalized_shape=(320,), device=cuda:0, dtype=bfloat16)
    └── >>> (CHAIN) SelfAttention(embedding_dim=320, num_heads=8, inner_dim=320, use_bias=False) | SD1UNet.Controlnet.DownBlocks.Chain_2.CLIPLCrossAttention.Chain_2.CrossAttentionBlock.Residual_1.SelfAttention
        ├── (PAR)
        │   └── Identity() (x3)
        ├── (DISTR)
        │   └── Linear(in_features=320, out_features=320, device=cuda:0, dtype=bfloat16) (x3)
        ├── ScaledDotProductAttention(num_heads=8)
        └── Linear(in_features=320, out_features=320, device=cuda:0, dtype=bfloat16)

    0: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-2.91, max=2.64, mean=0.01, std=0.54, norm=1737.50, grad=False)

(CHAIN) CrossAttentionBlock(embedding_dim=320, context_embedding_dim=768, context_key=clip_text_embedding, num_heads=8, use_bias=False)
    ├── >>> (RES) Residual() | SD1UNet.Controlnet.DownBlocks.Chain_2.CLIPLCrossAttention.Chain_2.CrossAttentionBlock.Residual_1 #1
    │   ├── LayerNorm(normalized_shape=(320,), device=cuda:0, dtype=bfloat16)
    │   └── (CHAIN) SelfAttention(embedding_dim=320, num_heads=8, inner_dim=320, use_bias=False)
    │       ├── (PAR) ...
    │       ├── (DISTR) ...
    │       ├── ScaledDotProductAttention(num_heads=8)
    │       └── Linear(in_features=320, out_features=320, device=cuda:0, dtype=bfloat16)
    ├── (RES) Residual() #2
    │   ├── LayerNorm(normalized_shape=(320,), device=cuda:0, dtype=bfloat16)
    │   ├── (PAR)

    0: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-1.66, max=1.97, mean=0.01, std=0.24, norm=780.96, grad=False)

(CHAIN)
    └── >>> (CHAIN) CrossAttentionBlock(embedding_dim=320, context_embedding_dim=768, context_key=clip_text_embedding, num_heads=8, use_bias=False) | SD1UNet.Controlnet.DownBlocks.Chain_2.CLIPLCrossAttention.Chain_2.CrossAttentionBlock
        ├── (RES) Residual() #1
        │   ├── LayerNorm(normalized_shape=(320,), device=cuda:0, dtype=bfloat16)
        │   └── (CHAIN) SelfAttention(embedding_dim=320, num_heads=8, inner_dim=320, use_bias=False) ...
        ├── (RES) Residual() #2
        │   ├── LayerNorm(normalized_shape=(320,), device=cuda:0, dtype=bfloat16)
        │   ├── (PAR) ...
        │   └── (CHAIN) Attention(embedding_dim=320, num_heads=8, key_embedding_dim=768, value_embedding_dim=768, inner_dim=320, use_bias=False) ...
        └── (RES) Residual() #3
            ├── LayerNorm(normalized_shape=(320,), device=cuda:0, dtype=bfloat16)

    0: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-1.66, max=1.97, mean=0.01, std=0.24, norm=780.96, grad=False)

(RES) CLIPLCrossAttention(channels=320)
    ├── (CHAIN) #1
    │   ├── GroupNorm(num_groups=32, eps=1e-06, channels=320, device=cuda:0, dtype=bfloat16)
    │   ├── Conv2d(in_channels=320, out_channels=320, kernel_size=(1, 1), device=cuda:0, dtype=bfloat16)
    │   ├── (CHAIN) StatefulFlatten(start_dim=2)
    │   │   ├── SetContext(context=flatten, key=sizes)
    │   │   └── Flatten(start_dim=2)
    │   └── Transpose(dim0=1, dim1=2)
    ├── >>> (CHAIN) | SD1UNet.Controlnet.DownBlocks.Chain_2.CLIPLCrossAttention.Chain_2 #2
    │   └── (CHAIN) CrossAttentionBlock(embedding_dim=320, context_embedding_dim=768, context_key=clip_text_embedding, num_heads=8, use_bias=False)
    │       ├── (RES) Residual() #1 ...

    0: Tensor(shape=(2, 16128, 320), dtype=bfloat16, device=cuda:0, min=-1.66, max=1.97, mean=0.01, std=0.24, norm=780.96, grad=False)

(CHAIN)
    ├── (SUM) ResidualBlock(in_channels=320, out_channels=320)
    │   ├── (CHAIN)
    │   │   ├── GroupNorm(num_groups=32, channels=320, device=cuda:0, dtype=bfloat16) #1
    │   │   ├── SiLU() #1
    │   │   ├── (SUM) RangeAdapter2d(channels=320, embedding_dim=1280) ...
    │   │   ├── GroupNorm(num_groups=32, channels=320, device=cuda:0, dtype=bfloat16) #2
    │   │   ├── SiLU() #2
    │   │   └── Conv2d(in_channels=320, out_channels=320, kernel_size=(3, 3), padding=(1, 1), device=cuda:0, dtype=bfloat16)
    │   └── Identity()
    ├── >>> (RES) CLIPLCrossAttention(channels=320) | SD1UNet.Controlnet.DownBlocks.Chain_2.CLIPLCrossAttention

    0: Tensor(shape=(2, 320, 144, 112), dtype=bfloat16, device=cuda:0, min=-8.19, max=6.47, mean=-0.09, std=0.69, norm=2236.34, grad=False)

(CHAIN) DownBlocks(in_channels=4)
    ├── (CHAIN) #1
    │   ├── Conv2d(in_channels=4, out_channels=320, kernel_size=(3, 3), padding=(1, 1), device=cuda:0, dtype=bfloat16)
    │   ├── (RES) Residual()
    │   │   ├── UseContext(context=controlnet, key=condition_tile)
    │   │   └── (CHAIN) ConditionEncoder() ...
    │   └── (PASS)
    │       ├── Conv2d(in_channels=320, out_channels=320, kernel_size=(1, 1), device=cuda:0, dtype=bfloat16)
    │       └── Lambda(_store_residual(x: torch.Tensor))
    ├── >>> (CHAIN) (x2) | SD1UNet.Controlnet.DownBlocks.Chain_2 #2
    │   ├── (SUM) ResidualBlock(in_channels=320, out_channels=320)

    0: Tensor(shape=(2, 320, 144, 112), dtype=bfloat16, device=cuda:0, min=-6.19, max=6.31, mean=-0.04, std=0.67, norm=2141.49, grad=False)

(PASS) Controlnet(name=tile, scale=0.6)
    ├── (PASS) TimestepEncoder()
    │   ├── UseContext(context=diffusion, key=timestep)
    │   ├── (CHAIN) RangeEncoder(sinusoidal_embedding_dim=320, embedding_dim=1280)
    │   │   ├── Lambda(compute_sinusoidal_embedding(x: jaxtyping.Int[Tensor, 'batch 1']) -> jaxtyping.Float[Tensor, 'batch 1 embedding_dim'])
    │   │   ├── Converter(set_device=False)
    │   │   ├── Linear(in_features=320, out_features=1280, device=cuda:0, dtype=bfloat16) #1
    │   │   ├── SiLU()
    │   │   └── Linear(in_features=1280, out_features=1280, device=cuda:0, dtype=bfloat16) #2
    │   └── SetContext(context=range_adapter, key=timestep_embedding_tile)
    ├── Slicing(dim=1, end=4)

    0: Tensor(shape=(2, 4, 144, 112), dtype=bfloat16, device=cuda:0, min=-3.31, max=4.03, mean=0.48, std=1.11, norm=433.98, grad=False)

(CHAIN) SD1UNet(in_channels=4)
    ├── >>> (PASS) Controlnet(name=tile, scale=0.6) | SD1UNet.Controlnet
    │   ├── (PASS) TimestepEncoder()
    │   │   ├── UseContext(context=diffusion, key=timestep)
    │   │   ├── (CHAIN) RangeEncoder(sinusoidal_embedding_dim=320, embedding_dim=1280) ...
    │   │   └── SetContext(context=range_adapter, key=timestep_embedding_tile)
    │   ├── Slicing(dim=1, end=4)
    │   ├── (CHAIN) DownBlocks(in_channels=4)
    │   │   ├── (CHAIN) #1 ...
    │   │   ├── (CHAIN) (x2) #2 ...
    │   │   ├── (CHAIN) #3 ...

    0: Tensor(shape=(2, 4, 144, 112), dtype=bfloat16, device=cuda:0, min=-3.31, max=4.03, mean=0.48, std=1.11, norm=433.98, grad=False)
```
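For what it's worth, the 7.75 GiB figure lines up exactly with materializing the full self-attention score matrix at this tile size: the 144×112 latent gives 16,128 tokens, and a (2, 8, 16128, 16128) matrix in a 16-bit dtype is about 7.75 GiB. Rough arithmetic (my own back-of-the-envelope check, not output from the app):

```python
# Size of the attention score matrix the failed allocation corresponds to.
batch, heads, tokens = 2, 8, 144 * 112    # 144x112 latent -> 16,128 tokens
bytes_per_elem = 2                        # bf16 (or fp16) is 2 bytes
score_matrix_bytes = batch * heads * tokens * tokens * bytes_per_elem
print(score_matrix_bytes / 2**30)         # ~7.75 GiB, matching the error message
```

Switching to float16 doesn't make that tensor any smaller, but I believe the memory-efficient scaled-dot-product-attention kernels don't accept bf16 on pre-Ampere cards, so with bfloat16 it falls back to the plain math path that materializes this matrix in full, which would explain why the float16 change (plus up-to-date drivers for sysmem fallback) helps on an 8 GB card.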