kijai / ComfyUI-CogVideoXWrapper


"mul_cuda" not implemented for 'Float8_e4m3fn' #22

Open rkfg opened 2 weeks ago

rkfg commented 2 weeks ago

I can't use the fp8 transformer on my 3090 Ti (24 GB). I tried PyTorch nightly (2.5.0) and the latest release (2.4.0); same error every time (screenshot: 2024-08-28_12-14-24). Since it seemingly works for many people, I assume it's an issue with my setup, but what should I look for? ComfyUI runs in a Docker container. I can run Flux in the same 8-bit format (e4m3fn) just fine. The CUDA version is 12.1; I tried 12.4 as well with the same outcome.

kijai commented 2 weeks ago

Not sure why you'd get that error; I originally got past it by casting the latent to bf16 for that specific function, since it can't work with fp8...

That said, there's not much benefit to using fp8 with 24 GB: it's not any faster, and sampling 49 frames easily fits in 24 GB at bf16. The VAE decode is the most memory-intensive part, and fp8 can't be used there.
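Rough arithmetic behind the "not much benefit" point (back-of-envelope only; it counts weights, not activations): the 5B transformer's weights differ by under 5 GiB between the two formats, while activations and the VAE decode dominate peak usage.

```python
# Back-of-envelope weight memory for a 5B-parameter transformer:
# bf16 stores 2 bytes per weight, fp8 stores 1 byte per weight.
params = 5e9
bf16_gib = params * 2 / 2**30  # ~9.3 GiB
fp8_gib = params * 1 / 2**30   # ~4.7 GiB
print(f"bf16: {bf16_gib:.1f} GiB, fp8: {fp8_gib:.1f} GiB")
```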

rkfg commented 2 weeks ago

Yes, I had some OOM issues when I tried rendering 48 frames, but probably for unrelated reasons (another program didn't clean up properly); I can generate just fine now. Still, it would be interesting to find the actual cause. I install plenty of dependencies while building the container (the base plus several extensions), but this error looks pretty low-level. So I checked the torch and diffusers versions; both look fine, i.e. they satisfy this extension's constraints.

kijai commented 2 weeks ago

Actually, which sampler were you using? I only tested it with DPM.

rkfg commented 2 weeks ago

Huh, I used DDIM because when I tried DPM a while ago it wasn't supported for text2video or such. I'll try DPM later, thank you for the idea!

kijai commented 2 weeks ago

> Huh, I used DDIM because when I tried DPM a while ago it wasn't supported for text2video or such. I'll try DPM later, thank you for the idea!

DPM is a lot better for the 5B model; it's just that temporal tiling isn't supported with it.

thezveroboy commented 2 weeks ago

I've got the same problem in the vid2vid pipeline on a 3060 Ti: without fp8_transformer enabled I run out of memory. Is there any way to make it work?

kijai commented 2 weeks ago

> I've got the same problem in the vid2vid pipeline on a 3060 Ti: without fp8_transformer enabled I run out of memory. Is there any way to make it work?

Did you try using the DPM sampler?

thezveroboy commented 2 weeks ago

> Did you try using the DPM sampler?

Yes, I did, but got an OOM too. I tried DPM + bf16 + fp8_transformer + enabled_tiled_vae; it works for t2v but not for v2v.

rkfg commented 2 weeks ago

DPM with fp8 indeed works! The model is very slim in this case; good to know I can run roughly 3x bigger models that way. To close this issue, I think DDIM should either be marked as unsupported with fp8 (with an explicit error explaining why) or fixed to work with it. Thank you!

thezveroboy commented 2 weeks ago

I tried DPM in v2v but got an OOM; t2v works fine with both DPM and DDIM.

```
!!! Exception during processing!!! Allocation on device
Traceback (most recent call last):
  File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 152, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 82, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "D:\ComfyUI_windows_portable\ComfyUI\execution.py", line 75, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-CogVideoXWrapper-main\nodes.py", line 184, in encode
    latents = vae.encode(image_chunk)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 1100, in encode
    h = self.encoder(x)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 733, in forward
    hidden_states = down_block(hidden_states, temb, None)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 409, in forward
    hidden_states = resnet(hidden_states, temb, zq)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 291, in forward
    hidden_states = self.conv1(hidden_states)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl_cogvideox.py", line 142, in forward
    inputs = F.pad(inputs, padding_2d, mode="constant", value=0)
  File "D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\torch\nn\functional.py", line 4552, in pad
    return torch._C._nn.pad(input, pad, mode, value)
torch.OutOfMemoryError: Allocation on device
```

Got an OOM, unloading all loaded models.

kijai commented 2 weeks ago

> I tried DPM in v2v but got an OOM; t2v works fine with both DPM and DDIM.

Make sure the nodes are updated; there should now be a chunk_size option in the encode node. Reduce that value to use less memory.
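A sketch of what a chunk_size knob typically does (an assumption about the node's behavior, not the wrapper's actual code): the input frames are encoded in slices, so peak VAE memory scales with the chunk rather than the full clip.

```python
# Hypothetical chunked-encode helper: splits the frame list into slices
# of at most chunk_size frames and encodes each slice separately.
def encode_in_chunks(frames, encode, chunk_size=16):
    latents = []
    for i in range(0, len(frames), chunk_size):
        latents.append(encode(frames[i:i + chunk_size]))
    return latents

# e.g. 49 frames with chunk_size=16 -> four encode calls of 16,16,16,1 frames
```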