kijai / ComfyUI-CogVideoXWrapper


Update 8 attempt to run on mac... #254

Open cchance27 opened 2 days ago

cchance27 commented 2 days ago

So with fp32... the sampler fails with the following (using the GGUF i2v model). It seems that somewhere you're doing an operation where the element types don't match (not in an autocast?). Sadly it's crashing out at the shader graph level, so it doesn't tell me where in the pipeline it's failing; I'll try to open Comfy in VS Code later to see if I can step through to where it's crashing.

Sampling 53 frames in 13 latent frames at 608x400 with 25 inference steps
  0%|                                                                                            | 0/25 [00:00<?, ?it/s](mpsFileLoc): /AppleInternal/Library/BuildRoots/259aefee-9a42-11ef-8b4c-6e654a286000/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.add' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/259aefee-9a42-11ef-8b4c-6e654a286000/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %7 = "mps.add"(%5, %arg2) : (tensor<2x512xf32>, tensor<512xbf16>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/259aefee-9a42-11ef-8b4c-6e654a286000/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.add' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/259aefee-9a42-11ef-8b4c-6e654a286000/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %7 = "mps.add"(%5, %arg2) : (tensor<2x512xf32>, tensor<512xbf16>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/259aefee-9a42-11ef-8b4c-6e654a286000/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
[1]    48943 abort      PYTORCH_ENABLE_MPS_FALLBACK=1 python main.py --enable-cors-header --listen
/opt/homebrew/anaconda3/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
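
For reference, a minimal sketch of what the MPSGraph error above appears to be complaining about: a module whose parameters end up in mixed dtypes (here an fp32 linear with a bf16 bias, which a GGUF load might plausibly leave behind). Both the repro and the cast-based workaround are assumptions, not the wrapper's actual code:

```python
import torch

# Assumes an Apple Silicon machine with the MPS backend available.
lin = torch.nn.Linear(512, 512, device="mps", dtype=torch.float32)
lin.bias.data = lin.bias.data.to(torch.bfloat16)  # mixed-dtype params

x = torch.randn(2, 512, device="mps", dtype=torch.float32)
# lin(x)  # -> error: 'mps.add' op requires the same element type ...

lin.bias.data = lin.bias.data.float()  # unify dtypes before dispatch
out = lin(x)  # runs cleanly once all operands share one element type
```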

When set to bf16 I get a different issue: it seems that on Macs linear doesn't support a non-float bias... which I'm guessing means that MPS doesn't support a bfloat16 linear bias, for some reason.

Sampling 53 frames in 13 latent frames at 608x400 with 25 inference steps
  0%|                                                                                            | 0/25 [00:00<?, ?it/s]
ERROR:root:!!! Exception during processing !!! MPS device does not support linear for non-float bias
!!! Exception during processing !!! MPS device does not support linear for non-float bias
ERROR:root:Traceback (most recent call last):
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/nodes.py", line 689, in process
    latents = model["pipe"](
              ^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/pipeline_cogvideox.py", line 750, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/custom_cogvideox_transformer_3d.py", line 685, in forward
    hidden_states, encoder_hidden_states = block(
                                           ^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/custom_cogvideox_transformer_3d.py", line 253, in forward
    norm_hidden_states, norm_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(
                                                                             ^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/diffusers/models/normalization.py", line 455, in forward
    shift, scale, gate, enc_shift, enc_scale, enc_gate = self.linear(self.silu(temb)).chunk(6, dim=1)
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: MPS device does not support linear for non-float bias

Traceback (most recent call last):
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/Volumes/2TB/AI/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/nodes.py", line 689, in process
    latents = model["pipe"](
              ^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/pipeline_cogvideox.py", line 750, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/custom_cogvideox_transformer_3d.py", line 685, in forward
    hidden_states, encoder_hidden_states = block(
                                           ^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/custom_cogvideox_transformer_3d.py", line 253, in forward
    norm_hidden_states, norm_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(
                                                                             ^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/diffusers/models/normalization.py", line 455, in forward
    shift, scale, gate, enc_shift, enc_scale, enc_gate = self.linear(self.silu(temb)).chunk(6, dim=1)
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: MPS device does not support linear for non-float bias
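
A hedged workaround sketch for the bf16 failure above: route the offending linear through fp32 on MPS and cast the result back. This only illustrates the idea; `MPSSafeLinear` is a hypothetical helper, not part of the wrapper:

```python
import torch

class MPSSafeLinear(torch.nn.Linear):
    """Fall back to an fp32 matmul on MPS when the bias dtype would be
    rejected by the MPS linear kernel; otherwise behave normally."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.device.type == "mps" and self.bias is not None and self.bias.dtype == torch.bfloat16:
            y = torch.nn.functional.linear(x.float(), self.weight.float(), self.bias.float())
            return y.to(x.dtype)  # cast back so downstream dtypes are unchanged
        return super().forward(x)
```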
cchance27 commented 2 days ago

Just checked: the fp32 crash happens when it calls emb = self.time_embedding(t_emb, timestep_cond)

at line 588 in custom_cogvideox_transformer_3d.py

What's odd is that I only see it passing in t_emb, which is float32, and timestep_cond is None...
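
If the embedding's internal weights are not fp32, one common diffusers-style fix is to cast t_emb to the embedding's parameter dtype before the call. A sketch under that assumption (the helper name is hypothetical):

```python
import torch

def call_time_embedding(time_embedding: torch.nn.Module, t_emb: torch.Tensor, timestep_cond=None) -> torch.Tensor:
    """Hedged sketch: cast t_emb to the embedding weights' dtype first (the
    same pattern diffusers pipelines use), so the add inside time_embedding
    sees matching element types on MPS."""
    target_dtype = next(time_embedding.parameters()).dtype
    return time_embedding(t_emb.to(dtype=target_dtype), timestep_cond)
```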

john2stai commented 1 day ago

I am not sure if you insisted on running fp32 or not, but I have had success running kijai's 5b i2v and t2v models at bf16 on my MacBook Pro. This is the first time I've been able to run the CogVideo workflow on my Mac since its release! :D

kijai commented 1 day ago

> I am not sure if you insisted on running fp32 or not, but I have had success running kijai's 5b i2v and t2v models at bf16 on my MacBook Pro. This is the first time I've been able to run the CogVideo workflow on my Mac since its release! :D

Could be because I just removed the autocast when using bf16 and fp16 too; I figured it's now only needed for fp8 and GGUF.
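
A minimal sketch of that idea, assuming a hypothetical quantization flag (the wrapper's real plumbing differs): wrap sampling in autocast only when the weights are quantized, and run plain bf16/fp16 without it.

```python
from contextlib import nullcontext

import torch

def sampling_context(device: torch.device, quantization: str | None):
    """Return an autocast context only for quantized weights (fp8 / GGUF);
    plain bf16/fp16 sampling runs without autocast."""
    if quantization in ("fp8_e4m3fn", "gguf"):
        return torch.autocast(device_type=device.type, dtype=torch.bfloat16)
    return nullcontext()

# usage sketch:
# with sampling_context(device, quantization):
#     noise_pred = transformer(hidden_states, t_emb)
```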

cchance27 commented 1 day ago

Well, I tried all three: fp16 isn't supported, fp32 gives the first crash above, and bf16 gives the other crash above on MPS.

By all means, I'm up for trying things.

This was all with the 1.5 GGUF i2v model in the dropdown.

kijai commented 1 day ago

How about the other models? 1.5 does not work with fp16 on any hardware currently.

cchance27 commented 1 day ago

Hadn't tried those; I'll try the i2v 5b GGUF with the different dtypes and see if that works, since the person above mentions they had it working on 5b (they didn't specify 1.5 or 1.0)...

I did a fresh pull of the repo just now and confirmed that 1.5 still hits the same errors as above; I'm waiting on the 5b to download, HF being slow.

kijai commented 1 day ago

> Hadn't tried those; I'll try the i2v 5b GGUF with the different dtypes and see if that works, since the person above mentions they had it working on 5b (they didn't specify 1.5 or 1.0)...
>
> I did a fresh pull of the repo just now and confirmed that 1.5 still hits the same errors as above; I'm waiting on the 5b to download, HF being slow.

Another thing to try is the "comfy" attention mode that's now available (that is, if you get past the temb part); Comfy has set it up to be more compatible in general.

cchance27 commented 1 day ago

So on 5b_I2V_GGUF_Q4_0 I don't get the really bad mps.add crash that bombs Python itself, but all three dtypes panic out of the sampler.

fp16 causes...

  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/custom_cogvideox_transformer_3d.py", line 256, in forward
    norm_hidden_states, norm_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(

TypeError: Trying to convert Float8_e4m3fn to the MPS backend but it does not have support for that dtype.

Not sure why it's trying fp8 when set to fp16?

fp32 gives

 File "/Volumes/2TB/AI/ComfyUI/venv/lib/python3.11/site-packages/diffusers/models/normalization.py", line 456, in forward
    hidden_states = self.norm(hidden_states) * (1 + scale)[:, None, :] + shift[:, None, :]
                    ^^^^^^^^^^^^^^^^^^^^^^^^

 RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and Float

bf16 gives

  File "/Volumes/2TB/AI/ComfyUI/custom_nodes/ComfyUI-CogVideoXWrapper/custom_cogvideox_transformer_3d.py", line 256, in forward
    norm_hidden_states, norm_encoder_hidden_states, gate_msa, enc_gate_msa = self.norm1(

RuntimeError: Promotion for Float8 Types is not supported, attempted to promote Float8_e4m3fn and BFloat16
cchance27 commented 1 day ago

> Another thing to try is the "comfy" attention mode that's now available (that is, if you get past the temb part); Comfy has set it up to be more compatible in general.

I'm using the (down)load GGUF node, and it seems you didn't add the attention modes to that one; it only has two, sdpa and sage.

kijai commented 1 day ago

These GGUF models actually use fp8 for some of the weights currently.
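
Given that, a hedged sketch of the kind of guard a Mac user might want: find any fp8 parameters left by a GGUF load and upcast them before running on MPS, which has no Float8 support. `upcast_fp8_for_mps` is a hypothetical helper, not the wrapper's API, and it trades the GGUF's memory savings for compatibility:

```python
import torch

def upcast_fp8_for_mps(module: torch.nn.Module, target: torch.dtype = torch.bfloat16) -> torch.nn.Module:
    """Upcast any Float8 parameters so the model can dispatch on MPS,
    which rejects Float8_e4m3fn/e5m2 tensors outright."""
    fp8_types = {torch.float8_e4m3fn, torch.float8_e5m2}
    for param in module.parameters():
        if param.dtype in fp8_types:
            param.data = param.data.to(target)
    return module
```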

cchance27 commented 1 day ago

Swapped to the non-GGUF (down)loader using THUDM/5b-i2v... and it seems to be going on bf16, albeit SLOW [07:02<2:49:03, 422.66s/it], but it didn't crash.

> These GGUF models actually use fp8 for some of the weights currently.

OK, I'll avoid using the GGUFs; I guess the fp8 will likely break things on Macs. That's disappointing. Let me get some other models downloaded as non-GGUF and see if they work.

Won't that cause issues for other GPUs that don't support fp8 as well, or does NVIDIA just autocast it internally while MPS doesn't?

kijai commented 1 day ago

Fp8 is very much limited by hardware support in any case. I have also now added support for torchao quantization, but I have no clue if it supports MPS at all.
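
For what it's worth, one way to probe fp8 support at runtime rather than guessing per backend; a sketch, with `supports_fp8` being a hypothetical helper:

```python
import torch

def supports_fp8(device: str) -> bool:
    """Probe whether the backend can even materialize an fp8 tensor; on MPS
    this raises (e.g. the TypeError about Float8_e4m3fn seen above)."""
    try:
        torch.zeros(1, dtype=torch.float8_e4m3fn, device=device)
        return True
    except (TypeError, RuntimeError):
        return False

# usage sketch: pick a quantized path only if supports_fp8("mps") is True
```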

cchance27 commented 1 day ago

HAHA, I don't think so, sadly.

Also just confirmed: 1.5 i2v bf16 is also working in the non-GGUF version, so it seems it's the various GGUFs blowing things up. With non-GGUF 1.5 I get... 4%|███▎ | 1/25 [02:18<55:31, 138.83s/it]

But yeah... even 1.5 is still pretty damn slow: not 422 s/it, but 138 s/it XD

Question: on 1.5 I see INFO:ComfyUI-CogVideoXWrapper.pipeline_cogvideox:Sampling 53 frames in 13 latent frames at 720x480 with 25 inference steps

while on 1.0 it showed INFO:ComfyUI-CogVideoXWrapper.pipeline_cogvideox:Sampling 49 frames in 13 latent frames at 720x480 with 25 inference steps

How come 1.5 showed more frames than the KSampler was set to, given I didn't change it from the default?

> Fp8 is very much limited by hardware support in any case. I have also now added support for torchao quantization, but I have no clue if it supports MPS at all.

Yeah, surprised to see this GGUF use fp8; isn't the standard for Q4 usually Q4+FP16?

kijai commented 1 day ago

> HAHA, I don't think so, sadly.
>
> Also just confirmed: 1.5 i2v bf16 is also working in the non-GGUF version, so it seems it's the various GGUFs blowing things up. With non-GGUF 1.5 I get... 4%|███▎ | 1/25 [02:18<55:31, 138.83s/it]
>
> But yeah... even 1.5 is still pretty damn slow: not 422 s/it, but 138 s/it XD
>
> Question: on 1.5 I see INFO:ComfyUI-CogVideoXWrapper.pipeline_cogvideox:Sampling 53 frames in 13 latent frames at 720x480 with 25 inference steps
>
> while on 1.0 it showed INFO:ComfyUI-CogVideoXWrapper.pipeline_cogvideox:Sampling 49 frames in 13 latent frames at 720x480 with 25 inference steps
>
> How come 1.5 showed more frames than the KSampler was set to, given I didn't change it from the default?

With 1.5 the first latent is noisy, so it's padded and later removed. One latent covers 4 frames, as it's packed temporally too.
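
To make the arithmetic concrete (this is my reading of the explanation above, so treat the exact bookkeeping as an assumption):

```python
# Worked numbers for the two log lines above (assumed bookkeeping):
latent_frames = 13
frames_1_0 = (latent_frames - 1) * 4 + 1   # 4x temporal packing -> 49
frames_1_5 = frames_1_0 + 4                # one padded latent (4 frames) -> 53
print(frames_1_0, frames_1_5)              # 49 53
```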