thoddnn / ComfyUI-MLX


flux.1-dev error and huge amount of memory #3

Closed: azrahello closed this issue 4 weeks ago

azrahello commented 1 month ago

Thanks for this project. I ran into some memory problems: with the Schnell workflow it uses a huge amount of memory and is slower than running without MLX. With the provided flux.1-dev workflow, I hit the error below.

2024-10-03 08:22:16,359 - root - INFO - got prompt
2024-10-03 08:22:20,165 - root - ERROR - !!! Exception during processing !!! Shapes (1,512) and (1,256) cannot be broadcast.
2024-10-03 08:22:20,167 - root - ERROR - Traceback (most recent call last):
  File "/Volumes/NewHome/ComfyUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/NewHome/ComfyUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/NewHome/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/Volumes/NewHome/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/NewHome/ComfyUI/custom_nodes/ComfyUI-MLX/__init__.py", line 208, in encode
    padded_tokens_t5[:, : t5_tokens.shape[1]] = t5_tokens[
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Shapes (1,512) and (1,256) cannot be broadcast.

2024-10-03 08:22:20,167 - root - INFO - Prompt executed in 3.81 seconds
2024-10-03 08:22:29,866 - root - INFO - got prompt
2024-10-03 08:22:33,773 - diffusionkit.mlx - INFO - Seed: 1015543225
2024-10-03 08:22:36,222 - diffusionkit.mlx.mmdit - INFO - Cached modulation_params for timesteps=array([1000, 950, 900, ..., 100, 50, 0], dtype=float32)
2024-10-03 08:22:36,223 - diffusionkit.mlx.mmdit - INFO - Cached modulation_params will reduce peak memory by 13.0 GB
2024-10-03 08:25:46,192 - root - INFO - Prompt executed in 196.32 seconds
2024-10-03 08:26:34,294 - root - INFO - got prompt
2024-10-03 08:26:36,655 - root - ERROR - !!! Exception during processing !!! Shapes (1,512) and (1,256) cannot be broadcast.
2024-10-03 08:26:36,655 - root - ERROR - Traceback (most recent call last):
  File "/Volumes/NewHome/ComfyUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/NewHome/ComfyUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/NewHome/ComfyUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/Volumes/NewHome/ComfyUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Volumes/NewHome/ComfyUI/custom_nodes/ComfyUI-MLX/__init__.py", line 208, in encode
    padded_tokens_t5[:, : t5_tokens.shape[1]] = t5_tokens[
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Shapes (1,512) and (1,256) cannot be broadcast.

2024-10-03 08:26:36,655 - root - INFO - Prompt executed in 2.36 seconds

Attached Workflow


{"last_node_id":84,"last_link_id":104,"nodes":[{"id":80,"type":"MLXDecoder","pos":{"0":-1457,"1":877},"size":{"0":229.20001220703125,"1":46},"flags":{},"order":4,"mode":0,"inputs":[{"name":"latent_image","type":"LATENT","link":97},{"name":"vae","type":"VAE","link":95}],"outputs":[{"name":"IMAGE","type":"IMAGE","links":[100],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"MLXDecoder"},"widgets_values":[]},{"id":82,"type":"MLXSaveImage","pos":{"0":-1180,"1":604},"size":{"0":315,"1":270},"flags":{},"order":5,"mode":0,"inputs":[{"name":"image","type":"IMAGE","link":100}],"outputs":[],"properties":{"Node name for S&R":"MLXSaveImage"},"widgets_values":["ComfyUI"]},{"id":33,"type":"EmptyLatentImage","pos":{"0":-2349,"1":940},"size":{"0":315,"1":106},"flags":{},"order":0,"mode":0,"inputs":[],"outputs":[{"name":"LATENT","type":"LATENT","links":[90],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"EmptyLatentImage"},"widgets_values":[768,1152,1]},{"id":78,"type":"MLXSampler","pos":{"0":-1858,"1":530},"size":{"0":315,"1":194},"flags":{},"order":3,"mode":0,"inputs":[{"name":"model","type":"MODEL","link":88},{"name":"positive","type":"CONDITIONING","link":104},{"name":"latent_image","type":"LATENT","link":90,"slot_index":2}],"outputs":[{"name":"LATENT","type":"LATENT","links":[97],"slot_index":0,"shape":3}],"properties":{"Node name for S&R":"MLXSampler"},"widgets_values":[2073252424,"randomize",20,0,1]},{"id":84,"type":"MLXClipTextEncoder","pos":{"0":-2452,"1":394},"size":{"0":400,"1":200},"flags":{},"order":2,"mode":0,"inputs":[{"name":"clip","type":"CLIP","link":103}],"outputs":[{"name":"CONDITIONING","type":"CONDITIONING","links":[104],"slot_index":0}],"properties":{"Node name for S&R":"MLXClipTextEncoder"},"widgets_values":["dranon in the sky"]},{"id":73,"type":"MLXLoadFlux","pos":{"0":-3104,"1":833},"size":{"0":511.4536437988281,"1":98},"flags":{},"order":1,"mode":0,"inputs":[{"name":"model_version","type":"SELECT","link":null,"slot_index":0}],"outputs":[{"name":"MODEL","type":"MODEL","links":[88],"slot_index":0,"shape":3},{"name":"VAE","type":"VAE","links":[95],"slot_index":1,"shape":3},{"name":"CLIP","type":"CLIP","links":[103],"slot_index":2,"shape":3}],"properties":{"Node name for S&R":"MLXLoadFlux"},"widgets_values":["argmaxinc/mlx-FLUX.1-dev"]}],"links":[[88,73,0,78,0,"MODEL"],[90,33,0,78,2,"LATENT"],[95,73,1,80,1,"VAE"],[97,78,0,80,0,"LATENT"],[100,80,0,82,0,"IMAGE"],[103,73,2,84,0,"CLIP"],[104,84,0,78,1,"CONDITIONING"]],"groups":[{"title":"MLX","bounding":[-3222,271,2527,1051],"color":"#3f789e","font_size":24,"flags":{}}],"config":{},"extra":{"ds":{"scale":0.8140274938684039,"offset":[3120.4918226626387,-220.16840198533816]}},"version":0.4}


thoddnn commented 1 month ago

Hello @azrahello, thank you for your feedback. I’ve just pushed a fix for the issue related to flux.1-dev.
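For readers hitting the same broadcast error: it comes from copying the T5 token sequence into a fixed-width padding buffer without clamping the copy length. A guard along these lines avoids it (a sketch with assumed names, not necessarily the committed fix):

import mlx.core as mx

# Sketch only: pad or truncate the T5 tokens to the buffer width
# instead of assuming the sequence always fits.
def pad_or_truncate_t5(t5_tokens: mx.array, max_len: int) -> mx.array:
    padded = mx.zeros((t5_tokens.shape[0], max_len), dtype=t5_tokens.dtype)
    n = min(t5_tokens.shape[1], max_len)
    padded[:, :n] = t5_tokens[:, :n]
    return padded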

Regarding the memory issue, I’ll need a bit more time to investigate.

Could you let me know what differences you're noticing in memory usage with ComfyUI-MLX compared to without it? Also, could you share details about the machine you're using?

azrahello commented 1 month ago

Thank you for the update! I ran some tests after updating. In ComfyUI I launched two generations changing only the seed, then restarted ComfyUI and generated two more with dev.1-q8-gguf. I know these are two different setups, but your version clearly performs better than the standard Black Forest Labs version.

However, I find it strange that DiffusionKit, with the same resolution, prompt, and seed, is significantly more efficient than both. I made a video showing all of this along with the memory consumption details, which might give you more insight. I apologize for not going into much technical detail; English is already a challenge for me.

The last generation was done using DiffusionKit with the parameters kept consistent. One more point to note: ComfyUI-MLX operates in float32, while DiffusionKit uses bfloat16.

Here are the generation times for 25 steps at 768x1152 resolution:

•   First generation (ComfyUI-MLX): 265.77 seconds
•   Second generation (ComfyUI-MLX): 269.56 seconds
•   First generation (dev.1-q8-gguf): 207.86 seconds
•   Second generation (dev.1-q8-gguf): 188.52 seconds

When I ran:

diffusionkit-cli --prompt "dragon in the sky" --output-path /Volumes/NewHome/mflux/img/diffusionkit_r0x_drako1012.png --seed 1012 --model-version argmaxinc/mlx-FLUX.1-dev --height 1152 --width 768 --steps 25

I achieved a generation time of 121.18 seconds.

These tests were conducted on an Apple M2 Ultra with 64GB of RAM.

Thank you for your help! https://youtu.be/53spHlVoFvw

thoddnn commented 4 weeks ago

Hey @azrahello, it should be much better now. I've just set w16 and a16 to True to lower the weight and activation precision.
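For context, these correspond to DiffusionKit's FluxPipeline flags for 16-bit weights and activations. Roughly along these lines (the exact signature may differ between DiffusionKit versions):

from diffusionkit.mlx import FluxPipeline

# 16-bit weights (w16) and activations (a16) roughly halve memory vs float32
pipeline = FluxPipeline(
    model_version="argmaxinc/mlx-FLUX.1-dev",
    w16=True,  # store weights in 16-bit
    a16=True,  # compute activations in 16-bit
)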

azrahello commented 4 weeks ago

@thoddnn I also made some modifications: I removed the save image node so the image can be sent to an upscaler, and I was trying to create an MLX encoder for inpainting. But I'm not sure how to upload the changes I made to GitHub.

thoddnn commented 4 weeks ago

@azrahello I’d love to help, but I’m not sure where you made your changes. Can you let me know what you're trying to do exactly?

If you’re just looking to use the image output from the MLX Decoder to upscale it without saving the image, I can add a simple PreviewImage node for you.

Does that work?

Or feel free to share a workflow that does the same thing without the MLX nodes, and I’ll take a look! :)

azrahello commented 4 weeks ago

@thoddnn In some ways, the core ComfyUI nodes, like LoadImage, PreviewImage, and SaveImage, work as expected. From my point of view, it's unnecessary to add another save image node that doesn't offer anything beyond what already exists. Additionally, an image output from the MLX decoder that can't be used with the existing nodes significantly limits this workflow. What I did was modify the MLXDecoder class to convert the MLX array into a NumPy array, so the core ComfyUI SaveImage and PreviewImage nodes can be used. The advantage is that the workflow becomes more open to further processing:


import mlx.core as mx
import numpy as np
import torch

class MLXDecoder:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"latent_image": ("LATENT",), "vae": ("VAE",)}}

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "decode"

    def decode(self, latent_image, vae):
        # Decode the latent and rescale from [-1, 1] to [0, 1]
        decoded = vae(latent_image)
        decoded = mx.clip(decoded / 2 + 0.5, 0, 1)

        # Force evaluation of the lazy MLX graph before converting
        mx.eval(decoded)

        # Convert the MLX output into a PyTorch tensor
        decoded_numpy = np.array(decoded)
        decoded_torch = torch.from_numpy(decoded_numpy).float()

        # Make sure the output is in the format ComfyUI expects (B, H, W, C)
        if decoded_torch.dim() == 3:
            decoded_torch = decoded_torch.unsqueeze(0)

        # Make sure the values are in the [0, 1] range
        decoded_torch = torch.clamp(decoded_torch, 0, 1)

        return (decoded_torch,)
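(The key point is that ComfyUI's IMAGE type is a float torch tensor shaped (B, H, W, C) with values in [0, 1], which is what SaveImage and PreviewImage expect, so converting once at the decoder boundary is enough.)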

Now I'm trying to figure out how to create a node that converts ComfyUI's LoadImage output into a latent image suitable for the MLX sampler, but without success, because I've messed everything up and now nothing works anymore! :P
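A minimal, hypothetical sketch of the direction I mean, assuming the VAE object from MLXLoadFlux exposes an encode method (which is not confirmed; DiffusionKit may only ship the decoder half):

import mlx.core as mx

class MLXEncoder:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"image": ("IMAGE",), "vae": ("VAE",)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "encode"

    def encode(self, image, vae):
        # ComfyUI images arrive as float torch tensors, (B, H, W, C) in [0, 1]
        image_mx = mx.array(image.cpu().numpy())
        # Undo the decoder's scaling: map [0, 1] back to [-1, 1]
        image_mx = image_mx * 2.0 - 1.0
        latent = vae.encode(image_mx)  # hypothetical method, may not exist
        mx.eval(latent)
        return (latent,)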

By the way, your latest update has been a game-changer for both speed and memory usage! Thanks!

thoddnn commented 4 weeks ago

You're absolutely right!

I've just updated the MLX Decoder node, so it now works smoothly with SaveImage, PreviewImage, and other basic nodes.

Closing the issue now—thanks for your feedback and contribution!

Hope to see a PR from you soon! :)

PS: Reinstall the nodes and it should be OK. I will add the MLX nodes to the Custom Nodes Manager to make them easier to install and update.
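For a typical git-based install, reinstalling amounts to refreshing the checkout (paths assumed):

cd ComfyUI/custom_nodes
rm -rf ComfyUI-MLX
git clone https://github.com/thoddnn/ComfyUI-MLX.git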