kijai / ComfyUI-MochiWrapper

Apache License 2.0

Why do inference results from this node differ from the original genmoai results? #42

Open zishen-ucap opened 4 weeks ago

zishen-ucap commented 4 weeks ago

Thank you very much for your open-source project. However, in my work I found that the results from the ComfyUI node you provided differ from the Mochi inference results of genmoai. I don't know where my setup went wrong.

@click.command()
@click.option("--prompt", required=True, help="Prompt for video generation.")
@click.option(
    "--negative_prompt", default="", help="Negative prompt for video generation."
)
@click.option("--width", default=848, type=int, help="Width of the video.")
@click.option("--height", default=480, type=int, help="Height of the video.")
@click.option("--num_frames", default=163, type=int, help="Number of frames.")
@click.option("--seed", default=42, type=int, help="Random seed.")
@click.option("--cfg_scale", default=4.5, type=float, help="CFG Scale.")
@click.option(
    "--num_steps", default=99, type=int, help="Number of inference steps."
)
@click.option("--model_dir", required=True, help="Path to the model directory.")

These are the parameters I set in https://github.com/genmoai/models/blob/main/demos/cli.py

[screenshot: ComfyUI node settings]

These are the settings in ComfyUI, and the prompt is the same

https://github.com/user-attachments/assets/461850ed-e465-4832-850f-153308717fb7

This is the result from genmoai

https://github.com/user-attachments/assets/41377785-43a5-476a-8f34-c6d473f18022

This is the result from ComfyUI

kijai commented 4 weeks ago

That first video is not loading for me.

zishen-ucap commented 4 weeks ago

That first video is not loading for me.

I have re-edited it, and the first video should load now. The prompt is: A stark white cat with piercing green eyes stealthily creeps along the glossy, dark wooden floorboards of an opulent European-style parlor. Intricate wainscoting and luxurious wallpaper frame the scene. In a heart-racing moment, the cat pounces with precision towards a unsuspecting brown mouse near an ornate, mahogany furniture piece. The high-definition footage captures every nuanced movement, from the flutter of the cat's whiskers to the mouse's flickering tail, as an antique grandfather clock ticks rhythmically in the background. The elegant room's vintage charm contrasts with the primal, instinctive encounter unfolding within it.

kijai commented 4 weeks ago

Ah, well, it's hard to tell where the difference comes from; it looks more like just a different seed/noise. Lots of things are slightly different from the original.

Also the step count you show in the parameters does not match at all?

zishen-ucap commented 4 weeks ago

This is a video with steps set to 50

https://github.com/user-attachments/assets/d0be3f38-95c1-43f0-8bb7-1a0af753c605

I don't think the number of steps is the main factor affecting the overall quality of the video. I set the steps to 50, matching the setting in ComfyUI, which seemed to only influence the details. At one point, I wondered if there might be an issue with the CLIP weights I downloaded, but the red panda video I generated appears to match the demo you provided.

https://github.com/user-attachments/assets/46b1a8c6-ad5a-4060-92b6-9484cdc6199b

kijai commented 4 weeks ago

One thing I see wrong is that you have fp16 selected; it should be bf16 for the model.
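
For context, the practical difference between the two half-precision formats is dynamic range rather than precision: fp16 overflows above ~65504, while bf16 keeps fp32's exponent range. A quick illustration in plain PyTorch (not code from either repo):

import torch

# fp16 has a maximum finite value of ~65504, so large activations overflow
# to inf, while bf16 keeps fp32's exponent range at reduced precision.
x = torch.tensor([70000.0])
print(x.to(torch.float16))   # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16)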

zishen-ucap commented 4 weeks ago

One thing I see wrong is that you have fp16 selected; it should be bf16 for the model.

I changed it to bf16 and the result is still similar...

https://github.com/user-attachments/assets/6d85dff6-2bc6-4b57-a78e-05f4b0b81a57

kijai commented 4 weeks ago

It's clearly different noise, as the scene is totally different; you probably can't compare 1:1 between seeds.
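
One common reason the same seed produces different noise across implementations is that the noise may be drawn by different RNGs (CPU vs. CUDA) or in a different order and shape. A minimal illustration, not specific to either pipeline:

import torch

# The same integer seed does not guarantee identical noise across
# implementations: CPU and CUDA use different RNG algorithms, so the same
# seed followed by the same randn call gives different values per device.
torch.manual_seed(42)
cpu_noise = torch.randn(4)

torch.manual_seed(42)
cuda_noise = torch.randn(4, device="cuda")

print(cpu_noise)
print(cuda_noise.cpu())  # generally different values despite the same seed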

kijai commented 3 weeks ago

Oh, and also I believe the original repo still uses flash_attention by default if it's available.
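
For reference, the kind of backend selection being discussed usually looks like the sketch below: use flash-attn if it is installed, otherwise fall back to PyTorch SDPA. This is a generic illustration, not the wrapper's actual code; the two backends compute the same attention but can differ slightly in bf16 numerics.

import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # expects (batch, seq, heads, head_dim)
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def attention(q, k, v):
    # q, k, v: (batch, seq_len, num_heads, head_dim)
    if HAS_FLASH_ATTN:
        return flash_attn_func(q, k, v)
    # F.scaled_dot_product_attention expects (batch, num_heads, seq_len, head_dim)
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    )
    return out.transpose(1, 2)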

jpgallegoar commented 3 weeks ago

GenmoAI uses 200 steps, try it for one generation and see

zishen-ucap commented 3 weeks ago

Oh, and also I believe the original repo still uses flash_attention by default if it's available.

I have tried replacing it with flash attention, but the result is almost the same as with sdpa

ptits commented 3 weeks ago

genmo also uses their own vae_tiling, I guess

kijai commented 3 weeks ago

genmo also uses their own vae_tiling, I guess

They did not even have VAE tiling in the code; they implemented the same approach I did, and they also adopted other things from this repo. Their online demo doesn't use tiling, as it's not needed on those GPUs, and they also do an upscale pass there, so those results can't be directly compared.
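
For readers unfamiliar with the technique: tiled VAE decoding splits the latent spatially into overlapping tiles, decodes each tile separately to bound peak VRAM, and blends the overlaps back together. A simplified sketch of the general idea (hypothetical `vae.decode`, simple averaging instead of the feathered blending and temporal tiling a real implementation would use):

import torch

def tiled_vae_decode(vae, latents, tile=32, overlap=8, spatial_scale=8):
    # latents: (B, C, T, H, W) video latent; decode overlapping spatial
    # tiles and average the overlapping regions of the decoded output.
    B, C, T, H, W = latents.shape
    step = tile - overlap
    out = None
    weight = None
    for y in range(0, H, step):
        for x in range(0, W, step):
            tile_lat = latents[:, :, :, y:y + tile, x:x + tile]
            dec = vae.decode(tile_lat)  # assumed (B, 3, T_out, h*scale, w*scale)
            if out is None:
                out = torch.zeros(B, 3, dec.shape[2], H * spatial_scale,
                                  W * spatial_scale, device=dec.device, dtype=dec.dtype)
                weight = torch.zeros_like(out)
            ys, xs = y * spatial_scale, x * spatial_scale
            out[..., ys:ys + dec.shape[-2], xs:xs + dec.shape[-1]] += dec
            weight[..., ys:ys + dec.shape[-2], xs:xs + dec.shape[-1]] += 1
    return out / weight.clamp(min=1)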

ptits commented 3 weeks ago

They started adding tiling for the single-GPU pipeline, but the code is incomplete.

ptits commented 3 weeks ago

they also do an upscale pass there

Do you mean upscaling in the website generation, or in the code in their repo?

kijai commented 3 weeks ago

they also do an upscale pass there

Do you mean upscaling in the website generation, or in the code in their repo?

I mean the website generation.

ptits commented 3 weeks ago

Yep, they also use their autoprompter to expand the user prompt. It's difficult to reproduce their results.

zishen-ucap commented 3 weeks ago

genmo also uses their own vae_tiling, I guess

They did not even have VAE tiling in the code; they implemented the same approach I did, and they also adopted other things from this repo. Their online demo doesn't use tiling, as it's not needed on those GPUs, and they also do an upscale pass there, so those results can't be directly compared.

I think there is a high probability that precision issues lead to different results, but I never expected the final inference results to differ so much. I printed both the inputs and the outputs, and found that with identical inputs there is already an error on the order of 0.01 in the output after only one step. These are my debug prints in model_fn:

def model_fn(*, z, sigma, cfg_scale):
    # Runs the DiT on the (conditional, unconditional) batch and combines
    # the two predictions with classifier-free guidance.
    #print("z", z.dtype, z.device)
    #print("sigma", sigma.dtype, sigma.device)
    self.dit.to(self.device)
    if batch_cfg:
        # Debug prints: inputs fed to the DiT at this step
        print(f'z[0][1][2][3][0:4]={z[0][1][2][3][0:4]}')
        print(f'sigma={sigma}')
        print(f'sample_batched["y_feat"][0][1][2][0:3]={sample_batched["y_feat"][0][1][2][0:3]}')
        print(f'sample_batched["y_mask"][0][125]={sample_batched["y_mask"][0][1][125]}')
        print(f'sample_batched["packed_indices"]["cu_seqlens_kv"]={sample_batched["packed_indices"]["cu_seqlens_kv"]}')
        with torch.autocast("cuda", dtype=torch.bfloat16):
            out = self.dit(z, sigma, **sample_batched)
        out_cond, out_uncond = torch.chunk(out, chunks=2, dim=0)
        # Debug prints: outputs of the DiT at this step
        print(f'out_cond[0][1][2][3][0:4]={out_cond[0][1][2][3][0:4]}')
        print(f'out_uncond[0][1][2][3][0:4]={out_uncond[0][1][2][3][0:4]}')
    else:
        nonlocal sample, sample_null
        with torch.autocast("cuda", dtype=torch.bfloat16):
            out_cond = self.dit(z, sigma, **sample)
            out_uncond = self.dit(z, sigma, **sample_null)
    assert out_cond.shape == out_uncond.shape

    # Classifier-free guidance combine
    return out_uncond + cfg_scale * (out_cond - out_uncond), out_cond

This is the output from ComfyUI:

z[0][1][2][3][0:4]=tensor([ 0.1773, -0.2844,  1.0710, -1.0579], device='cuda:0')
sigma=tensor([1., 1.], device='cuda:0')
sample_batched["y_feat"][0][1][2][0:3]=tensor([-0.0096, -0.0063, -0.0182], device='cuda:0')
sample_batched["y_mask"][0][125]=False
sample_batched["packed_indices"]["cu_seqlens_kv"]=tensor([    0, 44682, 89202], device='cuda:0', dtype=torch.int32)
out_cond[0][1][2][3][0:4]=tensor([ 0.0537,  0.5039, -0.8242,  1.2969], device='cuda:0',
       dtype=torch.bfloat16)
out_uncond[0][1][2][3][0:4]=tensor([-0.2021,  0.2559, -1.1094,  1.0078], device='cuda:0',
       dtype=torch.bfloat16)

This is the output from Genmo:

(T2VSynthMochiModel pid=1020143) z[0][1][2][3][0:4]=tensor([ 0.1773, -0.2844,  1.0710, -1.0579], device='cuda:0')
(T2VSynthMochiModel pid=1020143) sigma=tensor([1., 1.], device='cuda:0')
(T2VSynthMochiModel pid=1020143) sample_batched["y_feat"][0][1][2][0:3]=tensor([-0.0096, -0.0063, -0.0182], device='cuda:0')
(T2VSynthMochiModel pid=1020143) sample_batched["y_mask"][0][125]=False
(T2VSynthMochiModel pid=1020143) sample_batched["packed_indices"]["cu_seqlens_kv"]=tensor([    0, 44682, 89202], device='cuda:0', dtype=torch.int32)
(T2VSynthMochiModel pid=1020143) out_cond[0][1][2][3][0:4]=tensor([ 0.0703,  0.5156, -0.8477,  1.2969], device='cuda:0',
(T2VSynthMochiModel pid=1020143)        dtype=torch.bfloat16)
(T2VSynthMochiModel pid=1020143) out_uncond[0][1][2][3][0:4]=tensor([-0.2168,  0.2207, -1.1328,  1.0078], device='cuda:0',
(T2VSynthMochiModel pid=1020143)        dtype=torch.bfloat16)
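
For what it's worth, plugging the logged slices into a quick comparison shows the two pipelines already diverge by a few hundredths after a single DiT call on identical inputs, which is the kind of gap that different attention kernels and bf16 accumulation order can produce; over 50+ sampling steps this can compound into visibly different videos:

import torch

# First-step output slices copied from the two logs above
comfy_cond   = torch.tensor([ 0.0537, 0.5039, -0.8242, 1.2969])
genmo_cond   = torch.tensor([ 0.0703, 0.5156, -0.8477, 1.2969])
comfy_uncond = torch.tensor([-0.2021, 0.2559, -1.1094, 1.0078])
genmo_uncond = torch.tensor([-0.2168, 0.2207, -1.1328, 1.0078])

print((comfy_cond - genmo_cond).abs().max())      # ~0.0235
print((comfy_uncond - genmo_uncond).abs().max())  # ~0.0352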