That first video is not loading for me.
I have re-edited it, and the first video should load now.
The prompt is:
A stark white cat with piercing green eyes stealthily creeps along the glossy, dark wooden floorboards of an opulent European-style parlor. Intricate wainscoting and luxurious wallpaper frame the scene. In a heart-racing moment, the cat pounces with precision towards a unsuspecting brown mouse near an ornate, mahogany furniture piece. The high-definition footage captures every nuanced movement, from the flutter of the cat's whiskers to the mouse's flickering tail, as an antique grandfather clock ticks rhythmically in the background. The elegant room's vintage charm contrasts with the primal, instinctive encounter unfolding within it.
Ah, well, it's hard to tell where the difference comes from; it looks more like just a different seed/noise. Lots of things are slightly different from the original.
Also, the step count you show in the parameters doesn't match at all?
This is a video with steps set to 50
https://github.com/user-attachments/assets/d0be3f38-95c1-43f0-8bb7-1a0af753c605
I don't think the number of steps is the main factor affecting the overall quality of the video. I set the steps to 50, matching the setting in ComfyUI, which seemed to only influence the details. At one point, I wondered if there might be an issue with the CLIP weights I downloaded, but the red panda video I generated appears to match the demo you provided.
https://github.com/user-attachments/assets/46b1a8c6-ad5a-4060-92b6-9484cdc6199b
One thing I see wrong is that you have fp16 selected, it should be bf16 for the model.
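For reference, a quick illustration (not from either repo) of why the choice matters: fp16 has more mantissa bits but a very small range, while bf16 keeps the fp32 exponent range with coarser rounding, so loading bf16-trained weights as fp16 can shift or even overflow values.

import torch

# Minimal sketch: round-trip the same fp32 values through fp16 and bf16.
# fp16 overflows past ~65504; bf16 never overflows here but rounds more coarsely.
x = torch.tensor([1e-4, 0.1773, 1.0710, 70000.0])
print("fp16:", x.to(torch.float16).to(torch.float32))   # last value becomes inf
print("bf16:", x.to(torch.bfloat16).to(torch.float32))  # coarser values, no inf

On its own this doesn't explain a completely different scene, but it is the kind of thing that nudges every layer's output.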
I changed it to bf16 and the result is still similar...
https://github.com/user-attachments/assets/6d85dff6-2bc6-4b57-a78e-05f4b0b81a57
It's clearly different noise, as the scene is totally different; you probably can't compare 1:1 between seeds.
Oh and also I believe the original repo still uses flash_attention as default if it's available.
GenmoAI uses 200 steps, try it for one generation and see
I have tried replacing it with flash attention, but the result is almost the same as with sdpa.
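As a rough sanity check (a sketch assuming a recent PyTorch with a CUDA GPU, not code from either repo), the flash and math SDPA kernels can be compared directly on random tensors; the gap should stay around bf16 rounding level rather than anything scene-changing:

import torch
import torch.nn.functional as F

# Compare PyTorch's flash-attention SDPA kernel against the math fallback on the
# same random inputs. Shapes are arbitrary. Note: torch.backends.cuda.sdp_kernel
# is deprecated in newer PyTorch in favor of torch.nn.attention.sdpa_kernel.
q, k, v = (torch.randn(1, 24, 256, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out_flash = F.scaled_dot_product_attention(q, k, v)

with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_math = F.scaled_dot_product_attention(q, k, v)

print("max abs diff:", (out_flash - out_math).abs().max().item())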
Genmo also uses their own VAE tiling, I guess.
genmo also use they own vae_tiling I guess
They did not even have VAE tiling in the code; they implemented the same approach I did, and they also implemented other things from this repo. Their online demo doesn't use tiling, as it's not needed on those GPUs, and they also do an upscale pass there, so those results can't be directly compared.
They started to add tiling for the single-GPU pipeline, but the code is incomplete.
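For anyone following along, spatial VAE tiling in general just means decoding the latent in overlapping tiles and combining the overlaps, so the whole frame never has to be decoded in one pass. A rough sketch of the idea; decode_fn, the tile sizes, and the 8x spatial scale are placeholder assumptions, not either repo's actual API:

import torch

def tiled_decode(latent, decode_fn, tile=64, overlap=16, scale=8):
    """Decode a latent (B, C, H, W) in overlapping spatial tiles and average the
    overlapping pixel regions. decode_fn maps a latent tile to pixels at scale x
    the spatial size; all names and sizes here are placeholders."""
    B, C, H, W = latent.shape
    out = torch.zeros(B, 3, H * scale, W * scale, device=latent.device)
    count = torch.zeros_like(out)
    step = tile - overlap
    # Tile origins, always including one flush with the bottom/right edge.
    ys = sorted(set(list(range(0, max(H - tile, 0) + 1, step)) + [max(H - tile, 0)]))
    xs = sorted(set(list(range(0, max(W - tile, 0) + 1, step)) + [max(W - tile, 0)]))
    for y in ys:
        for x in xs:
            pix = decode_fn(latent[:, :, y:y + tile, x:x + tile])
            h, w = pix.shape[-2:]
            out[:, :, y * scale:y * scale + h, x * scale:x * scale + w] += pix
            count[:, :, y * scale:y * scale + h, x * scale:x * scale + w] += 1
    return out / count

Real implementations usually feather the overlap with a ramp mask instead of plain averaging to hide seams, and since Mochi's VAE is also temporal, tiling can apply along time as well as height and width, so two tiling schemes can easily produce slightly different frames.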
they also do an upscale pass there
Do you mean upscaling in the website generation, or in the code in their repo?
I mean the website generation.
Yep, they also use their autoprompter to expand the user prompt. It's difficult to reproduce their results.
I think there is a high probability that a precision issue leads to different results, but I never expected the final inference results to differ so much. I printed out both the inputs and outputs and found that, with identical inputs, the outputs already differ at the 0.01 level after only one step. These are my debug prints in model_fn:
def model_fn(*, z, sigma, cfg_scale):
    #print("z", z.dtype, z.device)
    #print("sigma", sigma.dtype, sigma.device)
    self.dit.to(self.device)
    if batch_cfg:
        print(f'z[0][1][2][3][0:4]={z[0][1][2][3][0:4]}')
        print(f'sigma={sigma}')
        print(f'sample_batched["y_feat"][0][1][2][0:3]={sample_batched["y_feat"][0][1][2][0:3]}')
        print(f'sample_batched["y_mask"][0][125]={sample_batched["y_mask"][0][1][125]}')
        print(f'sample_batched["packed_indices"]["cu_seqlens_kv"]={sample_batched["packed_indices"]["cu_seqlens_kv"]}')
        with torch.autocast("cuda", dtype=torch.bfloat16):
            out = self.dit(z, sigma, **sample_batched)
        out_cond, out_uncond = torch.chunk(out, chunks=2, dim=0)
        print(f'out_cond[0][1][2][3][0:4]={out_cond[0][1][2][3][0:4]}')
        print(f'out_uncond[0][1][2][3][0:4]={out_uncond[0][1][2][3][0:4]}')
    else:
        nonlocal sample, sample_null
        with torch.autocast("cuda", dtype=torch.bfloat16):
            out_cond = self.dit(z, sigma, **sample)
            out_uncond = self.dit(z, sigma, **sample_null)
    assert out_cond.shape == out_uncond.shape
    return out_uncond + cfg_scale * (out_cond - out_uncond), out_cond
This is the result from ComfyUI:
z[0][1][2][3][0:4]=tensor([ 0.1773, -0.2844, 1.0710, -1.0579], device='cuda:0')
sigma=tensor([1., 1.], device='cuda:0')
sample_batched["y_feat"][0][1][2][0:3]=tensor([-0.0096, -0.0063, -0.0182], device='cuda:0')
sample_batched["y_mask"][0][125]=False
sample_batched["packed_indices"]["cu_seqlens_kv"]=tensor([ 0, 44682, 89202], device='cuda:0', dtype=torch.int32)
out_cond[0][1][2][3][0:4]=tensor([ 0.0537, 0.5039, -0.8242, 1.2969], device='cuda:0',
dtype=torch.bfloat16)
out_uncond[0][1][2][3][0:4]=tensor([-0.2021, 0.2559, -1.1094, 1.0078], device='cuda:0',
dtype=torch.bfloat16)
This is the result of Genmo:
(T2VSynthMochiModel pid=1020143) z[0][1][2][3][0:4]=tensor([ 0.1773, -0.2844, 1.0710, -1.0579], device='cuda:0')
(T2VSynthMochiModel pid=1020143) sigma=tensor([1., 1.], device='cuda:0')
(T2VSynthMochiModel pid=1020143) sample_batched["y_feat"][0][1][2][0:3]=tensor([-0.0096, -0.0063, -0.0182], device='cuda:0')
(T2VSynthMochiModel pid=1020143) sample_batched["y_mask"][0][125]=False
(T2VSynthMochiModel pid=1020143) sample_batched["packed_indices"]["cu_seqlens_kv"]=tensor([ 0, 44682, 89202], device='cuda:0', dtype=torch.int32)
(T2VSynthMochiModel pid=1020143) out_cond[0][1][2][3][0:4]=tensor([ 0.0703, 0.5156, -0.8477, 1.2969], device='cuda:0',
(T2VSynthMochiModel pid=1020143) dtype=torch.bfloat16)
(T2VSynthMochiModel pid=1020143) out_uncond[0][1][2][3][0:4]=tensor([-0.2168, 0.2207, -1.1328, 1.0078], device='cuda:0',
(T2VSynthMochiModel pid=1020143) dtype=torch.bfloat16)
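Since a handful of printed values only goes so far, one way to make this concrete (a hypothetical helper, not part of either codebase) is to torch.save the same tensor from both runs and report error statistics:

import torch

def report_diff(name, a, b):
    """Compare the same tensor captured from the two pipelines (e.g. out_cond
    saved with torch.save in ComfyUI and in the genmoai CLI) and print error
    statistics instead of eyeballing a few printed elements."""
    a, b = a.float(), b.float()
    diff = (a - b).abs()
    rel = diff / b.abs().clamp(min=1e-6)
    print(f"{name}: max abs {diff.max().item():.3e}, mean abs {diff.mean().item():.3e}, "
          f"max rel {rel.max().item():.3e}, allclose(atol=1e-2)={torch.allclose(a, b, atol=1e-2)}")

# Example usage (file paths are placeholders):
# report_diff("out_cond", torch.load("comfy_out_cond.pt"), torch.load("genmo_out_cond.pt"))

Per-step differences at the 0.01 level are plausible just from bf16 autocast plus different attention/matmul kernels, but they compound over 50-200 sampling steps, so numbers like these per layer would help tell kernel-level noise apart from a real implementation difference.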
Thank you very much for your open-source project. However, while working with it, I found that the inference results from the ComfyUI node you provided and from genmoai's Mochi are different, and I don't know where my setup went wrong.
These are the parameters I set in https://github.com/genmoai/models/blob/main/demos/cli.py
This is the setting in ComfyUI, and the prompt is the same.
https://github.com/user-attachments/assets/461850ed-e465-4832-850f-153308717fb7
This is the result from Genmoai
https://github.com/user-attachments/assets/41377785-43a5-476a-8f34-c6d473f18022
This is the result from ComfyUI