kijai / ComfyUI-PyramidFlowWrapper

MIT License

Comfy and the gradio demo from Pyramid-Flow give very different results #38

Open ptits opened 3 weeks ago

ptits commented 3 weeks ago

I see a difference in the text encoder setup.

I use t5xxl and clip_l with the DualCLIPLoader from your example.

They use their own text encoders (two files, I guess the same ones) and openai/clip-vit-large-patch14.

What could cause such a difference?

The gradio demo gives much better quality.

[screenshot: shot_241101_010322]

ptits commented 3 weeks ago

https://github.com/user-attachments/assets/baebb2a5-ca77-4d53-baf3-0adb33b5b126

ptits commented 3 weeks ago

https://github.com/user-attachments/assets/f4593f1b-0a01-4f70-b0c3-ef66b4a568b8

ptits commented 3 weeks ago

seed = 0, prompt: Two white women with long, flowing blonde hair walking side by side along a wide, sandy beach on a bright, sunny day. Both are mid-laugh, their expressions full of joy and friendship, as they walk in sync, close together, barefoot on the warm sand. The sunlight casts a golden glow over their hair, which flows slightly in

ptits commented 3 weeks ago

[screenshot: shot_241101_010950]

kijai commented 3 weeks ago

The text encoder is the same one. You don't have a negative prompt; they always add the prompts I have set as defaults:

[screenshot]
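For anyone following along, the point is that the official pipeline silently concatenates default positive and negative prompt text onto whatever you type, while the Comfy workflow only encodes exactly what you give it. Below is a minimal Python sketch of that pattern; it is not the actual Pyramid-Flow code, and the default strings are placeholders standing in for the real ones defined in pyramid_dit_for_video_gen_pipeline.py.

```python
# Minimal sketch (NOT the actual Pyramid-Flow code) of a pipeline appending
# default prompt text before encoding. The real strings are defined in
# pyramid_dit_for_video_gen_pipeline.py; the ones below are placeholders.

DEFAULT_POSITIVE_SUFFIX = ", hyper quality, Ultra HD, 8K"        # placeholder
DEFAULT_NEGATIVE_PROMPT = "worst quality, low quality, blurry"   # placeholder

def build_prompts(user_prompt: str, user_negative: str = "") -> tuple[str, str]:
    """Combine the user's prompt with the pipeline's built-in defaults."""
    positive = user_prompt.rstrip() + DEFAULT_POSITIVE_SUFFIX
    negative = user_negative.strip() or DEFAULT_NEGATIVE_PROMPT
    return positive, negative

pos, neg = build_prompts("Two white women walking side by side along a sandy beach")
print(pos)
print(neg)
```

To get matching results in Comfy you paste the equivalent additions into the positive and negative prompt fields yourself, as shown in the screenshot above.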

ptits commented 3 weeks ago

It looks like the issue is in the text encoding.

The girls move toward the camera with any seed in Comfy, and away from the camera in gradio with any seed and CFG.

kijai commented 3 weeks ago

> It looks like the issue is in the text encoding.
>
> The girls move toward the camera with any seed in Comfy, and away from the camera in gradio with any seed and CFG.

Just tried your prompt + the defaults, and with every single seed they walk away from the camera, even when prompted otherwise (which is pretty weird).

[screenshot]

https://github.com/user-attachments/assets/b5575935-46ee-44d0-a945-2fefcfa7eec3

ptits commented 3 weeks ago

Do they use the DualCLIPLoader technique?

ptits commented 3 weeks ago

lady smiling and tilt her head

They are similar, but the quality...

https://github.com/user-attachments/assets/7ecdab1f-b3d7-4380-a2df-dee3e090b9f7

https://github.com/user-attachments/assets/a90cb0f9-72bc-490b-927e-735bc028c097

ptits commented 3 weeks ago

[screenshot: shot_241101_024458]

kijai commented 3 weeks ago

I'm not sure you're understanding me: they always add those extra prompts. To match their results more closely, you prompt like this:

[screenshot]

https://github.com/user-attachments/assets/1f4f1279-db40-473e-a66d-1644b97c9642

kijai commented 3 weeks ago

> Do they use the DualCLIPLoader technique?

Yes.
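For context on what that dual-encoder route does under the hood: the prompt is encoded twice, once by T5-XXL for the long per-token embeddings and once by CLIP-L for a pooled embedding. Below is a rough sketch with Hugging Face transformers; the model IDs, sequence lengths, and shapes are assumptions for illustration, not necessarily the exact checkpoints the wrapper or the gradio demo load.

```python
# Rough sketch of "dual" text encoding (T5-XXL + CLIP-L), the same idea the
# DualCLIPLoader workflow relies on. Model IDs and max lengths are assumptions.
from transformers import (CLIPTextModelWithProjection, CLIPTokenizer,
                          T5EncoderModel, T5TokenizerFast)

prompt = "lady smiling and tilting her head"

# CLIP-L: short token window, pooled/projected sentence embedding
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")
clip_ids = clip_tok(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt")
pooled = clip_enc(**clip_ids).text_embeds              # shape (1, 768)

# T5-XXL: long per-token embeddings
t5_tok = T5TokenizerFast.from_pretrained("google/t5-v1_1-xxl")
t5_enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
t5_ids = t5_tok(prompt, padding="max_length", max_length=128,
                truncation=True, return_tensors="pt")
tokens = t5_enc(**t5_ids).last_hidden_state            # shape (1, 128, 4096)

print(pooled.shape, tokens.shape)
```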

ptits commented 3 weeks ago

Yes, you are absolutely correct, my bad.

I added the positive prompt addition and the negative prompt from their actual code in pyramid_dit_for_video_gen_pipeline.py

and got very similar results. Thank you very much for your patience and explanations.

The video movements are slightly different, but the quality is much the same.

https://github.com/user-attachments/assets/1d791509-1316-4205-b1f0-c018e207229b

https://github.com/user-attachments/assets/0f4b5e60-83bb-48ce-89be-97f85fc9747d

seed 0 [screenshot: shot_241101_184202]

kijai commented 3 weeks ago

There is actually still one small difference in the text encoding: Comfy runs it at fp16 while they run it at bf16. It shouldn't make a huge difference, but there's a slight difference anyway. Which is better, no idea; last I talked with comfy he said text encoders generally run better at fp16.
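If anyone wants to see what that precision gap amounts to, here is a tiny sketch: fp16 has more mantissa bits (finer rounding, narrower range), while bf16 keeps fp32's exponent range but rounds more coarsely, so the same weights and activations come out slightly different in each format. The `text_encoder` name at the end is a hypothetical stand-in for whichever encoder object is loaded, not a real variable from the wrapper.

```python
# Illustration of the fp16 vs bf16 difference: same values, slightly different
# rounding. Which format a given text encoder prefers is an empirical question.
import torch

w = torch.randn(4, dtype=torch.float32) * 1e-3
print("fp32:", w)
print("fp16:", w.to(torch.float16).to(torch.float32))    # finer rounding, narrower range
print("bf16:", w.to(torch.bfloat16).to(torch.float32))   # coarser rounding, fp32 range

# Casting an already-loaded encoder is just a dtype move; `text_encoder` is a
# hypothetical stand-in for whatever encoder module you have loaded.
# text_encoder = text_encoder.to(torch.bfloat16)   # what the gradio demo runs
# text_encoder = text_encoder.to(torch.float16)    # what Comfy defaults to
```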