ptits opened this issue 3 weeks ago
seed = 0, prompt: Two white women with long, flowing blonde hair walking side by side along a wide, sandy beach on a bright, sunny day. Both are mid-laugh, their expressions full of joy and friendship, as they walk in sync, close together, barefoot on the warm sand. The sunlight casts a golden glow over their hair, which flows slightly in
The text encoder is the same one. You don't have a negative prompt, though; they always add the prompts that I have as defaults:
it looks like text encoding is here
The girls move toward the camera with any seed in Comfy, and away from the camera in gradio with any seed and cfg.
Just tried your prompt plus the defaults, and with every single seed they walk away from the camera, even when prompted otherwise (which is pretty weird).
https://github.com/user-attachments/assets/b5575935-46ee-44d0-a945-2fefcfa7eec3
Do they use the DualClipLoader technique?
lady smiling and tilt her head
they are similar - but quality...
https://github.com/user-attachments/assets/7ecdab1f-b3d7-4380-a2df-dee3e090b9f7
https://github.com/user-attachments/assets/a90cb0f9-72bc-490b-927e-735bc028c097
I'm not sure you are understanding me: they always add those extra prompts. To match their results more closely, prompt like this:
https://github.com/user-attachments/assets/1f4f1279-db40-473e-a66d-1644b97c9642
Do they use the DualClipLoader technique?
Yes.
Yes, you are absolutely correct, my bad.
I added the positive addition and the negative prompt from their actual code in pyramid_dit_for_video_gen_pipeline.py
and got very similar results. Thank you very much for your patience and explanations.
The video movements are slightly different, but the quality is much the same.
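For anyone trying to reproduce this, here is a minimal sketch of what "adding their defaults" amounts to on the Comfy side. The two strings are hypothetical placeholders, not the real defaults; copy the actual values from pyramid_dit_for_video_gen_pipeline.py. The helper name `build_prompts` and the choice to append (rather than prepend) the positive addition are also assumptions made for illustration.

```python
# Hypothetical placeholders: these are NOT the real default strings.
# Copy the actual values from pyramid_dit_for_video_gen_pipeline.py.
POSITIVE_SUFFIX = ", <positive addition from the pipeline>"
NEGATIVE_PROMPT = "<negative prompt from the pipeline>"


def build_prompts(user_prompt: str, user_negative: str = "") -> tuple[str, str]:
    """Return (positive, negative) prompts with the defaults applied.

    Assumption: the positive addition is appended to the user's prompt,
    and the default negative prompt is used only when none is given.
    """
    positive = user_prompt.strip() + POSITIVE_SUFFIX
    negative = user_negative.strip() or NEGATIVE_PROMPT
    return positive, negative


positive, negative = build_prompts("lady smiling and tilt her head")
print(positive)
print(negative)
```

The two resulting strings would then go into the positive and negative text encode inputs of the workflow.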
https://github.com/user-attachments/assets/1d791509-1316-4205-b1f0-c018e207229b
https://github.com/user-attachments/assets/0f4b5e60-83bb-48ce-89be-97f85fc9747d
seed 0
There is actually still one small difference in the text encoding: Comfy runs it at fp16 while they run it at bf16. It shouldn't make a huge difference, but there is a slight one anyway. Which is better, I have no idea; last I talked with comfy, he said text encoders generally run better at fp16.
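To give a feel for how small the fp16 vs bf16 gap is, here is a minimal sketch using only the Python standard library. It rounds a single number to each format and compares the error. The helper names `to_fp16` and `to_bf16` are made up for this example, the bf16 rounding is emulated by hand (round to nearest even on a float32 bit pattern, with no NaN or infinity handling), and it says nothing about which dtype gives better videos.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits).

    Emulated: take the float32 bit pattern, round to nearest even on the
    16 low bits, then zero them. NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


x = 0.1
print("fp16:", to_fp16(x), "error:", abs(to_fp16(x) - x))
print("bf16:", to_bf16(x), "error:", abs(to_bf16(x) - x))
```

For values in this range fp16 keeps more mantissa bits, so its rounding error is smaller; bf16 trades that precision for the same exponent range as fp32.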
I see a difference in the text encoder setup:
I use t5xxl and clip_l with the Dual Clip Loader from your example.
They use their own text encoders (two files, I guess the same ones) and openai/clip-vit-large-patch14.
What can cause such a difference?
The gradio demo gives much better quality.