Example: no lip sync - Githubissues

samhodge-aiml commented 2 weeks ago

This is the output

https://github.com/user-attachments/assets/efca7156-0444-49cd-ba67-a3805df07723

From this input

https://github.com/user-attachments/assets/ac648752-ca61-4f26-b1fe-cf3097ab56c2

using this script

photoreal_78_frame.json

What could make the lipsync work? And how could I get the same look for the duration of the video?

Sam

logtd commented 2 weeks ago

Hi, thanks for your interest.

The facial expressions you're trying to reproduce are relatively small and may require a lot of steps to be encoded. You may need 100-200 steps on the unsampler and resampler.

If you're making multiple clips, there's no way to make sure the video is cohesive. Currently Mochi supports ~200 frames (or something like that), but MochiEdit can only handle ~43 frames. When I get time, I plan to allow it to support the same number of frames as base Mochi.
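To make the frame limit concrete, here is a minimal sketch of how a longer sequence would have to be cut into clips of at most ~43 frames. The function name `split_into_clips` is hypothetical and not part of MochiEdit, and the 78-frame total is only an assumption based on the script filename. Each clip would be processed independently, which is why nothing ties the look of one clip to the next.

```python
def split_into_clips(total_frames: int, max_frames: int = 43) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, each at most max_frames long."""
    if total_frames < 0 or max_frames <= 0:
        raise ValueError("total_frames must be >= 0 and max_frames must be > 0")
    return [
        (start, min(start + max_frames, total_frames))
        for start in range(0, total_frames, max_frames)
    ]


# Assumed example: a 78-frame input with the ~43-frame limit mentioned above.
clips = split_into_clips(78)
print(clips)
```

With these assumed numbers the input splits into two clips, and the second is shorter than the first.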

samhodge commented 2 weeks ago

Thanks for the feedback. I will up it to 250 steps, let it cook overnight, and keep the batch size at 43 as you suggested.

samhodge-aiml commented 2 weeks ago

https://github.com/user-attachments/assets/4e8848da-0792-49f2-b07b-a7a75b2624fe

I am still unable to get lipsync with 300 steps. I will try again overnight with 600 steps and see if that works.

tuckerdarby commented 1 week ago

I don't think anything past 200 will help. The movement of the mouth while talking is really small, especially at Mochi's latent size, so it may not be possible. For example, Mochi compresses the image spatially by 8x, so the mouth movement will be very small in latent space.
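A rough back-of-the-envelope sketch of that 8x point follows. The 848x480 frame size and the 12-pixel lip movement are assumptions chosen for illustration, not measurements from the videos in this thread, and the helper names are hypothetical.

```python
SPATIAL_FACTOR = 8  # spatial compression factor mentioned above


def latent_size(width: int, height: int, factor: int = SPATIAL_FACTOR) -> tuple[int, int]:
    """Size of the latent grid for a frame of the given pixel size."""
    return width // factor, height // factor


def latent_extent(pixels: float, factor: int = SPATIAL_FACTOR) -> float:
    """How many latent cells a movement of `pixels` pixels spans."""
    return pixels / factor


# Assumed values: an 848x480 frame and a lip movement of about 12 pixels.
print(latent_size(848, 480))   # latent grid (width, height)
print(latent_extent(12))       # movement measured in latent cells
```

Under these assumptions the lip movement covers only one to two latent cells, which is consistent with the expectation that extra steps alone will not recover it.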

samhodge-aiml commented 1 week ago

The run with 600 steps was interrupted.

Is there any way to stop Mochi Edit from doing the 8x downsample?

Sam

logtd commented 1 week ago

No, all latent diffusion models perform some kind of spatial downsampling to convert the image into latent space, and they're only trained on latent space.

samhodge-aiml commented 1 week ago

Thanks for the education. I can think about the impact of style transfer given this limitation.

samhodge-aiml commented 1 week ago

https://github.com/user-attachments/assets/6ecd574d-0792-41c9-aef3-03bd352993aa

Trying again with this prompt

photography of a european woman wearing a grey green turtleneck long sleeved top with shoulder length auburn hair in a fringe and a bun with hair below the bun, thick eyebrows and green grey eyes without make and a european man with dark brown short hair clean shaven wearing a green tshirt under a black suede vest with lapels, studs and fringing around the armholes, the man and woman are talking in a baroque room with wooden panelling, the room is decorated with a suit of armour and swords and scabbared in the left corner of the room is a stairway leading upward. The lighting in the room is dim at nighttime.