logtd / ComfyUI-MochiEdit

ComfyUI nodes to edit videos using Genmo Mochi
GNU General Public License v3.0

Example: no lip sync #12

Open samhodge-aiml opened 2 weeks ago

samhodge-aiml commented 2 weeks ago

This is the output

https://github.com/user-attachments/assets/efca7156-0444-49cd-ba67-a3805df07723

From this input

https://github.com/user-attachments/assets/ac648752-ca61-4f26-b1fe-cf3097ab56c2

using this script

photoreal_78_frame.json

What could make the lipsync work, and how could I get the same look for the duration of the video?

Sam

logtd commented 2 weeks ago

Hi, thanks for your interest.

The facial expressions you're trying to reproduce are relatively small and may require a lot of steps to be encoded. You may need 100-200 steps on the unsampler and resampler.

If you're making multiple clips, there's no way to make sure the video is cohesive across them. Currently Mochi supports ~200 frames (or something like that), but MochiEdit can only handle ~43 frames. When I get time, I plan to allow it to support the same number of frames as base Mochi.
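A minimal sketch of how a longer video could be split into clips that fit the ~43-frame limit mentioned above. The function name `chunk_ranges` and the 78-frame total (guessed from the workflow filename `photoreal_78_frame.json`) are assumptions for illustration, not part of MochiEdit. It only computes frame ranges and does nothing to keep the look consistent between clips.

```python
from typing import Iterator, Tuple


def chunk_ranges(total_frames: int, chunk_size: int = 43) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame index pairs, end exclusive.

    chunk_size=43 mirrors the approximate limit quoted in the thread;
    it is an assumption, not a value read from MochiEdit.
    """
    for start in range(0, total_frames, chunk_size):
        yield start, min(start + chunk_size, total_frames)


# Hypothetical 78-frame input split into clips of at most 43 frames.
for start, end in chunk_ranges(78):
    print(f"frames {start}..{end - 1} ({end - start} frames)")
```

With these assumed numbers the second clip is shorter than the first (35 frames instead of 43), which is worth keeping in mind when the clips are later laid end to end against an audio track.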

samhodge commented 2 weeks ago

Thanks for the feedback. I will up it to 250 steps and let it cook overnight, and keep the batch size at 43 as you suggested.

samhodge-aiml commented 2 weeks ago

https://github.com/user-attachments/assets/4e8848da-0792-49f2-b07b-a7a75b2624fe

I am still unable to get lipsync with 300 steps. I will try again overnight with 600 steps and see if that works.

tuckerdarby commented 1 week ago

I don't think anything past 200 will help. The movement of the mouth while talking is really small, especially at Mochi's latent size, so it may not be possible. For example, Mochi compresses the image spatially by 8x, so the mouth movement will be very small in latent space.
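To put rough numbers on the 8x point, here is a small sketch of the arithmetic. The pixel sizes are hypothetical values chosen for illustration and were not measured from the videos in this issue. Only the 8x spatial factor comes from the comment above.

```python
def latent_extent(pixels: float, spatial_factor: int = 8) -> float:
    """Approximate size, in latent cells, of a feature spanning `pixels` pixels."""
    return pixels / spatial_factor


# Hypothetical sizes for a face that is small in the frame.
mouth_width_px = 40   # assumed width of the mouth in pixels
mouth_open_px = 6     # assumed change in mouth opening between frames

print(latent_extent(mouth_width_px))  # 5.0 latent cells wide
print(latent_extent(mouth_open_px))   # 0.75 of a latent cell
```

Under these assumptions the frame-to-frame mouth motion is smaller than a single latent cell, which is consistent with the suggestion that more steps alone may not recover it, and that larger faces in frame give the model more to work with.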

samhodge-aiml commented 1 week ago

The run with 600 steps was interrupted.

Is there any way to stop MochiEdit from doing the 8x downsample?

Sam

logtd commented 1 week ago

No, all latent diffusion models perform some kind of spatial downsampling to convert the image into latent space, and they're only trained in latent space.

samhodge-aiml commented 1 week ago

Thanks for the education. I can think about the impact on style transfer given this limitation.

samhodge-aiml commented 1 week ago

https://github.com/user-attachments/assets/6ecd574d-0792-41c9-aef3-03bd352993aa

Trying again with this prompt

photography of a european woman wearing a grey green turtleneck long sleeved top with shoulder length auburn hair in a fringe and a bun with hair below the bun, thick eyebrows and green grey eyes without make and a european man with dark brown short hair clean shaven wearing a green tshirt under a black suede vest with lapels, studs and fringing around the armholes, the man and woman are talking in a baroque room with wooden panelling, the room is decorated with a suit of armour and swords and scabbared in the left corner of the room is a stairway leading upward. The lighting in the room is dim at nighttime.

samhodge-aiml commented 6 days ago

New prompt; the former was too long.

4k photography of a young woman with auburn hair wearing a greenish-grey long sleeved turtle neck talking to a young man clean shaven with short dark hair wearing a green shirt under a sleeveless black suede jacket in a cellar with wood panelling and a suit of armour and a winding carpeted wooden staircase

samhodge-aiml commented 6 days ago

https://github.com/user-attachments/assets/67d11c0a-12e5-4eb1-8c3e-6d0951bec878

Getting a lot of low-quality results similar to the attached image. It looks like it hasn't properly denoised the diffused image.

What is the solution for a result like this?

samhodge-aiml commented 6 days ago

https://github.com/user-attachments/assets/9ce55474-c39f-4e03-bf6c-1e0536a1a6e9

This is more of the result I was hoping for.

samhodge-aiml commented 6 days ago

from this JSON

current_attempt.json

I produced the following clips, which I laid out in order before adding the audio track:

https://github.com/user-attachments/assets/d38a9cf6-9f9b-443e-b9c8-41b00ff11418

There seems to be no coherence with the audio, even though the faces are much bigger.

samhodge-aiml commented 6 days ago

positive prompt

4k photography of a young woman with auburn hair wearing a greenish-grey long sleeved turtle neck talking to a young man clean shaven with short dark hair wearing a green shirt under a sleeveless black suede jacket in a cellar with wood panelling and a suit of armour and a winding carpeted wooden staircase

negative prompt

flash photography, cartoon, computer graphics, drawing, animation, anime