Open samhodge-aiml opened 2 weeks ago
Hi, thanks for your interest.
The facial expressions you're trying to reproduce are relatively small and may require a lot of steps to encode. You may need 100-200 steps on both the unsampler and the resampler.
If you're making multiple clips, there's no way to make sure the video is cohesive across them. Currently Mochi supports ~200 frames (or something like that), but MochiEdit can only handle ~43 frames. When I get time I plan to allow it to support the same number as base Mochi.
Thanks for the feedback. I will up it to 250 steps and let it cook overnight, keeping the batch size at 43 as you suggested.
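To make the clip limit concrete, here is a rough sketch of how a longer video would have to be cut up. The 43-frame limit is the approximate figure mentioned above; `split_into_clips` and the 100-frame example are hypothetical and not part of MochiEdit. Each range would be processed on its own, which is why nothing ties the look of one clip to the next.

```python
def split_into_clips(total_frames: int, clip_len: int = 43) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, each at most clip_len long.

    clip_len=43 is the approximate MochiEdit limit quoted in this thread.
    """
    return [
        (start, min(start + clip_len, total_frames))
        for start in range(0, total_frames, clip_len)
    ]


# Hypothetical 100-frame source video: three independently processed clips.
for start, end in split_into_clips(100):
    print(f"clip covers frames {start}..{end - 1} ({end - start} frames)")
```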
https://github.com/user-attachments/assets/4e8848da-0792-49f2-b07b-a7a75b2624fe
I am still unable to get lipsync with 300 steps. I will try again overnight with 600 steps and see if that works.
I don't think anything past 200 will help. The movement of a talking mouth is really small, especially at Mochi's latent size, so it may not be possible. For example, Mochi compresses the image spatially by 8x, so the mouth movement will be very small in latent space.
The run with 600 steps was interrupted.
Is there any way to stop Mochi Edit from doing the 8x downsample?
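As a back-of-the-envelope illustration of the 8x point, the arithmetic looks roughly like this. The mouth sizes are made-up numbers for a wide shot, not measurements from the clips in this thread, and `latent_extent` is a hypothetical helper.

```python
def latent_extent(pixels: float, spatial_factor: int = 8) -> float:
    """Size of a pixel-space span after spatial compression by spatial_factor."""
    return pixels / spatial_factor


# Hypothetical numbers: a mouth about 40 px wide that opens about 6 px while talking.
mouth_width_px = 40
mouth_opening_px = 6

print(f"mouth width in latent cells:   {latent_extent(mouth_width_px)}")
print(f"mouth opening in latent cells: {latent_extent(mouth_opening_px)}")
```

Under those assumptions the whole mouth spans about 5 latent cells and the opening motion is less than one cell, which is why adding more steps is unlikely to recover it. A tighter framing on the face would give the mouth more latent cells to work with.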
Sam
No, all latent diffusion models perform some kind of spatial downsampling to convert the image into latent space, and they're only trained on latent space.
Thanks for the education. I can think about the impact on style transfer given this limitation.
https://github.com/user-attachments/assets/6ecd574d-0792-41c9-aef3-03bd352993aa
Trying again with this prompt
photography of a european woman wearing a grey green turtleneck long sleeved top with shoulder length auburn hair in a fringe and a bun with hair below the bun, thick eyebrows and green grey eyes without make and a european man with dark brown short hair clean shaven wearing a green tshirt under a black suede vest with lapels, studs and fringing around the armholes, the man and woman are talking in a baroque room with wooden panelling, the room is decorated with a suit of armour and swords and scabbared in the left corner of the room is a stairway leading upward. The lighting in the room is dim at nighttime.
New prompt, since the former was too long:
4k photography of a young woman with auburn hair wearing a greenish-grey long sleeved turtle neck talking to a young man clean shaven with short dark hair wearing a green shirt under a sleeveless black suede jacket in a cellar with wood panelling and a suit of armour and a winding carpeted wooden staircase
https://github.com/user-attachments/assets/67d11c0a-12e5-4eb1-8c3e-6d0951bec878
I'm getting a lot of low-quality results similar to the attached image. It looks like the diffused image hasn't been properly denoised.
What is the solution for a result like this?
https://github.com/user-attachments/assets/9ce55474-c39f-4e03-bf6c-1e0536a1a6e9
This is more of the result I was hoping for.
from this JSON
I produced the following clips, which I laid out in order and added the audio track
https://github.com/user-attachments/assets/d38a9cf6-9f9b-443e-b9c8-41b00ff11418
There seems to be no coherence with the audio, even though the faces are much bigger.
positive prompt
4k photography of a young woman with auburn hair wearing a greenish-grey long sleeved turtle neck talking to a young man clean shaven with short dark hair wearing a green shirt under a sleeveless black suede jacket in a cellar with wood panelling and a suit of armour and a winding carpeted wooden staircase
negative prompt
flash photography, cartoon, computer graphics, drawing, animation, anime
This is the output
https://github.com/user-attachments/assets/efca7156-0444-49cd-ba67-a3805df07723
From this input
https://github.com/user-attachments/assets/ac648752-ca61-4f26-b1fe-cf3097ab56c2
using this script
photoreal_78_frame.json
What could make the lipsync work, and how could I get the same look for the duration of the video?
Sam