MushroomFleet opened 1 month ago
On the suggestion of others in the GitHub discussions here, I added the 7.0 CFG example:
Until I make a video on it, this is the best I could do :)
If anybody was struggling to find complex video prompts for testing, I created this LLM template that uses vision to look at an image and give us a prompt that works well with video generators. It is an open template tuned for CogVideo, but it works well with Mochi too; you may want to customise it somewhat, but it serves the purpose well. If you do not use an LLM like Llava-OneVision-Qwen2 in ComfyUI, you can simply use this prompt with Claude/ChatGPT and the image of your choice;
template:
Analyze the given image and generate a detailed video prompt by: describing the main subject or action; detailing the setting and background; noting significant visual elements, colors, and textures; including information about lighting, time of day, or weather conditions if relevant; mentioning any movement, progression, or change in the scene; describing the mood or atmosphere; including sensory details beyond visual elements when appropriate; using vivid, descriptive language; maintaining a neutral tone without subjective interpretations; keeping the description under 256 tokens for clarity and conciseness; structure the output as follows: [Main subject/action]. [Setting and background details]. [Visual elements, colors, textures]. [Lighting, time, weather]. [Movement or progression]. [Mood/atmosphere]. [Additional sensory details if applicable].
hope that helps!
Donut-Mochi-848x480-batch16-v6
Decoder params:
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
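For readability, here are the same settings written out as a plain Python dict; the comments are my interpretation of what each parameter controls, not official documentation:

```python
# Tiled-decode settings used in the Donut-Mochi-848x480-batch16-v6 workflow.
# Comments describe the usual meaning of these diffusers-style tiling params.
mochi_decode_settings = {
    "frame_batch_size": 16,             # frames decoded per temporal chunk
    "tile_sample_min_width": 144,       # spatial tile width threshold before tiling kicks in
    "tile_sample_min_height": 80,       # spatial tile height threshold before tiling kicks in
    "tile_overlap_factor_height": 0.3,  # fraction of each tile's height shared with neighbours
    "tile_overlap_factor_width": 0.3,   # fraction of each tile's width shared with neighbours
}
```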
After some others in these threads posted about increasing CFG to 7.0, I carried on testing and discovered that the SD3.5 T5XXL_FP8_e4m3fn_scaled CLIP model creates excellent videos at a small cost in speed (25 minutes).
https://civitai.com/posts/8379628
Testing continues, but this seems to be the best-quality setup I have found running on a 4090. Because I have a good CPU and 64 GB of RAM, next I'll be forcing the T5XXL_FP16 onto the CPU (Extra Models nodes).
Results will be posted as soon as I have enough to showcase, with the workflows included in the pack.
Results Gallery here: https://civitai.com/posts/8414644 (19 videos)
I tested the current setup with GGUF Q8_0 + T5XXL_FP16 (CLIP forced onto the CPU). Full configs are available in the workflow; find it in the pack from version 5 onwards:
\GGUF-Q8_0--T5-FP16-CPU\Donut-Mochi-848x480-GGUF-Q8_0-CPU_T5-FP16-v14.json
I have included all the experimental workflows, and I'm only using 50 steps with PyTorch SDP attention. There are three 100-step videos in the gallery, but anyone can add more steps to gain some quality at the expense of time taken.
24 Minutes for 6 Seconds at 24FPS on RTX 4090.
Very happy with the results. Thanks Kijai for all your work on the wrapper; I hope these configs help others get the most from this local video model!
NOTE: I use a prompt list node I wrote to feed my prompts into the CLIP Text Encoder. You can simply remove it if you like, but ComfyUI Manager will fetch my nodes and you can add your own prompt lists; it saves me a lot of time, so it's up to you.
@MushroomFleet your gallery shows pretty strong ghosting. This is caused by tiled VAE decoding. You can experiment with settings by saving some latents (use SaveLatent node from Comfy core), then move them from "output" to "input" directory, then you can load them (LoadLatent node) in a separate workflow and only work on the "Mochi Decode" node settings. I've tried generating short videos, which can be decoded without tiling and the ghosting disappears completely:
https://github.com/user-attachments/assets/e51e746f-1f59-481f-9916-85a674d63d7b
Compare to tiled decoding here:
https://github.com/user-attachments/assets/834b5de3-186c-4836-b41e-416eafddfe41
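For anyone following along, the shuffle between runs is just a file move between ComfyUI's output and input folders; a minimal Python sketch, assuming a default ComfyUI folder layout (adjust the paths to your own install):

```python
import shutil
from pathlib import Path

# Illustrative paths; point these at your own ComfyUI install.
comfy_root = Path("ComfyUI")
output_dir = comfy_root / "output"
input_dir = comfy_root / "input"

# Move every saved .latent from output/ to input/ so the LoadLatent node can see it.
for latent_file in list(output_dir.glob("**/*.latent")):
    destination = input_dir / latent_file.name
    shutil.move(str(latent_file), str(destination))
    print(f"moved {latent_file} -> {destination}")
```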
This has been a community-led research effort, so I like to open-source my findings in the spirit of progress! Thanks so much for the tip on this; I think Kijai mentioned it somewhere, but I forgot all about it :)
I'll add this to my testing list. Thanks again!! I only just started testing 25-frame jobs, so this will be a big improvement.
Big thanks!
I ran into an issue using OneVision with my LLM prompt above: it would generate 512 tokens, and setting it to 256 did not limit it, so I adjusted the prompt above to reflect the limit; it seems to work great.
Issue here: https://github.com/kijai/ComfyUI-MochiWrapper/issues/34. Not sure if anything can be done; I will experiment with prompt scheduling next, to potentially swap out prompts by step count.
As I automated image-to-prompt using Llava-OneVision-Qwen2 with the prompt above, I decided I wanted a node that automatically detects tall or wide images, then sends the height and width to a resize node set to crop/fill and also to the KSampler. More testing is being done, but it works great. I used the V2 version of this node in image generation to find the best safe divisible-by-64 dimensions based on the model we are using (e.g. SDXL/SD3 = 1024x1024).
Selecting Mochi1 sets the default max resolution to 848x480, but at present it only does 16:9 or 9:16.
Vertical video may turn out to be a bad idea, but I still think an image conformer that automatically brings images into safe dimensions can be beneficial with the right settings.
Note: the downscale factor is the "Divisible by X" calculation exposed for experimentation.
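As a rough illustration of the idea (my own sketch, not the node's actual code), this snaps an arbitrary image to a model's maximum resolution in the matching orientation while keeping both sides divisible by a chosen factor; the Mochi defaults below come from the 848x480 figure above:

```python
def conform_dimensions(width, height, max_long=848, max_short=480, divisor=16):
    """Pick crop/fill target dimensions for an arbitrary input image.

    Orients the model's max resolution to match the image (wide vs. tall),
    then rounds both sides down to the nearest multiple of `divisor`.
    Illustrative only; the Image Size Adjuster node may work differently.
    """
    if width >= height:
        target_w, target_h = max_long, max_short   # landscape -> 16:9
    else:
        target_w, target_h = max_short, max_long   # portrait -> 9:16
    return (target_w // divisor) * divisor, (target_h // divisor) * divisor

print(conform_dimensions(1920, 1080))  # -> (848, 480)
print(conform_dimensions(1080, 1920))  # -> (480, 848)
```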
(use SaveLatent node from Comfy core), then move them from "output" to "input" directory, then you can load them (LoadLatent node)
This works great, so I decided to build a new Latent Load node so we don't have to stop halfway through a workflow and copy files from output to input. It seems those core nodes have been in beta for 6 months and are likely to be updated, but in the meantime:
Using my Project File Path Generator node (renamed "Save Latent" in the screenshot), you can organise where the latent is saved inside the output folder; I use this node all the time to organise the saving of my outputs. It can be used with the Comfy Core SaveLatent node (nothing fancy here really, just nice for organisation).
I decided to write DJZ-LoadLatent.
It's pretty much the same code as the one in Comfy Core, however I made a few changes:
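For context, here is a minimal sketch of what a LoadLatent variant that reads from the output folder could look like, using the standard ComfyUI custom-node API; this is my own illustration of the approach, not the actual DJZ-LoadLatent code:

```python
import os
import safetensors.torch
import folder_paths  # ComfyUI helper module, available to custom nodes at runtime


class LoadLatentFromOutput:
    """Like Comfy Core's LoadLatent, but lists .latent files from the output
    directory (recursively), so nothing has to be copied into input/."""

    @classmethod
    def INPUT_TYPES(cls):
        output_dir = folder_paths.get_output_directory()
        latents = []
        for root, _, files in os.walk(output_dir):
            for f in files:
                if f.endswith(".latent"):
                    latents.append(os.path.relpath(os.path.join(root, f), output_dir))
        return {"required": {"latent": (sorted(latents),)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "load"
    CATEGORY = "latent"

    def load(self, latent):
        latent_path = os.path.join(folder_paths.get_output_directory(), latent)
        data = safetensors.torch.load_file(latent_path, device="cpu")
        # Mirror Comfy Core's handling of the older, scaled latent format.
        multiplier = 1.0 if "latent_format_version_0" in data else 1.0 / 0.18215
        return ({"samples": data["latent_tensor"].float() * multiplier},)
```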
DJZ Latent Load Node:
Here I have the decoder stage disabled while I generate .latent files (saved to the output folder with my path node).
Here I have the sampler stage disabled while I use the decoder with the DJZ-LoadLatent node.
NOTE: the sampler params in the screenshots are bogus (too few steps and frames, just to speed up testing). With the path node, your .latent files are kept in the same path structure your video outputs would normally use.
OK, so the first run took 585 seconds (it's downloading all the models etc.), but that's fine; if you use network storage, all that time is only spent on the first run.
What am I doing? I'm using RunPod only to cook my latents into video.
Why? Because I can do the sampling-to-.latent part on my PC overnight (saving RunPod time). If it takes 10 minutes to save the .latent and only 30 seconds to cook a .latent into video, that is a HUGE saving, and you can't run out of memory with 48 GB.
https://runpod.io/console/deploy?template=egyeo55x8w&ref=0czffee4 ComfyUI Mochi Runpod template
I'm using it like this:
The first run takes a long time, so don't forget to connect your network storage to skip that wait on subsequent runs.
This way you can take advantage of the L40's power to crush those latents into video in record time, reducing the cost of the RunPod session by roughly 95%.
https://i.gyazo.com/d8b3f4e54b0b8642f3e430cfa3b2f2d3.mp4 <- decoding process takes 15 seconds.
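As a rough sanity check on that figure, using the ~24-minute local sampling time from earlier and the ~15-second decode above (approximate numbers, ignoring pod start-up overhead):

```python
full_run_seconds = 24 * 60   # ~24 minutes if sampling also ran on the rented GPU
decode_only_seconds = 15     # decode-only time observed on the pod

saving = 1 - decode_only_seconds / full_run_seconds
print(f"paid GPU time reduced by ~{saving:.0%}")  # ~99%, so 95% is if anything conservative
```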
I created 48 videos overnight, so there will be a large gallery in my next report.
V6 Workflow pack gallery https://civitai.com/posts/8455626 <- 20 videos
Doing it this way saved hours of RunPod costs; thanks to everyone involved!! I've added my results from the last post here. Enjoy!
Still more to test, so watch this space.
Roundup of the research so far, with some more detailed instructions/info: https://www.youtube.com/watch?v=DYbSOJrOAqE
I'm not sure if you noticed, but I have added a different way of doing the tiled decoding which seems to work better; with 8x8 tiles I can do 163 frames within 24 GB without any temporal tiling:
The node has the option to do the decoding in chunks (per_batch), but it should be a last resort, as it causes the stuttering/frame skipping since the VAE is not meant to be used in chunks.
https://github.com/user-attachments/assets/23cee149-34b4-46e2-b8ed-9cc96e7c7532
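For anyone curious why spatial-only tiling avoids the stutter: each tile keeps the full frame range, so no frame is ever decoded in temporal chunks; only space is split, and the overlaps are blended back together. A simplified sketch of that idea (my own illustration with a hypothetical vae.decode, not the wrapper's actual implementation):

```python
import torch

def feather(h, w, ramp_h, ramp_w):
    """Weight mask: 1.0 in the interior, fading (but staying > 0) towards the edges."""
    def ramp(n, r):
        v = torch.ones(n)
        if r > 0 and n > 2 * r:
            fade = torch.linspace(0.05, 1.0, r)
            v[:r], v[n - r:] = fade, fade.flip(0)
        return v
    return ramp(h, ramp_h).view(1, 1, 1, -1, 1) * ramp(w, ramp_w).view(1, 1, 1, 1, -1)

def decode_spatially_tiled(vae, latent, tiles=8, overlap=0.25, scale=8):
    """Decode a video latent of shape (B, C, T, H, W) tile by tile in space only.

    Every tile is decoded with the full frame range T intact, so there is no
    temporal chunking (the cause of the stutter). `vae.decode` is a hypothetical
    stand-in and `scale` the latent-to-pixel factor; illustration only.
    """
    B, C, T, H, W = latent.shape
    tile_h, tile_w = H // tiles, W // tiles
    pad_h, pad_w = int(tile_h * overlap), int(tile_w * overlap)

    out = torch.zeros(B, 3, T, H * scale, W * scale)
    weight = torch.zeros_like(out)

    for ty in range(tiles):
        for tx in range(tiles):
            # Tile bounds in latent space, expanded by the overlap margin.
            y0 = max(ty * tile_h - pad_h, 0)
            y1 = H if ty == tiles - 1 else (ty + 1) * tile_h + pad_h
            x0 = max(tx * tile_w - pad_w, 0)
            x1 = W if tx == tiles - 1 else (tx + 1) * tile_w + pad_w

            decoded = vae.decode(latent[:, :, :, y0:y1, x0:x1])  # all T frames at once
            h_pix, w_pix = decoded.shape[-2], decoded.shape[-1]

            # Feathered blend so overlapping tile borders average out smoothly.
            mask = feather(h_pix, w_pix, pad_h * scale, pad_w * scale)
            out[..., y0 * scale:y0 * scale + h_pix, x0 * scale:x0 * scale + w_pix] += decoded * mask
            weight[..., y0 * scale:y0 * scale + h_pix, x0 * scale:x0 * scale + w_pix] += mask

    return out / weight.clamp(min=1e-6)
```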
https://github.com/kijai/ComfyUI-MochiWrapper/issues/50 haha, I just filed this as I returned today and could not find that node anywhere; I updated the wrapper and Comfy, so I'm a bit confused now :D
I think I solved that; for some reason it would not update because it thought I had changed the files on my end. Reinstalling the nodes solved it.
EDIT: Outstanding results with the VAE Spatial Tiling node; I'm queuing up a load of tests tonight, which should make for a nice update! Thanks again for all your hard work, Kijai!!
Updated my workflow pack to use the newer VAE Spatial Tiling decoder! This can run 100% on a local GPU, and all the demo videos in the gallery used only 50 steps (100 steps were used in the V6 gallery). Another significant upgrade 💯
https://civitai.com/posts/8613066 <- V7 gallery
I wrote a tool called Shuffle Video Studio; it can convert (matching-dimension) outputs into the split-clip format it reads, for shuffling and joining into montage videos. 100% AI video + 100% AI editing + 100% AI music = let's go!!
I noticed there is a VAE encoder now; this should make video-to-video possible? Maybe.
Until I have some results, here is a showcase video built with ACE-HoloFS video PromptGen, using mostly a single Mochi prompt at 100 steps for 3 seconds per segment, showcasing the outputs we can easily achieve on a local GPU!
"Golden Son" Mochi1-Preview AMV https://www.youtube.com/watch?v=xmJI6aKd9P0
OK, so my i2v workflow was built with future-proofing this feature in mind; I had hoped we would get an encoder, so I'm glad it's here!!
This is my image input:
I use my Image Size Adjuster V3 with the Mochi option selected to ensure the best quality at Mochi's required native resolution (848x480) in 16:9 AR:
which sends this Fill/Crop resized image into the VAE Encoder:
from there we can use Llava-OneVision-Qwen2 to interrogate the image:
with this prompt:
Create video prompt <256 tokens: [MOTION: main vector/speed] + [DYNAMICS: key element movements] + [ENVIRONMENT: background motion] + [FLOW: scene progression] + [VISUALS: motion-relevant details] + [DIRECTION: momentum cues], maintain source image coherence.
This can still result in a prompt of over 256 tokens, so I provided a manual prompt section, where I am using this to drive the image-to-video:
[Static to dynamic scene]: Three figures - armored warrior(L), seated robed man(C), angelic figure(R). Motion: warrior leans forward with spear, robed man's hands gesture expressively during conversation, angel remains still observing. Lighting subtly shifts, colors remain vivid (pink/gold armor, blue robe, white dress). Focus on facial expressions and hand movements progressing through dialogue.
Then I am using the following setup to get my outputs. Encoding settings:
Workflow showing the two separate stages (Fast Group Bypass switched):
Decoder settings:
Just like all good i2i workflows, the denoise determines the model strength for the image-to-video:
at 1.00 it prompts the model like a pure text-to-video generation, so you will see no similarity to the image input;
at 0.75 (the default i2i denoise) you can see some similarity, but a strong bias towards the model weights;
at 0.5 it should be roughly a 50/50 balance between the model (prompt) and the image input;
at 0.25 we should see a stronger bias towards the input image, however the model strength is now very weak.
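Conceptually this is the standard img2img trick applied to video: encode the image once, tile the latent across the frames, noise it to the chosen denoise level, and only run that remaining portion of the schedule. A rough sketch under those assumptions (hypothetical vae_encode/sampler helpers and a simplified linear noising step, not the wrapper's actual API):

```python
import torch

def image_to_video_start(vae_encode, sampler, image, prompt,
                         num_frames=49, steps=50, denoise=0.5):
    """img2img-style initialisation for video, as a conceptual sketch.

    `vae_encode(image)` -> latent of shape (C, H, W) and `sampler(...)` are
    hypothetical stand-ins. denoise=1.0 ignores the image entirely; lower
    values keep more of it.
    """
    image_latent = vae_encode(image)                                      # (C, H, W)
    video_latent = image_latent.unsqueeze(1).repeat(1, num_frames, 1, 1)  # (C, T, H, W)

    # Start the schedule part-way through: mix in noise proportional to denoise.
    # (Real schedulers noise via sigmas; a linear mix is shown for clarity.)
    noise = torch.randn_like(video_latent)
    start_latent = (1.0 - denoise) * video_latent + denoise * noise

    # Only the final `denoise` fraction of the steps is actually sampled.
    steps_to_run = max(1, int(steps * denoise))
    return sampler(start_latent, prompt, steps=steps_to_run)
```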
In my testing I found that the 0.5 value was excellent for animating the image input:
0.75 = https://i.gyazo.com/0454f9a92f681aa1e7b54fba6710bac5.mp4
0.5 = https://i.gyazo.com/aa0a6b4c4a4aeb3b1473791d5124cea0.mp4
0.5 matched the input image very well; however, the actual motion you can get is a case of "your mileage may vary".
Options to increase motion are "change the prompt" and/or "allow more denoise".
I hope I have wrangled OneVision into playing nice and only outputting 256 tokens max; however, it does occasionally break the rules, so there is a primitive node which can be used to copy the offending vision prompt (it should encourage motion) and cut it down a bit for use. You can remove OneVision if you like; it's purely used as a vision interrogator with LLM features to create motion from stills:
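If you want to check or hard-trim a vision prompt outside the graph, something like this works; it uses the Hugging Face T5 tokenizer as a rough stand-in for the workflow's T5XXL token count (an assumption, not the exact tokenisation ComfyUI applies):

```python
from transformers import AutoTokenizer

# google/t5-v1_1-xxl as a rough proxy for the T5XXL text encoder's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

def trim_prompt(prompt: str, max_tokens: int = 256) -> str:
    """Count tokens and hard-truncate if the vision model overshoots the limit."""
    ids = tokenizer(prompt, truncation=True, max_length=max_tokens)["input_ids"]
    print(f"{len(ids)} tokens after truncation")
    return tokenizer.decode(ids, skip_special_tokens=True)
```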
pushed new image to video workflow using the latest VAE Encoder method to my Master Workflow pack: https://github.com/MushroomFleet/DJZ-Workflows/tree/main/Donut-Mochi-Video/True-Image-To-Video
To aid people who use the .latent save node, I have completed my "DJZ Load Latent V2" node.
This allows incremental batch decoding, for anyone who finds it useful.
I tried the true i2v workflow, but I don't think that is what people think of as "img2vid." This seems to use the image as a base for every frame of the video, so it is quite a static video with just some moving parts, whereas what we really want is to use an image as the first and/or last frame and let the model turn that into natural video.
This gives you a VAE image encoder; that is the definition of "image to video".
Things you can try: set the denoise to between 0.5 and 0.75 (the i2i defaults, as I explained). It will work, but you can't expect it to be automatic ;)~ You might also experiment with altering the CFG and step count to gain more motion; if your prompt is no good, you will get zero motion.
Also, some images are "blind spots", i.e. they are not represented well in the corpus of training data. This is the same reason some prompts produce bad results: the model cannot know things it was never taught. Mochi excels at realistic video generation and is less apt with surreal/abstract styles; this is a challenge for model engineers to solve.
I tuned my prompt generation on the example prompts given for CogVideo, so it's doing a good job, but as stated in this reply, there is no one-size-fits-all solution.
The "first and last frame" feature is really a RunwayML programmatic solution, and part of their cloud service...
People think of "Rollerblades" as "inline skates", but "Rollerblade" is a company brand name. Hoovers = vacuum cleaners, so what is a Dyson? etc. i2v is what it is: image to video.
There are many custom nodes that can get you the feature you want in ComfyUI. "Scheduling nodes" might give you the potential for "last frame", although timestep tricks can also get you there (running generations in reverse).
First-and-last-frame is a "solution" to the challenge/problem, not a replacement for the technical definition of "i2v". i2i uses a VAE encoder; i2v also uses a VAE encoder.
But is the sampler doing img2img for every frame of the video, or just giving it the encoded image plus noise to begin the sampling and then letting it run free?
First, thanks for all your hard work. You mentioned needing help from the community in researching the best params for configs;
I wrote up this article: https://civitai.com/articles/8313 and released this workflow pack: https://civitai.com/models/886896
It contains the default workflow, with 8 variants showing different tiling/batch configs.
I'm investigating the differences between tiling and batch results. I included my outputs (labelled) in the gallery for the V1 workflow pack release, which included many variations on manual batch/tiling.
I am interested to know if there is a better place I can share this work with you, as this is not a bug report per se. I think batch 16 with 0.3 overlap is nice, but I will be doing much more testing!
Thanks for all the hard work!! I really appreciated that the samples are kept in memory, allowing tweaks to the decoder only; it required just 17 seconds to complete the tiling with the same seed :)