MushroomFleet opened 1 month ago
On the suggestion of others in the GitHub discussions here, I added the 7.0 CFG example:
Until I make a video on it, this is the best I could do :)
If anybody was struggling to find complex video prompts for testing, I created this LLM template that uses vision to look at an image and give us a prompt that works well with video generators. It is an open template tuned for CogVideo, but it works well with Mochi too; you may want to customise it somewhat, but it serves the purpose well. If you do not use an LLM like Llava-OneVision-Qwen2 in ComfyUI, you can simply use this prompt with Claude/ChatGPT and the image of your choice;
template:
Analyze the given image and generate a detailed video prompt by: describing the main subject or action; detailing the setting and background; noting significant visual elements, colors, and textures; including information about lighting, time of day, or weather conditions if relevant; mentioning any movement, progression, or change in the scene; describing the mood or atmosphere; including sensory details beyond visual elements when appropriate; using vivid, descriptive language; maintaining a neutral tone without subjective interpretations; keeping the description under 256 tokens for clarity and conciseness; structure the output as follows: [Main subject/action]. [Setting and background details]. [Visual elements, colors, textures]. [Lighting, time, weather]. [Movement or progression]. [Mood/atmosphere]. [Additional sensory details if applicable].
hope that helps!
Donut-Mochi-848x480-batch16-v6
Decoder params:
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
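For readability, here are the same settings written out as a plain Python dict; the comments are my interpretation of what each parameter controls, not official documentation:

```python
# Tiled-decode settings used in the Donut-Mochi-848x480-batch16-v6 workflow.
# Comments describe the usual meaning of these diffusers-style tiling params.
mochi_decode_settings = {
    "frame_batch_size": 16,             # frames decoded per temporal chunk
    "tile_sample_min_width": 144,       # spatial tile width threshold before tiling kicks in
    "tile_sample_min_height": 80,       # spatial tile height threshold before tiling kicks in
    "tile_overlap_factor_height": 0.3,  # fraction of each tile's height shared with neighbours
    "tile_overlap_factor_width": 0.3,   # fraction of each tile's width shared with neighbours
}
```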
After some others in these threads posted about increasing CFG to 7.0, I carried on testing and discovered that the SD3.5 T5XXL_FP8_e4m3fn_scaled CLIP model creates excellent videos at a small cost in speed (25 minutes).
https://civitai.com/posts/8379628
Testing continues, but this seems to be the best-quality setup I have found running on a 4090. Because I have a good CPU and 64 GB of RAM, next I'll be forcing the T5XXL_FP16 onto the CPU (Extra Models nodes).
Results will be posted as soon as I have enough to showcase, with the workflows included in the pack.
Results Gallery here: https://civitai.com/posts/8414644 (19 videos)
I tested the current setup with GGUF Q8_0 + T5XXL_FP16 (CLIP forced onto the CPU). Full configs are available in the workflow; find it in the pack from version 5 onwards:
\GGUF-Q8_0--T5-FP16-CPU\Donut-Mochi-848x480-GGUF-Q8_0-CPU_T5-FP16-v14.json
I have included all the experimental workflows, and I'm only using 50 steps with PyTorch SDP attention. There are three 100-step videos in the gallery, but anyone can add more steps to gain some quality at the expense of time taken.
24 Minutes for 6 Seconds at 24FPS on RTX 4090.
Very happy with the results. Thanks Kijai for all your work on the wrapper; I hope these configs help others get the most from this local video model!
NOTE: I use a prompt list node I wrote to feed my prompts into the CLIP Text Encoder. You can simply remove it if you like, but ComfyUI Manager will fetch my nodes and you can add your own prompt lists; it saves me a lot of time, so it's up to you.
@MushroomFleet your gallery shows pretty strong ghosting. This is caused by tiled VAE decoding. You can experiment with settings by saving some latents (use SaveLatent node from Comfy core), then move them from "output" to "input" directory, then you can load them (LoadLatent node) in a separate workflow and only work on the "Mochi Decode" node settings. I've tried generating short videos, which can be decoded without tiling and the ghosting disappears completely:
https://github.com/user-attachments/assets/e51e746f-1f59-481f-9916-85a674d63d7b
Compare to tiled decoding here:
https://github.com/user-attachments/assets/834b5de3-186c-4836-b41e-416eafddfe41
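For anyone following along, the shuffle between runs is just a file move between ComfyUI's output and input folders; a minimal Python sketch, assuming a default ComfyUI folder layout (adjust the paths to your own install):

```python
import shutil
from pathlib import Path

# Illustrative paths; point these at your own ComfyUI install.
comfy_root = Path("ComfyUI")
output_dir = comfy_root / "output"
input_dir = comfy_root / "input"

# Move every saved .latent from output/ to input/ so the LoadLatent node can see it.
for latent_file in list(output_dir.glob("**/*.latent")):
    destination = input_dir / latent_file.name
    shutil.move(str(latent_file), str(destination))
    print(f"moved {latent_file} -> {destination}")
```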
This has been a community-led research effort, so I like to open-source my findings in the spirit of progress! Thanks so much for the tip on this; I think Kijai mentioned it somewhere, but I forgot all about it :)
I'll add this to my testing list. Thanks again!! I only just started testing 25-frame jobs, so this will be a big improvement.
Big thanks!
I ran into an issue using OneVision with my LLM prompt above: it would generate 512 tokens, and setting it to 256 did not limit it, so I adjusted the prompt above to reflect the limit; it seems to work great.
Issue here: https://github.com/kijai/ComfyUI-MochiWrapper/issues/34. Not sure if anything can be done; I will experiment with prompt scheduling next, to potentially swap out prompts by step count.
As I automated image-to-prompt using Llava-OneVision-Qwen2 with the prompt above, I decided I wanted a node that automatically detects tall or wide images, then sends the height and width to a resize node set to crop/fill and also to the KSampler. More testing is being done, but it works great. I used the V2 version of this node in image generation to find the best safe divisible-by-64 dimensions based on the model we are using (e.g. SDXL/SD3 = 1024x1024).
Selecting Mochi1 sets the default max resolution to 848x480, but at present it only does 16:9 or 9:16.
Vertical video may turn out to be a bad idea, but I still think an image conformer that automatically brings images into safe dimensions can be beneficial with the right settings.
Note: the downscale factor is the "Divisible by X" calculation exposed for experimentation.
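As a rough illustration of the idea (my own sketch, not the node's actual code), this snaps an arbitrary image to a model's maximum resolution in the matching orientation while keeping both sides divisible by a chosen factor; the Mochi defaults below come from the 848x480 figure above:

```python
def conform_dimensions(width, height, max_long=848, max_short=480, divisor=16):
    """Pick crop/fill target dimensions for an arbitrary input image.

    Orients the model's max resolution to match the image (wide vs. tall),
    then rounds both sides down to the nearest multiple of `divisor`.
    Illustrative only; the Image Size Adjuster node may work differently.
    """
    if width >= height:
        target_w, target_h = max_long, max_short   # landscape -> 16:9
    else:
        target_w, target_h = max_short, max_long   # portrait -> 9:16
    return (target_w // divisor) * divisor, (target_h // divisor) * divisor

print(conform_dimensions(1920, 1080))  # -> (848, 480)
print(conform_dimensions(1080, 1920))  # -> (480, 848)
```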
(use SaveLatent node from Comfy core), then move them from "output" to "input" directory, then you can load them (LoadLatent node)
This works great, so I decided to build a new Latent Load node so we don't have to stop halfway through a workflow and copy files from output to input. It seems those core nodes have been in beta for 6 months and are likely to be updated, but in the meantime:
Using my Project File Path Generator node (renamed "Save Latent" in the screenshot), you can organise where the latent is saved inside the output folder; I use this node all the time to organise the saving of my outputs. It can be used with the Comfy Core SaveLatent node (nothing fancy here really, just nice for organisation).
I decided to write DJZ-LoadLatent.
It's pretty much the same code as the one in Comfy Core, however I made a few changes:
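For context, here is a minimal sketch of what a LoadLatent variant that reads from the output folder could look like, using the standard ComfyUI custom-node API; this is my own illustration of the approach, not the actual DJZ-LoadLatent code:

```python
import os
import safetensors.torch
import folder_paths  # ComfyUI helper module, available to custom nodes at runtime


class LoadLatentFromOutput:
    """Like Comfy Core's LoadLatent, but lists .latent files from the output
    directory (recursively), so nothing has to be copied into input/."""

    @classmethod
    def INPUT_TYPES(cls):
        output_dir = folder_paths.get_output_directory()
        latents = []
        for root, _, files in os.walk(output_dir):
            for f in files:
                if f.endswith(".latent"):
                    latents.append(os.path.relpath(os.path.join(root, f), output_dir))
        return {"required": {"latent": (sorted(latents),)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "load"
    CATEGORY = "latent"

    def load(self, latent):
        latent_path = os.path.join(folder_paths.get_output_directory(), latent)
        data = safetensors.torch.load_file(latent_path, device="cpu")
        # Mirror Comfy Core's handling of the older, scaled latent format.
        multiplier = 1.0 if "latent_format_version_0" in data else 1.0 / 0.18215
        return ({"samples": data["latent_tensor"].float() * multiplier},)
```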
DJZ Latent Load Node:
Here I have the decoder stage disabled while I generate .latent files (saved to the output folder with my path node).
Here I have the sampler stage disabled while I use the decoder with the DJZ-LoadLatent node.
NOTE: the sampler params in the screenshots are bogus (too few steps and frames, just to speed up testing). With the path node, your .latent files are kept in the same path structure your video outputs would normally use.
OK, so the first run took 585 seconds (it's downloading all the models etc.), but that's fine; if you use network storage, all that time is only spent on the first run.
What am I doing? I'm using RunPod only to cook my latents into video.
Why? Because I can do the sampling-to-.latent part on my PC overnight (saving RunPod time). If it takes 10 minutes to save the .latent and only 30 seconds to cook a .latent into video, that is a HUGE saving, and you can't run out of memory with 48 GB.
https://runpod.io/console/deploy?template=egyeo55x8w&ref=0czffee4 ComfyUI Mochi Runpod template
I'm using it like this:
The first run takes a long time, so don't forget to connect your network storage to skip that wait on subsequent runs.
This way you can take advantage of the L40's power to crush those latents into video in record time, reducing the cost of the RunPod session by roughly 95%.
https://i.gyazo.com/d8b3f4e54b0b8642f3e430cfa3b2f2d3.mp4 <- decoding process takes 15 seconds.
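As a rough sanity check on that figure, using the ~24-minute local sampling time from earlier and the ~15-second decode above (approximate numbers, ignoring pod start-up overhead):

```python
full_run_seconds = 24 * 60   # ~24 minutes if sampling also ran on the rented GPU
decode_only_seconds = 15     # decode-only time observed on the pod

saving = 1 - decode_only_seconds / full_run_seconds
print(f"paid GPU time reduced by ~{saving:.0%}")  # ~99%, so 95% is if anything conservative
```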
I created 48 videos overnight, so there will be a large gallery in my next report.
V6 Workflow pack gallery https://civitai.com/posts/8455626 <- 20 videos
Doing it this way saved hours of RunPod costs; thanks to everyone involved!! I've added my results from the last post here. Enjoy!
Still more to test, so watch this space.
Roundup of the research so far, with some more detailed instructions/info: https://www.youtube.com/watch?v=DYbSOJrOAqE
I'm not sure if you noticed, but I have added a different way of doing the tiled decoding which seems to work better; with 8x8 tiles I can do 163 frames within 24 GB without any temporal tiling:
The node has the option to do the decoding in chunks (per_batch), but it should be a last resort, as it causes the stuttering/frame skipping since the VAE is not meant to be used in chunks.
https://github.com/user-attachments/assets/23cee149-34b4-46e2-b8ed-9cc96e7c7532
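For anyone curious why spatial-only tiling avoids the stutter: each tile keeps the full frame range, so no frame is ever decoded in temporal chunks; only space is split, and the overlaps are blended back together. A simplified sketch of that idea (my own illustration with a hypothetical vae.decode, not the wrapper's actual implementation):

```python
import torch

def feather(h, w, ramp_h, ramp_w):
    """Weight mask: 1.0 in the interior, fading (but staying > 0) towards the edges."""
    def ramp(n, r):
        v = torch.ones(n)
        if r > 0 and n > 2 * r:
            fade = torch.linspace(0.05, 1.0, r)
            v[:r], v[n - r:] = fade, fade.flip(0)
        return v
    return ramp(h, ramp_h).view(1, 1, 1, -1, 1) * ramp(w, ramp_w).view(1, 1, 1, 1, -1)

def decode_spatially_tiled(vae, latent, tiles=8, overlap=0.25, scale=8):
    """Decode a video latent of shape (B, C, T, H, W) tile by tile in space only.

    Every tile is decoded with the full frame range T intact, so there is no
    temporal chunking (the cause of the stutter). `vae.decode` is a hypothetical
    stand-in and `scale` the latent-to-pixel factor; illustration only.
    """
    B, C, T, H, W = latent.shape
    tile_h, tile_w = H // tiles, W // tiles
    pad_h, pad_w = int(tile_h * overlap), int(tile_w * overlap)

    out = torch.zeros(B, 3, T, H * scale, W * scale)
    weight = torch.zeros_like(out)

    for ty in range(tiles):
        for tx in range(tiles):
            # Tile bounds in latent space, expanded by the overlap margin.
            y0 = max(ty * tile_h - pad_h, 0)
            y1 = H if ty == tiles - 1 else (ty + 1) * tile_h + pad_h
            x0 = max(tx * tile_w - pad_w, 0)
            x1 = W if tx == tiles - 1 else (tx + 1) * tile_w + pad_w

            decoded = vae.decode(latent[:, :, :, y0:y1, x0:x1])  # all T frames at once
            h_pix, w_pix = decoded.shape[-2], decoded.shape[-1]

            # Feathered blend so overlapping tile borders average out smoothly.
            mask = feather(h_pix, w_pix, pad_h * scale, pad_w * scale)
            out[..., y0 * scale:y0 * scale + h_pix, x0 * scale:x0 * scale + w_pix] += decoded * mask
            weight[..., y0 * scale:y0 * scale + h_pix, x0 * scale:x0 * scale + w_pix] += mask

    return out / weight.clamp(min=1e-6)
```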
https://github.com/kijai/ComfyUI-MochiWrapper/issues/50 haha, I just filed this as I returned today and could not find that node anywhere; I updated the wrapper and Comfy, so I'm a bit confused now :D
I think I solved that; for some reason it would not update because it thought I had changed the files on my end. Reinstalling the nodes solved it.
EDIT: Outstanding results with the VAE Spatial Tiling node; I'm queuing up a load of tests tonight, which should make for a nice update! Thanks again for all your hard work, Kijai!!
Updated my workflow pack to use the newer VAE Spatial Tiling decoder! This can run 100% on a local GPU, and all the demo videos in the gallery used only 50 steps (100 steps were used in the V6 gallery). Another significant upgrade 💯
https://civitai.com/posts/8613066 <- V7 gallery
I wrote a tool called Shuffle Video Studio; it can convert (matching-dimension) outputs into the split-clip format it reads, for shuffling and joining into montage videos. 100% AI video + 100% AI editing + 100% AI music = let's go!!
I noticed there is a VAE encoder now; this should make video-to-video possible? Maybe.
Until I have some results, here is a showcase video built with ACE-HoloFS video PromptGen, using mostly a single Mochi prompt at 100 steps for 3 seconds per segment, showcasing the outputs we can easily achieve on a local GPU!
"Golden Son" Mochi1-Preview AMV https://www.youtube.com/watch?v=xmJI6aKd9P0
OK, so my i2v workflow was built with future-proofing this feature in mind; I had hoped we would get an encoder, so I'm glad it's here!!
This is my image input:
I use my Image Size Adjuster V3 with the Mochi option selected to ensure the best quality at Mochi's required native resolution (848x480) in 16:9 AR:
which sends this Fill/Crop resized image into the VAE Encoder:
from there we can use Llava-OneVision-Qwen2 to interrogate the image:
with this prompt:
Create video prompt <256 tokens: [MOTION: main vector/speed] + [DYNAMICS: key element movements] + [ENVIRONMENT: background motion] + [FLOW: scene progression] + [VISUALS: motion-relevant details] + [DIRECTION: momentum cues], maintain source image coherence.
This can still result in a prompt of over 256 tokens, so I provided a manual prompt section, where I am using this to drive the image-to-video:
[Static to dynamic scene]: Three figures - armored warrior(L), seated robed man(C), angelic figure(R). Motion: warrior leans forward with spear, robed man's hands gesture expressively during conversation, angel remains still observing. Lighting subtly shifts, colors remain vivid (pink/gold armor, blue robe, white dress). Focus on facial expressions and hand movements progressing through dialogue.
Then I am using the following setup to get my outputs. Encoding settings:
Workflow showing the two separate stages (Fast Group Bypass switched):
Decoder settings:
Just like all good i2i workflows, the denoise determines the model strength for the image-to-video:
at 1.00 it prompts the model like a pure text-to-video generation, so you will see no similarity to the image input;
at 0.75 (the default i2i denoise) you can see some similarity, but a strong bias towards the model weights;
at 0.5 it should be roughly a 50/50 balance between the model (prompt) and the image input;
at 0.25 we should see a stronger bias towards the input image, however the model strength is now very weak.
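Conceptually this is the standard img2img trick applied to video: encode the image once, tile the latent across the frames, noise it to the chosen denoise level, and only run that remaining portion of the schedule. A rough sketch under those assumptions (hypothetical vae_encode/sampler helpers and a simplified linear noising step, not the wrapper's actual API):

```python
import torch

def image_to_video_start(vae_encode, sampler, image, prompt,
                         num_frames=49, steps=50, denoise=0.5):
    """img2img-style initialisation for video, as a conceptual sketch.

    `vae_encode(image)` -> latent of shape (C, H, W) and `sampler(...)` are
    hypothetical stand-ins. denoise=1.0 ignores the image entirely; lower
    values keep more of it.
    """
    image_latent = vae_encode(image)                                      # (C, H, W)
    video_latent = image_latent.unsqueeze(1).repeat(1, num_frames, 1, 1)  # (C, T, H, W)

    # Start the schedule part-way through: mix in noise proportional to denoise.
    # (Real schedulers noise via sigmas; a linear mix is shown for clarity.)
    noise = torch.randn_like(video_latent)
    start_latent = (1.0 - denoise) * video_latent + denoise * noise

    # Only the final `denoise` fraction of the steps is actually sampled.
    steps_to_run = max(1, int(steps * denoise))
    return sampler(start_latent, prompt, steps=steps_to_run)
```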
In my testing I found that the 0.5 value was excellent for animating the image input:
0.75 = https://i.gyazo.com/0454f9a92f681aa1e7b54fba6710bac5.mp4
0.5 = https://i.gyazo.com/aa0a6b4c4a4aeb3b1473791d5124cea0.mp4
0.5 matched the input image very well; however, the actual motion you can get is a case of "your mileage may vary".
Options to increase motion are "change the prompt" and/or "allow more denoise".
I hope I have wrangled OneVision into playing nice and only outputting 256 tokens max; however, it does occasionally break the rules, so there is a primitive node which can be used to copy the offending vision prompt (it should encourage motion) and cut it down a bit for use. You can remove OneVision if you like; it's purely used as a vision interrogator with LLM features to create motion from stills:
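If you want to check or hard-trim a vision prompt outside the graph, something like this works; it uses the Hugging Face T5 tokenizer as a rough stand-in for the workflow's T5XXL token count (an assumption, not the exact tokenisation ComfyUI applies):

```python
from transformers import AutoTokenizer

# google/t5-v1_1-xxl as a rough proxy for the T5XXL text encoder's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

def trim_prompt(prompt: str, max_tokens: int = 256) -> str:
    """Count tokens and hard-truncate if the vision model overshoots the limit."""
    ids = tokenizer(prompt, truncation=True, max_length=max_tokens)["input_ids"]
    print(f"{len(ids)} tokens after truncation")
    return tokenizer.decode(ids, skip_special_tokens=True)
```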
pushed new image to video workflow using the latest VAE Encoder method to my Master Workflow pack: https://github.com/MushroomFleet/DJZ-Workflows/tree/main/Donut-Mochi-Video/True-Image-To-Video
To aid people who use the .latent save node, I have completed my "DJZ Load Latent V2" node.
This allows incremental batch decoding, for anyone who finds it useful.
I tried the true i2v workflow, but I don't think that is what people think of as "img2vid." This seems to use the image as a base for every frame of the video, so it is quite a static video with just some moving parts, whereas what we really want is to use an image as the first and/or last frame and let the model turn that into natural video.
This gives you a VAE image encoder; that is the definition of "image to video".
Things you can try: set the denoise to between 0.5 and 0.75 (the i2i defaults, as I explained). It will work, but you can't expect it to be automatic ;)~ You might also experiment with altering the CFG and step count to gain more motion; if your prompt is no good, you will get zero motion.
Also, some images are "blind spots", i.e. they are not represented well in the corpus of training data. This is the same reason some prompts produce bad results: the model cannot know things it was never taught. Mochi excels at realistic video generation and is less apt with surreal/abstract styles; this is a challenge for model engineers to solve.
I tuned my prompt generation on the example prompts given for CogVideo, so it's doing a good job, but as stated in this reply, there is no one-size-fits-all solution.
The "first and last frame" feature is really a RunwayML programmatic solution, and part of their cloud service...
People think of "Rollerblades" as "inline skates", but "Rollerblade" is a company brand name. Hoovers = vacuum cleaners, so what is a Dyson? etc. i2v is what it is: image to video.
There are many custom nodes that can get you the feature you want in ComfyUI. "Scheduling nodes" might give you the potential for "last frame", although timestep tricks can also get you there (running generations in reverse).
First-and-last-frame is a "solution" to the challenge/problem, not a replacement for the technical definition of "i2v". i2i uses a VAE encoder; i2v also uses a VAE encoder.
But is the sampler doing img2img for every frame of the video, or just giving it the encoded image plus noise to begin the sampling and then letting it run free?
First, thanks for all your hard work. You mentioned needing help from the community in researching the best params for configs;
I wrote up this article: https://civitai.com/articles/8313 and released this workflow pack: https://civitai.com/models/886896
It contains the default workflow, with 8 variants showing different tiling/batch configs.
I'm investigating the differences between tiling and batch results. I included my outputs (labelled) in the gallery for the V1 workflow pack release, which included many variations on manual batch/tiling.
I am interested to know if there is a better place I can share this work with you, as this is not a bug report per se. I think batch 16 with 0.3 overlap is nice, but I will be doing much more testing!
Thanks for all the hard work!! I really appreciated that the samples are kept in memory, allowing tweaks to the decoder only; it required just 17 seconds to complete the tiling with the same seed :)