kijai / ComfyUI-CogVideoXWrapper

Question about decoder #85

Closed thelemuet closed 1 month ago

thelemuet commented 1 month ago

Apologies if this is a stupid question, I have pretty much zero knowledge about how video models work compared to image models.

Is the way the sampler and decoder currently work somehow similar to generating images in batches with SD or Flux? If so, would it be theoretically possible to unbatch the latents or specify which frame(s) to send to the decoder by index?

Or am I totally off the mark, and the decoder already handles this and doesn't decode all frames at once?

The reason I got curious about this is that with the Fun-xB models (I assume this is caused by the models themselves, not the wrapper) I noticed extra/duplicated frames at the beginning of the image sequence (i.e. it outputs 16 frames when 13 frames is selected as the length). So I was wondering whether it was possible to remove those before sending the latents to the decoder, saving a bit of resources when decoding.

I tried a few different nodes to separate latents from batch/select by index but obviously it doesn't work ;)

kijai commented 1 month ago

CogVideoX uses a 3D VAE, meaning it also compresses the images temporally: 4 images into one latent. This causes the discrepancies you noticed with the frame counts. The decoding is done 2 "frames" at a time by design; it can't be fewer. If memory for decoding is a concern, the VAE tiling works pretty well with the tile size set to half of the dimensions.
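
To make the frame-count arithmetic concrete, here is a minimal sketch. It assumes the simplest possible model: every latent covers exactly 4 images and the requested length is rounded up to a whole number of latents. The function names are made up for illustration and are not part of the wrapper, and the real VAE may treat the first frame differently. It only shows why asking for 13 frames could come back as 16.

```python
import math

# From the comment above: 4 images are compressed into one latent.
TEMPORAL_FACTOR = 4


def latent_count(num_frames: int, factor: int = TEMPORAL_FACTOR) -> int:
    """Latents needed to cover num_frames (assumption: round up)."""
    return math.ceil(num_frames / factor)


def decoded_frame_count(num_frames: int, factor: int = TEMPORAL_FACTOR) -> int:
    """Frames produced if every latent decodes to exactly `factor` images."""
    return latent_count(num_frames, factor) * factor


if __name__ == "__main__":
    print(latent_count(13))         # 4 latents
    print(decoded_frame_count(13))  # 16 frames, 3 more than requested
```

Under this assumption the extra frames are not separate entries in a batch. They are part of what the last latent decodes to, which is why selecting latents by index cannot drop them one frame at a time.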

thelemuet commented 1 month ago

Ah, makes complete sense, thank you for the explanation.

Funnily enough, another reason I was messing with the latents is that yesterday I had issues with the VAE tiling resulting in very obvious seams. As an alternative I was splitting the latents with some padding using the core "crop latent" node before sending them to the decoder, then stitching the images back together myself. Clearly the latent shape was wrong: cropping the Width was cropping the Height and cropping the Height did nothing, but it did actually work, even if not as intended hehe.
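
One possible explanation for the swapped axes, sketched below under loud assumptions: the video latent is laid out as (batch, frames, channels, height, width), and the crop node slices axes 2 and 3 the way it would for a 4D image latent (batch, channels, height, width). Neither the layout nor the slicing is confirmed in this thread, and `crop_4d_style` is a hypothetical helper that only computes the resulting shape.

```python
def sliced_len(n: int, s: slice) -> int:
    """Length of an axis of size n after applying slice s."""
    return len(range(n)[s])


def crop_4d_style(shape, x, y, w, h):
    """Shape after an image-style crop [:, :, y:y+h, x:x+w] on any tensor shape."""
    out = list(shape)
    out[2] = sliced_len(shape[2], slice(y, y + h))  # meant to be height
    out[3] = sliced_len(shape[3], slice(x, x + w))  # meant to be width
    return tuple(out)


if __name__ == "__main__":
    # Assumed layout: (batch, frames, channels, height, width)
    video_latent = (1, 4, 16, 60, 90)

    # "Crop height to 30": the slice lands on the 16 channels, so nothing changes.
    print(crop_4d_style(video_latent, x=0, y=0, w=90, h=30))  # (1, 4, 16, 60, 90)

    # "Crop width to 45": the slice lands on the height axis instead.
    print(crop_4d_style(video_latent, x=0, y=0, w=45, h=60))  # (1, 4, 16, 45, 90)
```

With these example sizes, the height crop is a no-op because the slice is larger than the channel axis it actually hits, and the width crop shortens the height axis, which matches the behaviour described above.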

But looks like there are no more visible seams with VAE tiling after updating today, so thank you, definitely much more convenient ;)

kijai commented 1 month ago

Yes, the defaults were just awful before. I got the values from the initial code before these new models, and as I never really used it myself, I didn't realise they should be completely different. 96x96 tiles made no sense, especially in pixel space. Half of the image dimension seems fine now, with no seams at 0.2 overlap.
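
As a rough illustration of those settings, the sketch below derives tile sizes from the image dimensions. `tile_settings` is a hypothetical helper, not the wrapper's API, and it assumes the 0.2 overlap is a fraction of the tile size. The 720x480 resolution is only an example value.

```python
def tile_settings(width: int, height: int, overlap: float = 0.2) -> dict:
    """Tile size as half of each dimension, with overlap as a fraction of the tile."""
    tile_w = width // 2
    tile_h = height // 2
    return {
        "tile_w": tile_w,
        "tile_h": tile_h,
        # Assumption: overlap is measured relative to the tile size.
        "overlap_w": round(tile_w * overlap),
        "overlap_h": round(tile_h * overlap),
    }


if __name__ == "__main__":
    print(tile_settings(720, 480))
    # {'tile_w': 360, 'tile_h': 240, 'overlap_w': 72, 'overlap_h': 48}
```

For comparison, a fixed 96x96 tile on the same 720x480 image would be far smaller than half of either dimension, so many more tile borders would need blending.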