jy0205 / LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

quantize not working for motion #34

Open ayaan-together opened 3 months ago

ayaan-together commented 3 months ago

I'm running the reconstruction notebook; when I set `quantize_motion=True` in `model.reconstruct_from_token`, I get this:

```python
frames = model.reconstruct_from_token(keyframe.to("cuda"), motions.to("cuda"), decode_chunk_size=8,
```

```
  File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ayaan/LaVIT/VideoLaVIT/models/video_detokenizer.py", line 173, in reconstruct_from_token
    motion = self.motion_tokenizer.reconstruct(motion, height // self.vae_scale_factor, width // self.vae_scale_factor)
  File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ayaan/LaVIT/VideoLaVIT/models/modeling_motion_tokenizer.py", line 389, in reconstruct
    quantize, embed_ind = self.encode(x)
  File "/home/ayaan/LaVIT/VideoLaVIT/models/modeling_motion_tokenizer.py", line 431, in encode
    encoder_features = self.encoder(x)
  File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ayaan/LaVIT/VideoLaVIT/models/modeling_motion_tokenizer.py", line 236, in forward
    x = x + pos_embed
RuntimeError: The size of tensor a (2880) must match the size of tensor b (720) at non-singleton dimension 2
```

How can I get the tokenizer working?

jy0205 commented 3 months ago

I am sorry, I cannot pinpoint the bug from the given context. Could you provide more details about your modifications to the code? Also, can you run the original notebook for video reconstruction?

ayaan-together commented 3 months ago

The original notebook for video reconstruction works. Inside `video_detokenizer.py`, I saw this option:

```python
if quantize_motion:
    # Use the reconstructed motion as input
    motion = self.motion_tokenizer.reconstruct(motion, height // self.vae_scale_factor, width // self.vae_scale_factor)
```

and I just set `quantize_motion=True`.

I am basically trying to train this myself: I want to extract the visual tokens and the motion tokens so I can use them as inputs for LLM training.

Can you guide me on how to write a simple script that extracts the visual tokens and motion tokens (which I can feed to the LLM) and then, given visual and motion tokens, reconstructs the video? From what I can tell, the reconstruction notebook does not reconstruct from the token level (quantized indices).
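To make it concrete, this is roughly the round trip I'm after (just a sketch: `get_tokens` and `reconstruct_from_token` are from the repo, but the visual-tokenizer call and the exact arguments are my assumptions):

```python
# Tokenize: video -> discrete indices (the inputs I'd feed the LLM).
motion_tokens = motion_tokenizer.get_tokens(motions)["tokens"]
visual_tokens = visual_tokenizer.get_tokens(keyframe)["tokens"]   # assumed API

# Detokenize: given the discrete indices, decode back to frames.
frames = model.reconstruct_from_token(
    keyframe.to("cuda"), motions.to("cuda"),
    decode_chunk_size=8, decode_from_token=True,  # decode from quantized tokens
)[0]
```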

Also, when pretraining, will the input look like this: `[img]` (list of image tokens) `[/img]` `[mov]` (list of motion tokens) `[/mov]` (list of text tokens), with the token IDs offset by the tokenizer vocab size, or is it something else? I want to train in this format, so this would be really helpful!

And for examples where I get 3 clips (so three lists of img and mov tokens), how do I format the input? Or should I make sure I only get one clip per input video when tokenizing? How can I do that? Thanks!
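For concreteness, here is the kind of layout I'm imagining (a sketch only: the special-token IDs, vocab sizes, and the per-clip repetition are my guesses, not taken from the repo):

```python
TEXT_VOCAB = 32000            # assumed text vocab size
VISUAL_VOCAB = 16384          # assumed visual codebook size
IMG_BOS, IMG_EOS, MOV_BOS, MOV_EOS = 0, 1, 2, 3   # placeholder special-token IDs

def pack_clip(visual_tokens, motion_tokens):
    # Offset each modality past the preceding vocabularies so IDs don't collide.
    return (
        [IMG_BOS] + [t + TEXT_VOCAB for t in visual_tokens] + [IMG_EOS]
        + [MOV_BOS] + [t + TEXT_VOCAB + VISUAL_VOCAB for t in motion_tokens] + [MOV_EOS]
    )

# Dummy data for illustration: 3 clips' worth of (visual, motion) token lists.
clip_token_pairs = [([11, 12], [21, 22])] * 3
text_token_ids = [101, 102, 103]

# For a 3-clip video, repeat the clip block and append the text tokens:
clips = [pack_clip(v, m) for v, m in clip_token_pairs]
sequence = [tok for clip in clips for tok in clip] + text_token_ids
```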

ayaan-together commented 3 months ago

This error happens whenever I call `get_tokens` on a motion input; within `get_tokens`, it fails in `encode`. How can I fix it?

In the reconstruction notebook, why does the following fail (with the error above)? I added `get_tokens` and decode-from-token calls, and it fails on `get_tokens`:

```python
motions = motion_tokenizer.get_tokens(motions)['tokens']

frames = model.reconstruct_from_token(
    keyframe.to("cuda"), motions.to("cuda"), decode_chunk_size=8,
    width=width, height=height, num_frames=24, noise_aug_strength=0.02,
    cond_on_ref_frame=True, use_linear_guidance=True,
    max_guidance_scale=3.0, min_guidance_scale=1.0, decode_from_token=True,
)[0]
output_videopath = "reconstruct.gif"
```
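A shape check right before `get_tokens` shows the mismatch (just a debugging sketch; the 720 expected positions are inferred from the error message, not from any docs):

```python
# The pos_embed in modeling_motion_tokenizer.py apparently covers 720 positions,
# while my motion tensor arrives with 2880 along dimension 2 (4x too many).
print(motions.shape)   # dim 2 is 2880 here; the encoder's pos_embed expects 720
```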

ayaan-together commented 3 months ago

So if I change the recon notebook to add this line:

```python
motion_transform = MotionVectorProcessor(width=36, height=20)
```

the error goes away, but the generation quality is bad. Any fix?
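If I've read the shapes right, the arithmetic behind this workaround (my own inference, not confirmed anywhere in the repo) is:

```python
# 36 x 20 gives exactly the 720 positions pos_embed covers, so the shapes line up.
assert 36 * 20 == 720
# The failing tensor had 2880 positions, i.e. 4x more (2x per spatial axis):
assert 2880 == 4 * 720
# So the processor halves the motion grid per axis, discarding 3/4 of the
# motion vectors -- plausibly why the reconstruction quality drops.
```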