ayaan-together opened this issue 3 months ago
I am sorry, I cannot examine the bug from the given context. Can you provide more details about your modification to the code? Or can you run the original notebook for video reconstruction?
The original notebook for video reconstruction works. Inside video_detokenizer.py I saw an `if quantize_motion:` branch that calls

`motion = self.motion_tokenizer.reconstruct(motion, height // self.vae_scale_factor, width // self.vae_scale_factor)`

and I just set `quantize_motion=True`.
I am basically trying to train this myself. I want to extract the visual tokens and the motion tokens so I can use them as inputs for LLM training.
Can you guide me on how to write a simple script that extracts the visual tokens and motion tokens (which I can send to the LLM), and then, given visual tokens and motion tokens, reconstructs the video? As far as I can tell, the reconstruction notebook does not reconstruct from the token level (quantized indices).
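For reference, this is the rough two-step flow I have in mind; it is only a sketch. `get_tokens` and `reconstruct_from_token` are the calls discussed in this thread, everything else (the visual-tokenizer call in particular) is a placeholder that may not match the actual API, and the inputs are assumed to be prepared exactly as in the reconstruction notebook:

```python
import torch

# Sketch only: `motion_tokenizer`, `model`, `keyframe`, `motions`, `width`, `height`
# are assumed to be loaded/prepared as in the reconstruction notebook, and the
# `visual_tokenizer.get_tokens(...)` line is my guess, not the repo's confirmed API.
with torch.no_grad():
    # Step 1: quantized indices to feed the LLM
    visual_tokens = visual_tokenizer.get_tokens(keyframe.to("cuda"))['tokens']   # placeholder call
    motion_tokens = motion_tokenizer.get_tokens(motions.to("cuda"))['tokens']

    # Step 2: reconstruct through the quantized-motion path
    frames = model.reconstruct_from_token(
        keyframe.to("cuda"), motions.to("cuda"),
        width=width, height=height, num_frames=24, decode_chunk_size=8,
        decode_from_token=True,   # route through motion_tokenizer.reconstruct
    )[0]
```

In other words: step 1 would produce the indices I feed to the LLM, and step 2 would be the inverse path back to pixels.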
Also, when pretraining, will the input look like this: [img] (list of image tokens) [/img] [mov] (list of motion tokens) [/mov] (list of text tokens)?
And when a video gives me 3 clips (so three lists of image and motion tokens), how do I format the input? Or should I make sure I only get one clip per input video when tokenizing? How can I do that? Thanks!
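To make the multi-clip question concrete, this is the layout I am imagining. The helper below is purely illustrative; the [img]/[mov] markers are the ones from my question above, not necessarily the repo's special tokens:

```python
def build_sequence(clips, text_tokens):
    """clips: list of (image_tokens, motion_tokens) pairs, one pair per clip.

    Returns a flat list that interleaves each clip's image and motion tokens,
    followed by the text tokens. This is only my guess at the pretraining format.
    """
    seq = []
    for image_tokens, motion_tokens in clips:
        seq += ["[img]"] + list(image_tokens) + ["[/img]"]
        seq += ["[mov]"] + list(motion_tokens) + ["[/mov]"]
    return seq + list(text_tokens)

# e.g. a video that was split into 3 clips:
# build_sequence([(img1, mov1), (img2, mov2), (img3, mov3)], text_tokens)
```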
This error happens when I call get_tokens on a motion input. Inside get_tokens it fails on encode. How can I fix it?
In the reconstruction notebook, why does the following fail (with the error stated above)? I added get_tokens and decode-from-token statements, and it fails on get_tokens:

```python
motions = motion_tokenizer.get_tokens(motions)['tokens']
frames = model.reconstruct_from_token(keyframe.to("cuda"), motions.to("cuda"), decode_chunk_size=8, width=width, height=height, num_frames=24, noise_aug_strength=0.02, cond_on_ref_frame=True, use_linear_guidance=True, max_guidance_scale=3.0, min_guidance_scale=1.0, decode_from_token=True)[0]
output_videopath = "reconstruct.gif"
```
So if I change the recon notebook to add the line `motion_transform = MotionVectorProcessor(width=36, height=20)`, the error goes away, but the generation quality is bad. Any fix?
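For what it's worth, here is a minimal sanity check I would try, assuming the 2880-vs-720 mismatch in the traceback further down is a token-count mismatch: 720 = 36 × 20 positions in the positional embedding, while 2880 = 72 × 40 tokens come from my motion tensor, i.e. the motion field fed to get_tokens is twice too large in each spatial dimension. The `F.interpolate` workaround below is only illustrative (not verified against the repo) and assumes a (N, C, H, W) motion layout; `motions` is the tensor prepared in the notebook:

```python
import torch.nn.functional as F

# The positional embedding in modeling_motion_tokenizer.py covers 720 positions,
# which matches a 36x20 grid; 2880 corresponds to a 72x40 grid (2x per dimension).
assert 36 * 20 == 720 and 72 * 40 == 2880

print("motion tensor shape:", motions.shape)  # check whether the spatial dims are (40, 72) instead of (20, 36)

# Possible (unverified) workaround: downsample the motion field to the 36x20 grid
# that MotionVectorProcessor(width=36, height=20) targets, before calling get_tokens.
motions_small = F.interpolate(motions, size=(20, 36), mode="nearest")
motion_tokens = motion_tokenizer.get_tokens(motions_small.to("cuda"))['tokens']
```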
I'm running the reconstruction notebook with `quantize_motion=True` enabled in `model.reconstruct_from_token`, and I get this:
File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/home/ayaan/LaVIT/VideoLaVIT/models/video_detokenizer.py", line 173, in reconstruct_from_token motion = self.motion_tokenizer.reconstruct(motion, height // self.vae_scale_factor, width // self.vae_scale_factor) File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, *kwargs) File "/home/ayaan/LaVIT/VideoLaVIT/models/modeling_motion_tokenizer.py", line 389, in reconstruct quantize, embed_ind = self.encode(x) File "/home/ayaan/LaVIT/VideoLaVIT/models/modeling_motion_tokenizer.py", line 431, in encode encoder_features = self.encoder(x) File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/home/ayaan/miniconda3/envs/lavit/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) File "/home/ayaan/LaVIT/VideoLaVIT/models/modeling_motion_tokenizer.py", line 236, in forward x = x + pos_embed RuntimeError: The size of tensor a (2880) must match the size of tensor b (720) at non-singleton dimension 2
How can I get the tokenizer working?