xiaokj37 opened this issue 1 month ago (status: Open)
For training, we pre-extract the CLS embedding of each frame and project it with the mm_projector defined in the VTimeLLMMetaModel class. The relevant code is in model/vtimellm_arch.py. For the code that extracts the CLS embedding, see inference.py.
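As a rough illustration of the projection step described above, here is a minimal PyTorch sketch. It is not the code from model/vtimellm_arch.py: the FrameProjector class, the project_frames helper, the single linear layer, the dimensions (768 for the frame feature, 4096 for the LLM hidden size), and the frame count of 100 are all assumptions chosen for the example. Random tensors stand in for the pre-extracted CLS embeddings.

```python
import torch
import torch.nn as nn


class FrameProjector(nn.Module):
    """Hypothetical stand-in for a projector like mm_projector.

    Assumes a single linear layer; the real module may differ.
    """

    def __init__(self, feature_dim: int = 768, hidden_dim: int = 4096):
        super().__init__()
        # Maps one per-frame CLS embedding to one LLM-sized token embedding.
        self.mm_projector = nn.Linear(feature_dim, hidden_dim)

    def forward(self, cls_features: torch.Tensor) -> torch.Tensor:
        # cls_features: (num_frames, feature_dim)
        return self.mm_projector(cls_features)


def project_frames(cls_features: torch.Tensor, projector: nn.Module) -> torch.Tensor:
    """Project pre-extracted frame features without tracking gradients."""
    with torch.no_grad():
        return projector(cls_features)


# Random placeholder for pre-extracted CLS embeddings (assumed: 100 frames, 768 dims).
num_frames = 100
cls_features = torch.randn(num_frames, 768)

projector = FrameProjector()
frame_tokens = project_frames(cls_features, projector)
print(frame_tokens.shape)
```

Each frame ends up as one vector in the LLM embedding space, so a video becomes a short sequence of frame tokens that can be placed alongside the text token embeddings.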
First of all, thank you very much for open-sourcing your work. According to your paper, VTimeLLM projects the image CLS token into the LLM embedding space. Could you point me to where this is implemented in the code? Looking forward to your reply.