On the code of generating “pred_dir”

wyzjack commented 8 months ago

Hi authors,

Congrats on your nice work! In your code here (https://github.com/mbzuai-oryx/Video-ChatGPT/blob/cb6f2259065c3b2036f3aefc4ca411726235f797/data/generate_instruction_qa_semi_automatic.py#L23C107-L24C21) you need to load the extracted context information from the raw videos. Could you provide the code for generating the contents in "pred_dir"?

Many thanks in advance!

hanoonaR commented 8 months ago

Hi @wyzjack,

Thank you for your interest in our work!

We haven't made the code for generating pred_dir publicly available. However, you can get an understanding of the process from Section 4.2 of our paper. In summary:

Data Enrichment: We use off-the-shelf models like BLIP-2 and GRiT to generate frame-level captions. Tag2Text is used for key-frame tagging.
Noise Filtering: Captions are filtered based on a high prediction threshold and on their alignment with Tag2Text tags, which removes noisy or irrelevant data. The filtering is word-level and uses the Tag2Text tag vocabulary: we extract the words in each BLIP-2 or GRiT frame-level caption that belong to that vocabulary, and discard the caption if any of those words is not among the tags Tag2Text predicted for that frame. This has been effective at removing noisy frame-level captions.
Caption Merging: GPT-3.5 is used to merge frame-level captions into a singular, coherent video-level caption.
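
For readers who want to reproduce the noise-filtering step, here is a minimal sketch of the word-level check described above. It is not the authors' unreleased code. The function names, the regex tokenizer, and the toy vocabulary are hypothetical, tags are assumed to be single lowercase words, and dropping captions that contain no vocabulary word at all is an assumption the description does not settle.

```python
import re


def vocabulary_words(caption, vocabulary):
    """Return the words of `caption` that belong to the tag vocabulary."""
    # Assumption: simple lowercase alphabetic tokenization is good enough.
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {token for token in tokens if token in vocabulary}


def keep_caption(caption, frame_tags, vocabulary):
    """Keep a caption only if all of its vocabulary words are tags of the frame."""
    words = vocabulary_words(caption, vocabulary)
    # Assumption: a caption with no vocabulary words matches no tag, so drop it.
    return bool(words) and words <= set(frame_tags)


def filter_captions(captions, frame_tags, vocabulary):
    """Filter the frame-level captions of one frame against that frame's tags."""
    return [c for c in captions if keep_caption(c, frame_tags, vocabulary)]


# Toy data standing in for the Tag2Text vocabulary and one frame's predictions.
vocabulary = {"dog", "cat", "ball", "grass", "car"}
frame_tags = {"dog", "ball", "grass"}
captions = [
    "a dog chasing a ball on the grass",  # all vocabulary words are frame tags
    "a cat sitting next to a car",        # "cat" and "car" are not frame tags
    "a dog beside a car",                 # "car" is not a frame tag
]

kept = filter_captions(captions, frame_tags, vocabulary)
print(kept)
```

With the toy data only the first caption survives. The prediction-threshold part of the filtering is not covered here, because the thread does not give the threshold value.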

Note on Settings:

Tag2Text uses Swin-B with an input size of 384.
BLIP-2 has an input size of 224.
GRiT uses ViT-B with an input size of 384.
FPS for BLIP-2 and GRiT: We applied BLIP-2 and GRiT to 10 key-frames extracted from every video using Katna. Key-frame selection looks for very different visual features, which yields varied descriptions of the video. With Katna, the average number of frames per clip depends on the specific implementation; in our case it was fixed at 10 key-frames per video.
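
The key-frame extraction could look roughly like the sketch below. It assumes a recent Katna release that exposes `Video.extract_video_keyframes` and `KeyFrameDiskWriter`; the wrapper function, its parameter names, and the paths in the usage comment are hypothetical, and the authors' actual script may differ.

```python
import os


def extract_keyframes(video_path, out_dir, num_frames=10):
    """Write `num_frames` Katna key-frames of `video_path` into `out_dir`."""
    # Imported lazily so this module loads even where Katna is not installed.
    from Katna.video import Video
    from Katna.writer import KeyFrameDiskWriter

    os.makedirs(out_dir, exist_ok=True)
    writer = KeyFrameDiskWriter(location=out_dir)
    Video().extract_video_keyframes(
        no_of_frames=num_frames, file_path=video_path, writer=writer
    )


# Hypothetical usage (paths are placeholders):
# if __name__ == "__main__":
#     extract_keyframes("videos/example.mp4", "keyframes/example")
```

Katna relies on multiprocessing, so it is safest to call the function under an `if __name__ == "__main__":` guard, as in the commented usage.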

Hope this helps. Feel free to ask if you have more specific questions, such as hyperparameters or other details.

wyzjack commented 8 months ago

Got it, thanks so much for your reply and information!

On the code of generating “pred_dir” #60