mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
https://mbzuai-oryx.github.io/Video-ChatGPT
Creative Commons Attribution 4.0 International

Some video features are not found in pre-computed spatiotemporal CLIP features #24

Closed msra-jqxu closed 11 months ago

msra-jqxu commented 11 months ago

Hi, I found that "v_6Ke30NtYOC0.pkl" is listed as training data in "video_chatgpt_training.json" (obtained from "scripts/convert_instruction_json_to_training_format.py"), but it is not among the downloaded pre-computed spatiotemporal CLIP features (link). How can I fix this problem?

Thanks!

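One way to see how many entries are affected is to diff the video ids in the training JSON against the downloaded feature files. A minimal sketch (the paths, and the `id` key produced by the conversion script, are assumptions here):

```python
import json
import os

def find_missing_features(train_json_path, features_dir):
    """Return video ids in the training JSON with no matching .pkl feature file."""
    with open(train_json_path, "r") as f:
        entries = json.load(f)
    # Feature files are named <video_id>.pkl, so compare on the file stem.
    available = {os.path.splitext(name)[0] for name in os.listdir(features_dir)}
    return sorted({entry["id"] for entry in entries} - available)
```

For example, `find_missing_features("video_chatgpt_training.json", "clip_features/")` would return ids such as `v_6Ke30NtYOC0` if their feature files are absent.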

mmaaz60 commented 11 months ago

Hi @msra-jqxu,

Thank you for your interest in our work. Some of the video files were corrupted in our case, which is likely why the corresponding CLIP feature files are missing and why you see the mismatch. You can skip these videos, as we did in our experiments.

The filtering script we used is attached below for your reference. Note that it takes an additional command-line argument (i.e. --clip_feature_path). Let me know if it solves the issue or if you have any further questions. Thank you.

import os
import json
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Training")

    parser.add_argument("--input_json_file", required=True,
                        help="Path to input json file (i.e. VideoInstruct_Dataset.json)")
    parser.add_argument("--output_json_file", required=True,
                        help="Path to output json file (i.e. VideoInstruct_Dataset_Train.json)")
    parser.add_argument("--clip_feature_path", required=False, default="",
                        help="Path to generated CLIP feature paths to filter any missing video ids (optional).")

    args = parser.parse_args()

    return args

def main():
    args = parse_args()
    input_json_file = args.input_json_file
    output_json_file = args.output_json_file
    clip_feature_path = args.clip_feature_path

    clip_features_files_without_extension = set()
    if clip_feature_path:
        # Feature files are named <video_id>.pkl; keep only the stems for lookup.
        for file in os.listdir(clip_feature_path):
            clip_features_files_without_extension.add(os.path.splitext(file)[0])

    input_json_contents = json.load(open(input_json_file, 'r'))
    output_json_contents = []
    for i, content in enumerate(input_json_contents):
        # Keep the annotation if no filtering path was given, or if a
        # pre-computed CLIP feature file exists for its video id.
        valid = not clip_feature_path or content['video_id'] in clip_features_files_without_extension

        if valid:
            output_content = {'id': content['video_id'], 'video': f"{content['video_id']}.pkl", 'conversations': []}
            # Critical: alternate the <video> token between the end and the
            # start of the question across consecutive annotations.
            if i % 2 == 0:
                output_content['conversations'].append({'from': 'human', 'value': f"{content['q']}\n<video>"})
            else:
                output_content['conversations'].append({'from': 'human', 'value': f"<video>\n{content['q']}"})
            output_content['conversations'].append({'from': 'gpt', 'value': content['a']})
            output_json_contents.append(output_content)

    print(f"Total annotations retained: {len(output_json_contents)}")
    with open(output_json_file, 'w') as f:
        json.dump(output_json_contents, f)

if __name__ == "__main__":
    main()
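To make the alternating `<video>` placement in the script above concrete, a single raw annotation (assuming the `{'video_id', 'q', 'a'}` schema of VideoInstruct_Dataset.json) converts like this:

```python
def to_training_record(content, index):
    """Convert one raw {'video_id', 'q', 'a'} annotation into a training record."""
    # Even-indexed annotations place <video> after the question, odd-indexed before it.
    if index % 2 == 0:
        human_value = f"{content['q']}\n<video>"
    else:
        human_value = f"<video>\n{content['q']}"
    return {
        'id': content['video_id'],
        'video': f"{content['video_id']}.pkl",
        'conversations': [
            {'from': 'human', 'value': human_value},
            {'from': 'gpt', 'value': content['a']},
        ],
    }
```

So the annotation at index 0 yields a human turn ending in `\n<video>`, while the one at index 1 yields a human turn starting with `<video>\n`.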
msra-jqxu commented 11 months ago

Hi, @mmaaz60 , The filtering script really works for me and now I can train the model successfully! Thanks very much!

By the way, I specified the --model_name_or_path parameter for training, and a warning pops up during training: "You are using a model of type llava to instantiate a model of type VideoChatGPT. This is not supported for all configurations of models and can yield errors." Is this normal?

mmaaz60 commented 11 months ago

Hi @msra-jqxu,

This is normal. Thank you.

msra-jqxu commented 11 months ago

Thanks again! I will close this issue as completed.

msra-jqxu commented 11 months ago

Hi, @mmaaz60 @hanoonaR, I noticed that the official ActivityNet200 website lists 10,024 videos in its training set, but the training set I downloaded from here contains 13,329 videos. Could you explain where those 3,000+ videos came from? Thanks!
