tsujuifu / pytorch_violet

A PyTorch implementation of VIOLET

Could you provide the script for generating 'txt_xxx.json' in '_data' #3

Closed. ruiyan1995 closed this issue 1 year ago

tsujuifu commented 2 years ago

Since each dataset has its own original JSON file, the scripts will differ (though they should not be too difficult to write, I believe). The txt_xxx.json is provided as an example of the format to follow.

ruiyan1995 commented 2 years ago

Thanks for your reply. Yes, but some of the details are different. For example, each video in MSRVTT-retrieval has more than one caption. Do you use the first one or choose one at random in this work?

tsujuifu commented 2 years ago

Thanks for pointing this out!

Yes. MSRVTT-Retrieval has 10 captions for each video, and we treat them as independent items. Therefore, the JSON file should be like:

{
  "train": [
    {
      "video": "vid1",
      "caption": "cap1"
    },
    {
      "video": "vid1",
      "caption": "cap2"
    },
    {
      "video": "vid1",
      "caption": "cap3"
    },
    {
      "video": "vid2",
      "caption": "cap1"
    },
    {
      "video": "vid2",
      "caption": "cap2"
    },
    {
      "video": "vid2",
      "caption": "cap3"
    }
  ],
  "val": [],
  "test": []
}

For example, there are two captions for video5029 in txt_msrvtt-retrieval.json.
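As an illustration of the format above, here is a minimal sketch that flattens a per-video caption list into one item per (video, caption) pair. The function name `build_txt_json` and the input layout `{split: {video_id: [captions]}}` are assumptions made for this example; each dataset's original annotation file would first need its own parsing step to reach that layout.

```python
import json
from typing import Dict, List


def build_txt_json(captions_by_split: Dict[str, Dict[str, List[str]]]) -> dict:
    """Flatten {split: {video_id: [captions]}} into one item per (video, caption) pair."""
    out = {"train": [], "val": [], "test": []}
    for split, videos in captions_by_split.items():
        items = out.setdefault(split, [])
        for video_id, captions in videos.items():
            for caption in captions:
                # Every caption becomes its own independent item.
                items.append({"video": video_id, "caption": caption})
    return out


# Hypothetical toy input; real data would come from the dataset's own annotation file.
raw = {
    "train": {"vid1": ["cap1", "cap2", "cap3"], "vid2": ["cap1", "cap2", "cap3"]},
    "val": {},
    "test": {},
}
txt = build_txt_json(raw)
print(json.dumps(txt, indent=2))
```

With the toy input, the "train" list contains six items, three per video, in the same layout as the JSON shown above.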

ruiyan1995 commented 2 years ago

Thanks. But I cannot reproduce the zero-shot retrieval results on MSRVTT. I use the following settings:

1. The JSFusion split (9000 videos for training and 1000 for testing).
2. For testing, we choose the first caption, the same as "Frozen in Time" (details here).
3. Load the best ckpt provided by you.

I can only get the following test result for zero-shot retrieval on MSRVTT: {'r@1': 0.22200000000000017, 'r@5': 0.5130000000000003, 'r@10': 0.6680000000000005, 'median': 5}.
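For readers unfamiliar with these metrics, the sketch below shows one common way to compute R@K and the median rank from a caption-to-video similarity matrix. It assumes a square matrix in which the correct video for caption i is video i, and it counts only strictly higher scores when ranking. Whether the repository's evaluation code handles ties or direction in exactly this way is not stated in this thread.

```python
from statistics import median
from typing import Dict, List


def retrieval_metrics(sim: List[List[float]]) -> Dict[str, float]:
    """Text-to-video recall from a square similarity matrix.

    sim[i][j] is the score between caption i and video j; the correct
    video for caption i is assumed to be video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(score > row[i] for score in row))
    n = len(ranks)
    return {
        "r@1": sum(r <= 1 for r in ranks) / n,
        "r@5": sum(r <= 5 for r in ranks) / n,
        "r@10": sum(r <= 10 for r in ranks) / n,
        "median": median(ranks),
    }


# Hypothetical 3x3 toy scores, not real model output.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.6],
]
metrics = retrieval_metrics(sim)
print(metrics)
```

In the toy matrix, captions 0 and 2 rank their own video first while caption 1 ranks it second, so two of three queries count toward R@1.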

tsujuifu commented 2 years ago

Let me check it. The result is interesting: R@5 and R@10 are even higher than what we reported 😂

Please first make sure that you use 5 sampled video frames and frame size 224, as we do for downstream tasks.

ruiyan1995 commented 2 years ago

Yes, I confirm.

ruiyan1995 commented 2 years ago

@tsujuifu Hi, I have checked again. I used the specific caption indices for JSFusion provided in "jsfusion_val_caption_idx.pkl". So I want to know: how do you get the caption during testing on MSRVTT?
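To make the two test-time choices concrete, here is a minimal sketch contrasting "first caption" with "caption at a given index". It assumes the index file boils down to a mapping from video ID to a single caption index, which is not confirmed in this thread; `pick_test_captions` and the toy data are hypothetical.

```python
from typing import Dict, List, Optional


def pick_test_captions(
    captions: Dict[str, List[str]],
    caption_idx: Optional[Dict[str, int]] = None,
) -> List[dict]:
    """Pick one caption per test video.

    Without caption_idx the first caption is used; otherwise the caption at
    the given index is used (falling back to index 0 for missing videos).
    """
    items = []
    for video_id, caps in captions.items():
        idx = 0 if caption_idx is None else caption_idx.get(video_id, 0)
        items.append({"video": video_id, "caption": caps[idx]})
    return items


# Hypothetical toy data standing in for the real captions and index file.
captions = {"vid1": ["cap1", "cap2", "cap3"], "vid2": ["cap1", "cap2", "cap3"]}
caption_idx = {"vid1": 2, "vid2": 0}

first = pick_test_captions(captions)                # always the first caption
indexed = pick_test_captions(captions, caption_idx)  # caption chosen by index
print(first)
print(indexed)
```

The two selections can yield different test sets for the same videos, so results are only comparable when both sides use the same rule.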

tsujuifu commented 2 years ago

We directly adopt it from ClipBERT, so it should be the same as JSFusion.

ruiyan1995 commented 2 years ago

Thanks for your kind help. I have reproduced the MSRVTT-retrieval results (R@1: 33.7) with the finetuned ckpt (provided by you), but I still cannot get promising results in the zero-shot setting.

chenbiaolong commented 2 years ago

@tsujuifu Can you provide the whole txt_msrvtt-retrieval.json you use? I can't reproduce the result of the MSRVTT-retrieval downstream task using your best pretrained ckpt. I want to make sure I use the right train & test data.

tsujuifu commented 2 years ago

Here is the txt_msrvtt-retrieval.json I used.