jayleicn / VideoLanguageFuturePred

[EMNLP 2020] What is More Likely to Happen Next? Video-and-Language Future Event Prediction

some questions for your baseline #9

Closed RQsky closed 2 years ago

RQsky commented 2 years ago

Hello, Brother Lei. I have been trying to implement your baseline recently, but I have run into some problems and hope to get your help.

  1. First of all, I am not sure whether my understanding of the model architecture is correct. I drew a detailed diagram (attached as vl_bl_detail) and hope you can check it. Premises for the diagram: a. suppose that in one example the video contains p frames, and each frame yields a 4096-D vector (ResNet 2048-D and ResNeXt 2048-D, each L2-normalized, then concatenated; see the sketch after this list), so the video feature size is p×4096; b. the subtitles contain q sentences, each with up to k words, so the subtitle id size is q×k; c. the answer is a single sentence of at most r words, so the answer id size is r.
  2. Secondly, the specific configuration of "a single transformer layer" is unknown beyond hidden size = 768: number of self-attention heads = ?, and so on.
  3. I also train this model on a single RTX 2080 Ti, but I am troubled by CUDA out of memory, so I can only set batch size = 1. What is the possible reason for this? Thank you!
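To make premise (a) concrete, here is a minimal PyTorch sketch of how I assume the per-frame 4096-D feature is built; the function name and the random tensors are only for illustration, not your actual preprocessing code:

import torch
import torch.nn.functional as F

def build_frame_features(resnet_feats: torch.Tensor,
                         resnext_feats: torch.Tensor) -> torch.Tensor:
    """L2-normalize each 2048-D feature separately, then concatenate to 4096-D.

    resnet_feats, resnext_feats: (p, 2048) tensors for a clip with p frames.
    Returns a (p, 4096) tensor.
    """
    resnet_feats = F.normalize(resnet_feats, p=2, dim=-1)
    resnext_feats = F.normalize(resnext_feats, p=2, dim=-1)
    return torch.cat([resnet_feats, resnext_feats], dim=-1)

# Example: a clip with p = 12 frames
video_feat = build_frame_features(torch.randn(12, 2048), torch.randn(12, 2048))
print(video_feat.shape)  # torch.Size([12, 4096])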
jayleicn commented 2 years ago

Hi @RQsky,

Thanks for your interest in our work! Your understanding is almost correct, except that the total subtitle length should be < q×k -- we directly concatenate all these sentences into a single sequence, so the actual input length is quite short. Your OOM error might also be caused by this; try limiting the sequence lengths. For the single transformer layer, we used the same implementation as in https://github.com/jayleicn/TVRetrieval/blob/master/baselines/crossmodal_moment_localization/model_xml.py#L71, configured as in https://github.com/jayleicn/TVRetrieval/blob/master/baselines/crossmodal_moment_localization/model_xml.py#L11.
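For reference, a rough sketch of what a single self-attention-only layer with hidden size 768 could look like in plain PyTorch (this is not the linked implementation; the head count and dropout below are assumptions, the linked config is authoritative):

import torch
import torch.nn as nn

class SingleTransformerLayer(nn.Module):
    """One self-attention block: multi-head attention + residual + LayerNorm.

    hidden_size=768 follows the paper; num_heads=12 and dropout=0.1 are
    placeholder assumptions -- check the linked config for the real values.
    """
    def __init__(self, hidden_size: int = 768, num_heads: int = 12, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, key_padding_mask=None) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        return self.norm(x + self.dropout(attn_out))

layer = SingleTransformerLayer()
out = layer(torch.randn(2, 50, 768))
print(out.shape)  # torch.Size([2, 50, 768])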

RQsky commented 2 years ago

Thanks for your reply, it clears up a lot. But there are still some things that I don't understand.

  1. Regarding the input data you use: below is example data from vlep_dev_release.jsonl and vlep_subtitles.jsonl. You use the entire video (i.e. friends_s03e09_seg02_clip_07_ep), not just the segment within the video (i.e. "ts": [38.81, 40.37]). Is the subtitle also the entire subtitle (i.e. the whole entry from vlep_subtitles.jsonl shown below)?
  2. Referring to https://github.com/jayleicn/TVRetrieval/blob/master/baselines/crossmodal_moment_localization/model_components.py#L175: for the single transformer layer, it seems that you only use BertAttention, without BertIntermediate and BertOutput, right?
# vlep_dev_release.jsonl
{"example_id": 20142, "vid_name": "friends_s03e09_seg02_clip_07_ep", "ts": [38.81, 40.37], 
 "events": ["Ross will stop, turn and point at Monica.", 
 "Ross will stop and ask Monica why she is pointing at him."], "answer": 0, "split": "dev"}
# vlep_subtitles.jsonl
{"vid_name": "5mjZA7K8oEg_subs_002_00:03:00_00:04:00_ep", "sub": [{"text": "You can see they're luxury.", "start": 1.21, "end": 2.8}, {"text": "They're plating them right here,", "start": 2.8, "end": 4.13}, {"text": "and you can get them either to eat in the restaurant,", "start": 4.13, "end": 6.15}, {"text": "or to go, we got ours here,", "start": 6.15, "end": 7.49}, {"text": "they look gooey, stuffed with beef and onion.", "start": 7.49, "end": 9.94}, {"text": "And there they are.", "start": 9.94, "end": 11.51}, {"text": "The samsa, delicious looking samsa.", "start": 11.51, "end": 15.84}, {"text": "And they're scraping the black part off the bottom ones", "start": 15.84, "end": 18.74}, {"text": "that are a little bit too charcoaly,", "start": 18.74, "end": 20.87}, {"text": "with this cheese grater here,", "start": 20.87, "end": 22.38}, {"text": "but ours just look perfect.", "start": 22.38, "end": 23.93}, {"text": "The top, ultra-premium quality samsa here.", "start": 23.93, "end": 26.853}, {"text": "Mm, mm, oh wow.", "start": 32.54, "end": 37.253}]}
jayleicn commented 2 years ago

The setting in VALUE/HERO is the same as in the VLEP dataset paper. I recommend using the VALUE code at https://github.com/VALUE-Leaderboard/StarterCode; it has full instructions and configs for running the code, and achieves better performance.