Hi @RQsky,
Thanks for your interest in our work! Your understanding is almost correct, except that the total subtitle length should be < qk -- we directly concatenate all these sentences into a single sequence, so the input length would be quite short. Your OOM error might also be caused by this; try limiting the sequence lengths. For the single transformer layer, we used the same implementation as in https://github.com/jayleicn/TVRetrieval/blob/master/baselines/crossmodal_moment_localization/model_xml.py#L71, configured as in https://github.com/jayleicn/TVRetrieval/blob/master/baselines/crossmodal_moment_localization/model_xml.py#L11.
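For readers who want a concrete picture, here is a minimal PyTorch sketch of such a single transformer layer: multi-head self-attention followed by a residual connection and LayerNorm, with no BertIntermediate/BertOutput feed-forward sub-layers. This is not the repository's exact code; the hidden size, head count, and dropout below are placeholder values, and the real configuration is at the second link above.

```python
import torch.nn as nn

class SingleTransformerLayer(nn.Module):
    """Sketch of a BertAttention-style layer: multi-head self-attention
    plus a residual connection and LayerNorm, without the feed-forward
    (BertIntermediate/BertOutput) sub-layers. Sizes are placeholders,
    not the values from the linked config."""

    def __init__(self, hidden_size=768, num_heads=12, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            hidden_size, num_heads, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x, key_padding_mask=None):
        # x: (batch, seq_len, hidden_size)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        # residual + LayerNorm, as in BERT's self-attention output block
        return self.layer_norm(x + self.dropout(attn_out))
```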
Thanks for your reply, it clears up a lot for me. But there are still some things that I don't understand.
1. About vlep_dev_release.jsonl and vlep_subtitles.jsonl: you use the entire video (i.e. friends_s03e09_seg02_clip_07_ep), not the segment within the video (i.e. "ts": [38.81, 40.37]). Is the subtitle also the entire subtitle (i.e. the entire example data from vlep_subtitles.jsonl that I show below)?
2. For the single transformer layer, you use only BertAttention, not BertIntermediate and BertOutput, right?

# vlep_dev_release.jsonl
{"example_id": 20142, "vid_name": "friends_s03e09_seg02_clip_07_ep", "ts": [38.81, 40.37],
"events": ["Ross will stop, turn and point at Monica.",
"Ross will stop and ask Monica why she is pointing at him."], "answer": 0, "split": "dev"}
# vlep_subtitles.jsonl
{"vid_name": "5mjZA7K8oEg_subs_002_00:03:00_00:04:00_ep", "sub": [{"text": "You can see they're luxury.", "start": 1.21, "end": 2.8}, {"text": "They're plating them right here,", "start": 2.8, "end": 4.13}, {"text": "and you can get them either to eat in the restaurant,", "start": 4.13, "end": 6.15}, {"text": "or to go, we got ours here,", "start": 6.15, "end": 7.49}, {"text": "they look gooey, stuffed with beef and onion.", "start": 7.49, "end": 9.94}, {"text": "And there they are.", "start": 9.94, "end": 11.51}, {"text": "The samsa, delicious looking samsa.", "start": 11.51, "end": 15.84}, {"text": "And they're scraping the black part off the bottom ones", "start": 15.84, "end": 18.74}, {"text": "that are a little bit too charcoaly,", "start": 18.74, "end": 20.87}, {"text": "with this cheese grater here,", "start": 20.87, "end": 22.38}, {"text": "but ours just look perfect.", "start": 22.38, "end": 23.93}, {"text": "The top, ultra-premium quality samsa here.", "start": 23.93, "end": 26.853}, {"text": "Mm, mm, oh wow.", "start": 32.54, "end": 37.253}]}
The setting in VALUE/HERO is the same as in the VLEP dataset paper. I recommend using the VALUE code https://github.com/VALUE-Leaderboard/StarterCode -- it has full instructions and configs on how to run the code, and it achieves better performance.
Hello, Brother Lei. I have been trying to implement your baseline recently, but I ran into some problems and hope to get your help.