Wuziyi616 opened this issue 1 week ago
Hi, yes! The videos are from the original datasets (UVO, Oops and DiDeMo).
Please let me know if you have any issues downloading them, and feel free to share your steps here (for future users).
Thanks for your reply. It's actually easier than I thought: I just went to their official websites and selected the videos listed in the annotation JSON files. The only issue I ran into was with the DiDeMo dataset -- it's a very old dataset, so its website no longer works. Nevertheless, I found this issue reply and followed it to successfully download the raw videos.
More specifically, I downloaded the `.tar.gz` file from this website.

I'll leave this issue open in case I run into any issues when using the dataset, if that sounds good.
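For future users, the selection step described above can be sketched like this (a hypothetical sketch: it assumes the annotation JSON is a list of entries with a `video_name` field, which is an assumption and not necessarily StoryBench's actual schema):

```python
import json

def referenced_videos(annotation_path):
    """Collect the (deduplicated, sorted) video files referenced by an
    annotation JSON, so only those need to be fetched from the source
    dataset. The "video_name" field is an assumed schema."""
    with open(annotation_path) as f:
        entries = json.load(f)
    return sorted({e["video_name"] for e in entries})
```

One would then download only the returned files from the original dataset's website or mirror.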
Great! You can also download the DiDeMo videos from a Google Drive linked in their repo's README file.
I checked that link first, but I think they only stored 13 missing videos in the Drive, not all of them? Maybe I'm missing something.
Edit: oh interesting, it turns out this is a different GitHub repo -- the one I checked is the release version, which has a different Google Drive link. Anyway, I guess using a script is easier than downloading lots of videos from a Drive folder with gdown, since gdown has a limit of at most 50 files per folder.
Again, thank you for your prompt reply. I have some questions regarding the task split and annotations, and hope you can kindly give me some hints:

- In `data/uvo-test.json`, I found two stories that both come from video `-FbSzomWtWw.mp4`, and their `(start_times, end_times)` overlap -- one is `[0, 10.005333]`, the other is `[0.705416, 4.366505]`. Their `sentence_parts` also seem to describe the same story. My guess is that you have 2 annotators per video, and both annotations are included in the dataset.
- I checked `data/tasks/uvo-test/story_cont.json` and found that some stories have only one segment, e.g. the first two entries I posted below, which are the ones mentioned in the previous question. I'm a bit confused because, from my understanding, story continuation means we have some initial frames of the first segment plus captions of more than one segment. But since these two examples each have only one text, shouldn't they be excluded from the story continuation task? IMO they are just action execution. This actually happens a lot: 1623 out of 2613 stories have only one segment. Should I exclude them from evaluation?

`video_name`, `texts` and `exact_frames_per_prompt` are all the same.

>>> pprint(story_cont[0])
{'background': None,
'comment': 'UVO_dense_val_100_0',
'durations': None,
'exact_frames_per_prompt': [76],
'indices_to_select': None,
'npz_gt_video_end_frame': None,
'npz_gt_video_start_frame': 0,
'npz_video': 'storybench/npy_96x160pix_8fps/uvo-test/videos/-FbSzomWtWw.npy',
'npz_video_end_frame': 4,
'npz_video_start_frame': 0,
'skip_frames_after_generation': 4,
'storybench_mode': 'story_cont',
'texts': ['A man wearing a white t-shirt is sitting behind the table, eating a burger and enjoying it while giving a thumbs up when it tastes good while a person whose hand is visible is holding a fork and picking up the food.']}
>>> pprint(story_cont[1])
{'background': None,
'comment': 'UVO_dense_val_100_1',
'durations': None,
'exact_frames_per_prompt': [25],
'indices_to_select': None,
'npz_gt_video_end_frame': None,
'npz_gt_video_start_frame': 0,
'npz_video': 'storybench/npy_96x160pix_8fps/uvo-test/videos/-FbSzomWtWw.npy',
'npz_video_end_frame': 10,
'npz_video_start_frame': 0,
'skip_frames_after_generation': 10,
'storybench_mode': 'story_cont',
'texts': ['A person, whose hand is visible, is holding a fork and picking some food with it while a man wearing a white t-shirt is sitting and eating food and looking at it.']}
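If single-text entries should indeed be excluded, the split can be sketched as follows (a sketch based on the entry format shown above; whether dropping them is correct is exactly the open question):

```python
def split_by_segments(stories):
    """Separate multi-segment stories from single-segment ones, judging
    by the number of caption texts per entry (the "texts" field above)."""
    multi = [s for s in stories if len(s["texts"]) > 1]
    single = [s for s in stories if len(s["texts"]) == 1]
    return multi, single
```

With the counts mentioned above, `single` would hold 1623 of the 2613 UVO test stories.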
Sorry for so many questions, but a few more after processing the data in detail:

The annotations in `data/uvo-test.json` and those in `data/tasks/uvo-test/story_gen.json` don't always match -- sometimes the background descriptions differ, and sometimes the segment start/end timestamps differ. See the attached example. Should I always use the ones under `tasks`? On DiDeMo everything matches though, and I haven't checked Oops.

>>> annotations[322]['background_description']
'In the background, people are speaking. There is a brown surface, brown walls, brown kettles, a white dish, white bowls, brown bowls, a golden object, a bottle, some wooden objects and other miscellaneous items.'
>>> story_gen[322]['background']
'In the background, there are clay teapot, ceramic jar and the cups, and the table.'
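A quick way to count such disagreements is a sketch like this (it assumes the two files parse into parallel lists indexed the same way, with the field names shown in the example above):

```python
def background_mismatches(annotations, story_gen):
    """Return indices where the raw annotations and the task file
    disagree on the background text. Assumes parallel, same-order lists."""
    return [
        i for i, (a, t) in enumerate(zip(annotations, story_gen))
        if a["background_description"] != t["background"]
    ]
```

The same pattern applies to the start/end timestamps by swapping in those fields.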
@e-bug a gentle reminder about the above questions. Thank you!
Hi, thanks for this great work. In the README I see instructions for downloading the training data. However, I wonder where I can download the validation & testing data? In the JSON files I only see video paths formatted like `storybench/xxx`, which are not links. Could you provide links for downloading these videos?

EDIT: OK, I guess I can download them from the original video datasets. Hopefully the video paths match the original video names.
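For the matching, the original video name can usually be recovered from the path stem -- e.g. the `.npy` paths appearing in the task files end in stems like `-FbSzomWtWw`, matching `-FbSzomWtWw.mp4` in UVO. A sketch (the `.mp4` extension is an assumption; the source datasets may use other containers):

```python
from pathlib import Path

def original_video_name(storybench_path, ext=".mp4"):
    """Map a StoryBench-style file path back to the original dataset's
    video filename, assuming the file stem matches the source video name."""
    return Path(storybench_path).stem + ext

# "storybench/npy_96x160pix_8fps/uvo-test/videos/-FbSzomWtWw.npy"
#   -> "-FbSzomWtWw.mp4"
```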