mira-space / MiraData

Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
https://mira-space.github.io/
GNU General Public License v3.0

caption format #14

Open wren93 opened 1 month ago

wren93 commented 1 month ago

Hi, thanks for sharing this dataset. What do notations such as "|||0||| 120|||" in the captions mean? Thanks.

An example (dense caption): '[|||0||| 120||| A man in a dark trench coat and hat walks down a city street at night in the rain. The scene begins along a waterfront with boats docked in the water, and the man walks along a wet, reflective pavement. As he continues, he transitions from the waterfront to a bustling city street illuminated by streetlights and neon signs, with other pedestrians and vehicles visible. The rain continues to fall, creating a moody, atmospheric setting. Eventually, the man moves into a park area, where the rain has stopped, and the scene is brighter with autumnal trees and a cobblestone path. |||,|||360||| 480||| The video depicts a man walking along a waterfront promenade at sunset, wearing a black coat and a cap. The scene transitions from a serene, sunlit path lined with trees and benches to a bustling urban environment at night. The man continues his walk, moving from the tranquil, reflective waterside to a brightly lit bridge overlooking a cityscape filled with towering buildings and vibrant lights. The video captures the contrast between the calm, natural beauty of the sunset and the dynamic, illuminated city night. |||,|||240||| 360||| The video depicts a man walking down a city street in a video game. He is dressed in a dark trench coat and a hat, moving steadily through various urban environments. The cityscape transitions from a bustling, rainy street under an elevated train track to a quieter, more open area with modern architecture and streetlights illuminating the night. The man walks with purpose, navigating through both crowded and deserted areas, showcasing the dynamic and immersive environment of the game. |||,|||120||| 240||| A man navigating through various urban and park environments in a video game setting, transitioning from day to evening. The protagonist, dressed in a dark leather jacket and jeans, walks confidently down paths and city streets. 
As the video progresses, the environments shift from a serene park with autumnal trees to bustling city scenes under the glow of streetlights and neon signs. The atmosphere changes from the natural, soft lighting of sunset to the artificial, vivid lights of an urban night, enhancing the immersive experience of the game. |||,]'

Gymat commented 1 month ago

Hi, thanks for your question!

The "|||" notation is a delimiter that separates the fields within a caption. It plays the same role as a comma, ",", but a plain comma would be ambiguous here because the caption text itself contains commas. Using "|||" keeps the individual sections of a dense caption clearly distinguishable.

If you have any more questions, feel free to ask!

wren93 commented 1 month ago

Thanks for the answer. Does that mean |||0||| 120|||, |||120||| 240|||, etc. refer to different chunks of the video, or do they refer to the same video with captions generated under different settings? What do the numbers between the '|||' delimiters mean?

Gymat commented 1 week ago

@wren93 Sorry for the late reply. Segments like |||0||| 120||| refer to different time chunks of the same video. The numbers between the ||| delimiters are the start and end times of each segment, in seconds. For example, |||0||| 120||| indicates the portion of the video from 0 seconds to 120 seconds.
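
For anyone who wants to split a dense caption into its time chunks, here is a minimal sketch in Python. It is not code from the MiraData repo. The function name `parse_dense_caption` and the regular expression are hypothetical, and the layout they assume (`|||start||| end||| text |||,` repeated inside square brackets) is inferred only from the example posted in this issue. Note that the chunks in that example are not in chronological order (0-120, 360-480, 240-360, 120-240), so the sketch sorts them by start time.

```python
import re

# Assumed layout, inferred only from the example in this thread:
#   [|||<start>||| <end>||| <caption text> |||,|||<start>||| <end>||| ... |||,]
# The pattern name and the layout are assumptions, not part of the dataset docs.
SEGMENT_RE = re.compile(
    r"\|\|\|\s*(\d+(?:\.\d+)?)\s*\|\|\|\s*(\d+(?:\.\d+)?)\s*\|\|\|(.*?)\|\|\|,",
    re.DOTALL,  # caption text may contain line breaks
)


def parse_dense_caption(raw):
    """Return a list of (start_sec, end_sec, text) tuples sorted by start time."""
    segments = [
        # Collapse runs of whitespace (including newlines) into single spaces.
        (float(start), float(end), " ".join(text.split()))
        for start, end, text in SEGMENT_RE.findall(raw)
    ]
    return sorted(segments, key=lambda seg: seg[0])


# Shortened, made-up sample in the same layout as the example above.
sample = (
    "[|||0||| 120||| A man walks in the rain. |||,"
    "|||240||| 360||| He crosses a city street. |||,"
    "|||120||| 240||| He enters a park. |||,]"
)

for start, end, text in parse_dense_caption(sample):
    print(f"{start:g}-{end:g}s: {text}")
```

The pattern would fail if a caption ever contained the literal sequence `|||,` inside its text, so it is worth checking a few entries by hand before relying on it.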