swiss-ai / ml-4m

4M: Massively Multimodal Masked Modeling (NeurIPS 2023 Spotlight)
Apache License 2.0

Transform from v2d format into a metadata format and save in `metadata/` directory. #16

Open kdu4108 opened 1 month ago

kdu4108 commented 1 month ago

Goal: given v2d format of

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     ├── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     └── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 |     └── ...
 └── ...

produce a metadata/ modality data folder of the following format:

root/metadata/shard-00000.tar
 |     ├── 00000.json # this corresponds to one video.
 |     ├── 00001.json
 |     └── ...

Each json should look something like

{
    "video": {
        "fps": 10,
        "resolution": [512, 512],
        "dataset": "howto100m"
    }
}

(Exact format and required keys are TBD, since there should probably be more than just video metadata here. Maybe caption quality would be nice? 1st-person vs. 3rd-person viewpoint? What else?)
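
A minimal sketch of the shard transformation described above, assuming each v2d shard is a plain tar whose per-video `.json` members can be read with the standard library. The names `v2d_shard_to_metadata` and `build_record` are hypothetical, and since the required keys are still TBD, the per-video mapping is passed in as a callable rather than fixed:

```python
import io
import json
import tarfile
from pathlib import Path
from typing import Callable


def v2d_shard_to_metadata(
    v2d_tar: Path,
    out_dir: Path,
    build_record: Callable[[dict], dict],
) -> Path:
    """Write one metadata shard (one JSON per video) for one v2d shard.

    `build_record` (hypothetical) maps the v2d per-video JSON to the
    metadata JSON; its output keys are an open question in this issue.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # 00000.tar -> metadata/shard-00000.tar
    out_path = out_dir / f"shard-{v2d_tar.stem}.tar"

    with tarfile.open(v2d_tar, "r") as src, tarfile.open(out_path, "w") as dst:
        for member in src.getmembers():
            # Skip the .mp4 and .txt members; only the .json files are needed.
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            v2d_meta = json.load(src.extractfile(member))
            payload = json.dumps(build_record(v2d_meta)).encode("utf-8")

            # Keep the same per-video file name (e.g. 00001.json).
            info = tarfile.TarInfo(name=Path(member.name).name)
            info.size = len(payload)
            dst.addfile(info, io.BytesIO(payload))

    return out_path
```

Called as `v2d_shard_to_metadata(Path("00000.tar"), Path("root/metadata"), build_record)`, this would produce `root/metadata/shard-00000.tar`. Note that JSON has no tuple type, so a resolution would be stored as a two-element list.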

kdu4108 commented 1 month ago

@garjania What other metadata would you recommend being in here for MVP?

kdu4108 commented 1 month ago

Also, which other fields from v2d's YouTube metadata do you think would be useful to include?


"yt_meta_dict": {
        "info": {
            "id": "QW3-5OuWn4M",
            "title": "IBM SPSS",
            "thumbnail": "https://i.ytimg.com/vi/QW3-5OuWn4M/maxresdefault.jpg",
            "description": "For the past five years, King Fish has been creating a media channel for IBM to generate leads of senior IT decision makers and retain current customers.  We produce dozens of webcasts every year for numerous divisions within IBM. King Fish provides managed services, original content and audience development. \n\nKFM worked with IBM to develop video content on how SPSS Statistics can help their clients meet business goals with advanced data insight methods. The result? Much more effective than an info-graphic.",
            "uploader": "King Fish Media",
            "uploader_id": "KingFishMediaBoston",
            "uploader_url": "http://www.youtube.com/user/KingFishMediaBoston",
            "channel_id": "UCDy7Xb5vYxbmSosQmztCCcQ",
            "channel_url": "https://www.youtube.com/channel/UCDy7Xb5vYxbmSosQmztCCcQ",
            "duration": 122,
            "view_count": 116,
            "average_rating": null,
            "age_limit": 0,
            "webpage_url": "https://www.youtube.com/watch?v=QW3-5OuWn4M",
            "categories": [
                "Science & Technology"
            ],
            "tags": [
                "IBM",
                "technology",
                "statistics",
                "data",
                "analysis",
                "computers",
                "content marketing",
                "Software"
            ],
            "playable_in_embed": true,
            "live_status": "not_live",
            "release_timestamp": null,
            "comment_count": null,
            "chapters": null,
            "like_count": 1,
            "channel": "King Fish Media",
            "channel_follower_count": 10,
            "upload_date": "20131107",
            "availability": "public",
            "original_url": "http://youtube.com/watch?v=QW3-5OuWn4M",
            "webpage_url_basename": "watch",
            "webpage_url_domain": "youtube.com",
            "extractor": "youtube",
            "extractor_key": "Youtube",
            "playlist": null,
            "playlist_index": null,
            "display_id": "QW3-5OuWn4M",
            "fulltitle": "IBM SPSS",
            "duration_string": "2:02",
            "is_live": false,
            "was_live": false,
            "requested_subtitles": {
                "en": {
                    "ext": "vtt",
                    "url": "https://www.youtube.com/api/timedtext?v=QW3-5OuWn4M&caps=asr&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1676200746&sparams=ip%2Cipbits%2Cexpire%2Cv%2Ccaps%2Cxoaf&signature=A43F4C223A9DBC7E3BFBC61027FC5AF70D709AB5.B386EB52DD412DEFC3E8DBBCF7F30C442473CDA4&key=yt8&kind=asr&lang=en&fmt=vtt",
                    "name": "English"
                }
            },
            "_has_drm": null,
            "format": "137 - 1920x1080 (1080p)+251 - audio only (medium)",
            "format_id": "137+251",
            "ext": "mkv",
            "protocol": "https+https",
            "language": null,
            "format_note": "1080p+medium",
            "filesize_approx": 12831366,
            "tbr": 841.009,
            "width": 1920,
            "height": 1080,
            "resolution": "1920x1080",
            "fps": 30,
            "dynamic_range": "SDR",
            "vcodec": "avc1.640028",
            "vbr": 691.069,
            "stretched_ratio": null,
            "acodec": "opus",
            "abr": 149.94,
            "asr": 48000,
            "audio_channels": 2
        }
    },
(from https://github.com/iejMac/video2dataset/blob/main/examples/yt_metadata.md)

garjania commented 1 month ago

Maybe we can also include something like tags? Do we have access to them for the YouTube dataset?

garjania commented 1 month ago

Besides fps and resolution, the other fields don't seem to provide useful information.
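
To make the suggestion concrete, here is a rough sketch that keeps only fps, resolution, and tags from the YouTube metadata shown earlier in the thread. It assumes `yt_meta_dict` is a top-level key of each v2d per-video JSON; the function name `extract_video_fields` and the placement of `tags` under `video` are guesses, not an agreed format. The YouTube `fps`, `width`, and `height` describe the source stream and may differ from the clip that v2d actually stored.

```python
import json


def extract_video_fields(v2d_meta: dict, dataset: str) -> dict:
    """Keep only fps, resolution, and tags from the v2d YouTube metadata.

    Missing fields become None (or an empty list for tags) so that
    videos without YouTube metadata still get a record.
    """
    info = (v2d_meta.get("yt_meta_dict") or {}).get("info") or {}
    width, height = info.get("width"), info.get("height")
    return {
        "video": {
            "fps": info.get("fps"),
            # JSON has no tuple type, so resolution is a [width, height] list.
            "resolution": [width, height] if width and height else None,
            "dataset": dataset,
            "tags": info.get("tags") or [],
        }
    }


# Example using a few of the fields from the sample above.
sample = {
    "yt_meta_dict": {
        "info": {
            "fps": 30,
            "width": 1920,
            "height": 1080,
            "tags": ["IBM", "technology", "statistics"],
            "view_count": 116,
        }
    }
}
print(json.dumps(extract_video_fields(sample, "howto100m"), indent=4))
```

Everything outside the selected keys (for example `view_count`) is dropped, which matches the view that the remaining fields add little.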