swiss-ai / ml-4m

4M: Massively Multimodal Masked Modeling (NeurIPS 2023 Spotlight)
Apache License 2.0

[PARENT ISSUE] Data preprocessing and pseudolabeling #3

Open kdu4108 opened 3 months ago

kdu4108 commented 3 months ago

We want to get video RGB, video RGB tokens, video bounding boxes, video transcriptions, and video descriptions downloaded in a format that matches what 4M expects. Maybe that's something like

root/video_rgb/shard-00000.tar
root/video_rgb/shard-00001.tar
root/video_rgb/shard-00002.tar

root/video_tok_rgb/shard-00000.tar
root/video_tok_rgb/shard-00001.tar
root/video_tok_rgb/shard-00002.tar

root/video_det/shard-00000.tar
root/video_det/shard-00001.tar
root/video_det/shard-00002.tar

root/video_transcript/shard-00000.tar
root/video_transcript/shard-00001.tar
root/video_transcript/shard-00002.tar

root/video_description/shard-00000.tar
root/video_description/shard-00001.tar
root/video_description/shard-00002.tar

except I'm not sure, because maybe the text modalities should just be JSON Lines or something? This is very much just a suggestion: the first task is to decide what makes the most sense, the second is to implement it. Keep an eye on https://github.com/swiss-ai/ml-4m/pull/1 as well, since that PR loads the data and will need to be updated to match the decisions made here (e.g., right now it assumes text is saved as JSONL, which I picked somewhat arbitrarily and is definitely up for change).
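
For context on what consuming this layout would look like, here's a minimal sketch of reading such shards back with the webdataset library. The shard paths are the illustrative ones above, and the assumption that text modalities are stored as one JSONL file per video is, as noted, up for change:

```python
import json
import webdataset as wds

# Illustrative shard paths matching the proposed layout above.
rgb_shards = "root/video_rgb/shard-{00000..00002}.tar"
det_shards = "root/video_det/shard-{00000..00002}.tar"

# Each sample is a dict keyed by file extension, plus a "__key__" shared across modalities.
for sample in wds.WebDataset(rgb_shards):
    key = sample["__key__"]       # e.g. "00000"
    video_bytes = sample["mp4"]   # raw mp4 bytes for that video

# If text modalities are stored as per-video JSONL, each line is parsed separately.
for sample in wds.WebDataset(det_shards):
    frames = [json.loads(line) for line in sample["jsonl"].decode().splitlines()]
```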

To produce each modality:

- video_rgb: download with video2dataset, probably, plus some file-saving shenanigans to make it fit our naming/path formats/requirements.
- video_tok_rgb: run a (for now) pretrained tokenizer on the video_rgb files and save the outputs with the right file type and names/paths/etc. (sketched below).
- video_det: run the YOLO pseudolabeler on the video_rgb files and save appropriately (maybe as JSONL?).
- video_description: run ???something??? on the video_rgb files and save appropriately (maybe as JSONL?).
- video_transcript: run Whisper on the video_rgb files and save the transcripts appropriately (maybe as JSONL?). (We can also start with the default YouTube captions as an easier option so we don't bring Whisper into the mix yet.)
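
As a concrete example of the video_tok_rgb step, here is a rough sketch of running a pretrained tokenizer over one downloaded video and saving the tokens as .npy. The VideoTokenizer and load_video_frames names are hypothetical placeholders for whichever pretrained tokenizer and frame loader we end up using:

```python
import numpy as np
import torch

# Hypothetical placeholders: swap in the actual pretrained 4M tokenizer
# and a real frame loader (e.g. decord/torchvision) once those are decided.
from my_tokenizers import VideoTokenizer    # hypothetical
from my_video_io import load_video_frames   # hypothetical

tokenizer = VideoTokenizer.from_pretrained("path/to/checkpoint").eval().cuda()

def tokenize_video(mp4_path: str, out_path: str, batch_size: int = 16) -> None:
    """Tokenize one video_rgb sample into a single token array saved as .npy."""
    frames = load_video_frames(mp4_path)    # assumed (num_frames, C, H, W) uint8 tensor
    tokens = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            batch = frames[i:i + batch_size].float().cuda() / 255.0
            tokens.append(tokenizer.encode(batch).cpu().numpy())
    np.save(out_path, np.concatenate(tokens, axis=0))
```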

Thank you @yahya for taking the lead on implementing these steps. @garjania, if you could provide feedback/suggestions on the right format for saving these things and how it corresponds with video2dataset, that'd be super helpful! One concrete unknown to resolve first is how video2dataset stores files, and then whether we bend more to follow video2dataset or use it as an intermediary from which we extract the captions etc. and reshape them into this format for 4M. Also @vesteinn, in case you're familiar with v2d?

garjania commented 3 months ago

Regarding the save format, we can save them in any suitable format. Then, to make them compatible with webdataset, I can probably provide you with a script that converts any data format into tar shards; it basically packs clusters of sample points into each shard.
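
For reference, a minimal sketch of what such a script could boil down to (not the actual script; the shard size and paths are arbitrary), using only the standard library:

```python
import tarfile
from pathlib import Path

def write_shards(sample_files: list[Path], out_dir: Path, samples_per_shard: int = 1000) -> None:
    """Pack sorted per-sample files into shard-XXXXX.tar archives."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(sample_files), samples_per_shard):
        shard_path = out_dir / f"shard-{start // samples_per_shard:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for f in sample_files[start:start + samples_per_shard]:
                # arcname keeps only the basename so keys line up across modalities
                tar.add(f, arcname=f.name)

# e.g. pack all per-video detection JSONL files into root/video_det/
write_shards(sorted(Path("tmp/video_det").glob("*.jsonl")), Path("root/video_det"))
```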

kdu4108 commented 3 months ago

This is the video2dataset (v2d) format:

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     ├── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     └── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 |     └── ...
 ...

Leveraging this, we want to pseudolabel/preprocess that into our format for each modality:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

root/video_tok_rgb/shard-00000.tar
 |     ├── 00000.npy # this corresponds to one video. shape: something like (num_frames, H, C, W)
 |     ├── 00001.npy
 |     └── ...

root/video_det/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one frame.
 |     ├── 00001.jsonl
 |     └── ...

root/video_transcript/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

root/video_description/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...
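
For the video_rgb case, here is a minimal sketch of that repacking, assuming the member names follow the v2d layout above. The text modalities (det/transcript/description) would be produced by the pseudolabelers first and then packed the same way with matching keys:

```python
import tarfile
from pathlib import Path

def extract_video_rgb(v2d_tar: Path, root: Path) -> None:
    """Copy the .mp4 members of one v2d shard into a video_rgb shard, keeping the keys."""
    out_dir = root / "video_rgb"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_tar = out_dir / f"shard-{v2d_tar.stem}.tar"   # 00000.tar -> shard-00000.tar

    with tarfile.open(v2d_tar) as src, tarfile.open(out_tar, "w") as dst:
        for member in src.getmembers():
            if member.name.endswith(".mp4"):
                dst.addfile(member, src.extractfile(member))

extract_video_rgb(Path("v2d/00000.tar"), Path("root"))
```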

Some more notes on each of the modality representations, plus details/examples of what the JSONL files should look like for the text-based modalities (each file covers a single video):

video_det:

[
        # FRAME 0 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                {
                    "boxes": [
                        0.4229210317134857,
                        0.00020096010121051222,
                        0.5715101361274719,
                        0.13699540495872498
                    ],
                    "score": 0.9029952883720398,
                    "class_id": 74,
                    "class_name": "clock",
                    "segmentation": [
                        [
                            0.5055187637969095,
                            0.1337890625,
                            ...
                        ]
                    ]
                },
                {
                    "boxes": [
                        ...
                    ],
                    ...
                },
                    ...
            ]
        },
        # FRAME 1 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                ...,
            ],
            ...
        }
]
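
Assuming the YOLO pseudolabeler is the ultralytics package (a guess, not a decision), a rough sketch of producing one such JSONL entry per frame; the checkpoint name is a placeholder and segmentation polygons are omitted for brevity:

```python
import json

from ultralytics import YOLO  # assuming the YOLO pseudolabeler is ultralytics

model = YOLO("yolov8x-seg.pt")  # placeholder checkpoint

def detect_frame(frame) -> dict:
    """Run YOLO on one decoded RGB frame and return one video_det JSONL entry."""
    result = model(frame, verbose=False)[0]
    height, width = result.orig_shape
    instances = [
        {
            "boxes": box,                        # normalized [x0, y0, x1, y1]
            "score": score,
            "class_id": int(cls),
            "class_name": model.names[int(cls)],
            # polygons from result.masks.xyn would go under "segmentation"
        }
        for box, score, cls in zip(result.boxes.xyxyn.tolist(),
                                   result.boxes.conf.tolist(),
                                   result.boxes.cls.tolist())
    ]
    return {"num_instances": len(instances), "image_height": height,
            "image_width": width, "instances": instances}

def pseudolabel_video(frames, out_path: str) -> None:
    """Write one JSONL line per frame for a single video."""
    with open(out_path, "w") as f:
        for frame in frames:
            f.write(json.dumps(detect_frame(frame)) + "\n")
```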

video_transcript:

[
            {
                "transcript": "here's a transcript",
                "start_frame_index": 0,
                "end_frame_index": 5,
            },
            {
                "transcript": "here's another transcript",
                "start_frame_index": 10,
                "end_frame_index": 13,
            } 
]
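
If/when we do bring Whisper into the mix, a rough sketch of mapping openai-whisper segment timestamps onto frame indices (the fps value would come from the video metadata; it's a parameter here):

```python
import json
import whisper  # openai-whisper

model = whisper.load_model("base")  # placeholder model size

def transcribe_video(video_path: str, out_path: str, fps: float) -> None:
    """Write one video_transcript JSONL line per Whisper segment."""
    result = model.transcribe(video_path)
    with open(out_path, "w") as f:
        for seg in result["segments"]:
            f.write(json.dumps({
                "transcript": seg["text"].strip(),
                "start_frame_index": int(seg["start"] * fps),
                "end_frame_index": int(seg["end"] * fps),
            }) + "\n")

transcribe_video("00000.mp4", "00000.jsonl", fps=30.0)
```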

video_description:

[
            {
                "description": "here's a description",
                "start_frame_index": 0,
                "end_frame_index": 5,
            },
            {
                "description": "here's another description",
                "start_frame_index": 5,
                "end_frame_index": 12,
            } 
]

kdu4108 commented 3 months ago

Let's break down the steps here a little bit to get from start to finish.

Step 1: Download data in v2d format (https://github.com/swiss-ai/ml-4m/issues/7).
Step 2: Transform from v2d format into video_rgb format and save in the video_rgb/ directory (https://github.com/swiss-ai/ml-4m/issues/10).
Step 3: Transform from video_rgb format into video_tok_rgb format and save in the video_tok_rgb/ directory (https://github.com/swiss-ai/ml-4m/issues/9).
Step 4: Transform from video_rgb format into video_det format and save in the video_det/ directory (https://github.com/swiss-ai/ml-4m/issues/11).
Step 5: Transform from v2d format into video_transcript format and save in the video_transcript/ directory (https://github.com/swiss-ai/ml-4m/issues/12).
Step 6: Transform from v2d format into video_description format and save in the video_description/ directory (https://github.com/swiss-ai/ml-4m/issues/13).
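
To make the dependencies explicit in code, a hypothetical orchestration sketch; every function name is a placeholder for the per-step script tracked in the linked issues:

```python
# All functions are hypothetical placeholders for the scripts from the issues above.
from pipeline import (download_v2d, v2d_to_video_rgb, rgb_to_tok_rgb,
                      rgb_to_det, v2d_to_transcript, v2d_to_description)

ROOT = "root"  # placeholder for the actual data root

v2d_dir = download_v2d(out_dir=f"{ROOT}/v2d")               # Step 1
rgb_dir = v2d_to_video_rgb(v2d_dir, f"{ROOT}/video_rgb")    # Step 2
rgb_to_tok_rgb(rgb_dir, f"{ROOT}/video_tok_rgb")            # Step 3: depends on video_rgb
rgb_to_det(rgb_dir, f"{ROOT}/video_det")                    # Step 4: depends on video_rgb
v2d_to_transcript(v2d_dir, f"{ROOT}/video_transcript")      # Step 5: depends on v2d output
v2d_to_description(v2d_dir, f"{ROOT}/video_description")    # Step 6: depends on v2d output
```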

[image] This image summarizes the dependency graph of which data type is transformed into which (as well as the file representations within each of those data types/modalities).

kdu4108 commented 2 months ago

Let's use /store/swissai/a08/data/4m-data as the root dir for storing the data.

kdu4108 commented 2 months ago

Updated data design: [image]

kdu4108 commented 2 months ago

This change requires:

kdu4108 commented 2 months ago

Also, 4M allows for specifying multiple datasets, so we don't need to actually combine them into one big pool! See ml-4m/cfgs/default/4m/data/cc12m+coyo+c4/main/mix_mod21_all2allmix_rgb2all_capT5bias_C4.yaml for an example.