swiss-ai / ml-4m

4M: Massively Multimodal Masked Modeling (NeurIPS 2023 Spotlight)
Apache License 2.0

[PARENT ISSUE] Data preprocessing and pseudolabeling #3

Open kdu4108 opened 3 months ago

kdu4108 commented 3 months ago

We want to get video RGB, video RGB tokens, video bounding boxes, video transcriptions, and video descriptions downloaded in a format that matches what 4M expects. Maybe that's something like

root/video_rgb/shard-00000.tar
root/video_rgb/shard-00001.tar
root/video_rgb/shard-00002.tar

root/video_tok_rgb/shard-00000.tar
root/video_tok_rgb/shard-00001.tar
root/video_tok_rgb/shard-00002.tar

root/video_det/shard-00000.tar
root/video_det/shard-00001.tar
root/video_det/shard-00002.tar

root/video_transcript/shard-00000.tar
root/video_transcript/shard-00001.tar
root/video_transcript/shard-00002.tar

root/video_description/shard-00000.tar
root/video_description/shard-00001.tar
root/video_description/shard-00002.tar

except I'm not sure, because maybe the text modalities should just be JSON Lines or something? This is very much just a suggestion: the first task is to decide what makes the most sense, the second is to implement it. Keep an eye on https://github.com/swiss-ai/ml-4m/pull/1 as well, since that PR loads the data and will need to be updated to match the decisions made here (e.g., right now it assumes text is saved as JSONL, which I picked somewhat arbitrarily and is definitely up for change).
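
For context on what consuming this layout would look like, here's a minimal sketch of reading such shards back with the webdataset library. The shard paths are the illustrative ones above, and the assumption that text modalities are stored as one JSONL file per video is, as noted, up for change:

```python
import json
import webdataset as wds

# Illustrative shard paths matching the proposed layout above.
rgb_shards = "root/video_rgb/shard-{00000..00002}.tar"
det_shards = "root/video_det/shard-{00000..00002}.tar"

# Each sample is a dict keyed by file extension, plus a "__key__" shared across modalities.
for sample in wds.WebDataset(rgb_shards):
    key = sample["__key__"]       # e.g. "00000"
    video_bytes = sample["mp4"]   # raw mp4 bytes for that video

# If text modalities are stored as per-video JSONL, each line is parsed separately.
for sample in wds.WebDataset(det_shards):
    frames = [json.loads(line) for line in sample["jsonl"].decode().splitlines()]
```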

To produce each modality:

- video_rgb: download with video2dataset, probably, plus some file-saving shenanigans to make it fit our naming/path formats/requirements.
- video_tok_rgb: run a (for now) pretrained tokenizer on the video_rgb files and save the outputs with the right file type and names/paths/etc. (sketched below).
- video_det: run the YOLO pseudolabeler on the video_rgb files and save appropriately (maybe as JSONL?).
- video_description: run ???something??? on the video_rgb files and save appropriately (maybe as JSONL?).
- video_transcript: run Whisper on the video_rgb files and save the transcripts appropriately (maybe as JSONL?). (We can also start with the default YouTube captions as an easier option so we don't bring Whisper into the mix yet.)
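
As a concrete example of the video_tok_rgb step, here is a rough sketch of running a pretrained tokenizer over one downloaded video and saving the tokens as .npy. The VideoTokenizer and load_video_frames names are hypothetical placeholders for whichever pretrained tokenizer and frame loader we end up using:

```python
import numpy as np
import torch

# Hypothetical placeholders: swap in the actual pretrained 4M tokenizer
# and a real frame loader (e.g. decord/torchvision) once those are decided.
from my_tokenizers import VideoTokenizer    # hypothetical
from my_video_io import load_video_frames   # hypothetical

tokenizer = VideoTokenizer.from_pretrained("path/to/checkpoint").eval().cuda()

def tokenize_video(mp4_path: str, out_path: str, batch_size: int = 16) -> None:
    """Tokenize one video_rgb sample into a single token array saved as .npy."""
    frames = load_video_frames(mp4_path)    # assumed (num_frames, C, H, W) uint8 tensor
    tokens = []
    with torch.no_grad():
        for i in range(0, len(frames), batch_size):
            batch = frames[i:i + batch_size].float().cuda() / 255.0
            tokens.append(tokenizer.encode(batch).cpu().numpy())
    np.save(out_path, np.concatenate(tokens, axis=0))
```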

Thank you @yahya for taking the lead on implementing these steps. @garjania, if you could provide feedback/suggestions on the right format for saving these things and how it corresponds with video2dataset, that'd be super helpful! One concrete unknown to resolve first is how video2dataset stores files, and then whether we bend more to follow video2dataset or use it as an intermediary from which we extract the captions etc. and reshape them into this format for 4M. Also @vesteinn, in case you're familiar with v2d?

garjania commented 3 months ago

Regarding the save format, we can save them in any suitable format. Then, to make them compatible with webdataset, I can probably provide you with a script that converts any data format into tar shards; it basically packs clusters of sample points into each shard.
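
For reference, a minimal sketch of what such a script could boil down to (not the actual script; the shard size and paths are arbitrary), using only the standard library:

```python
import tarfile
from pathlib import Path

def write_shards(sample_files: list[Path], out_dir: Path, samples_per_shard: int = 1000) -> None:
    """Pack sorted per-sample files into shard-XXXXX.tar archives."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(sample_files), samples_per_shard):
        shard_path = out_dir / f"shard-{start // samples_per_shard:05d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for f in sample_files[start:start + samples_per_shard]:
                # arcname keeps only the basename so keys line up across modalities
                tar.add(f, arcname=f.name)

# e.g. pack all per-video detection JSONL files into root/video_det/
write_shards(sorted(Path("tmp/video_det").glob("*.jsonl")), Path("root/video_det"))
```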

kdu4108 commented 3 months ago

This is the video2dataset (v2d) format:

 ├── 00000.tar
 |     ├── 00000.mp4
 |     ├── 00000.txt
 |     ├── 00000.json
 |     ├── 00001.mp4
 |     ├── 00001.txt
 |     ├── 00001.json
 |     ├── ...
 |     ├── 10000.mp4
 |     ├── 10000.txt
 |     └── 10000.json
 ├── 00001.tar
 |     ├── 10001.mp4
 |     ├── 10001.txt
 |     ├── 10001.json
 |     └── ...
 ...

Leveraging this, we want to pseudolabel/preprocess that into our format for each modality:

root/video_rgb/shard-00000.tar
 |     ├── 00000.mp4 # this corresponds to one video.
 |     ├── 00001.mp4
 |     └── ...

root/video_tok_rgb/shard-00000.tar
 |     ├── 00000.npy # this corresponds to one video. shape: something like (num_frames, H, C, W)
 |     ├── 00001.npy
 |     └── ...

root/video_det/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one frame.
 |     ├── 00001.jsonl
 |     └── ...

root/video_transcript/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...

root/video_description/shard-00000.tar
 |     ├── 00000.jsonl # this corresponds to one video. each line within it corresponds to one subsequence of frames.
 |     ├── 00001.jsonl
 |     └── ...
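
For the video_rgb case, here is a minimal sketch of that repacking, assuming the member names follow the v2d layout above. The text modalities (det/transcript/description) would be produced by the pseudolabelers first and then packed the same way with matching keys:

```python
import tarfile
from pathlib import Path

def extract_video_rgb(v2d_tar: Path, root: Path) -> None:
    """Copy the .mp4 members of one v2d shard into a video_rgb shard, keeping the keys."""
    out_dir = root / "video_rgb"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_tar = out_dir / f"shard-{v2d_tar.stem}.tar"   # 00000.tar -> shard-00000.tar

    with tarfile.open(v2d_tar) as src, tarfile.open(out_tar, "w") as dst:
        for member in src.getmembers():
            if member.name.endswith(".mp4"):
                dst.addfile(member, src.extractfile(member))

extract_video_rgb(Path("v2d/00000.tar"), Path("root"))
```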

Some more notes on each of the modality representations, plus details/examples of what the JSONL files should look like for the text-based modalities (each file covers a single video):

video_det:

[
        # FRAME 0 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                {
                    "boxes": [
                        0.4229210317134857,
                        0.00020096010121051222,
                        0.5715101361274719,
                        0.13699540495872498
                    ],
                    "score": 0.9029952883720398,
                    "class_id": 74,
                    "class_name": "clock",
                    "segmentation": [
                        [
                            0.5055187637969095,
                            0.1337890625,
                            ...
                        ]
                    ]
                },
                {
                    "boxes": [
                        ...
                    ],
                    ...
                },
                    ...
            ]
        },
        # FRAME 1 Bounding boxes
        {
            "num_instances": 5,
            "image_height": 512,
            "image_width": 906,
            "instances": [
                ...,
            ],
            ...
        }
]
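
Assuming the YOLO pseudolabeler is the ultralytics package (a guess, not a decision), a rough sketch of producing one such JSONL entry per frame; the checkpoint name is a placeholder and segmentation polygons are omitted for brevity:

```python
import json

from ultralytics import YOLO  # assuming the YOLO pseudolabeler is ultralytics

model = YOLO("yolov8x-seg.pt")  # placeholder checkpoint

def detect_frame(frame) -> dict:
    """Run YOLO on one decoded RGB frame and return one video_det JSONL entry."""
    result = model(frame, verbose=False)[0]
    height, width = result.orig_shape
    instances = [
        {
            "boxes": box,                        # normalized [x0, y0, x1, y1]
            "score": score,
            "class_id": int(cls),
            "class_name": model.names[int(cls)],
            # polygons from result.masks.xyn would go under "segmentation"
        }
        for box, score, cls in zip(result.boxes.xyxyn.tolist(),
                                   result.boxes.conf.tolist(),
                                   result.boxes.cls.tolist())
    ]
    return {"num_instances": len(instances), "image_height": height,
            "image_width": width, "instances": instances}

def pseudolabel_video(frames, out_path: str) -> None:
    """Write one JSONL line per frame for a single video."""
    with open(out_path, "w") as f:
        for frame in frames:
            f.write(json.dumps(detect_frame(frame)) + "\n")
```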

video_transcript:

[
            {
                "transcript": "here's a transcript",
                "start_frame_index": 0,
                "end_frame_index": 5,
            },
            {
                "transcript": "here's another transcript",
                "start_frame_index": 10,
                "end_frame_index": 13,
            } 
]
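
If/when we do bring Whisper into the mix, a rough sketch of mapping openai-whisper segment timestamps onto frame indices (the fps value would come from the video metadata; it's a parameter here):

```python
import json
import whisper  # openai-whisper

model = whisper.load_model("base")  # placeholder model size

def transcribe_video(video_path: str, out_path: str, fps: float) -> None:
    """Write one video_transcript JSONL line per Whisper segment."""
    result = model.transcribe(video_path)
    with open(out_path, "w") as f:
        for seg in result["segments"]:
            f.write(json.dumps({
                "transcript": seg["text"].strip(),
                "start_frame_index": int(seg["start"] * fps),
                "end_frame_index": int(seg["end"] * fps),
            }) + "\n")

transcribe_video("00000.mp4", "00000.jsonl", fps=30.0)
```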

video_description:

[
            {
                "description": "here's a description",
                "start_frame_index": 0,
                "end_frame_index": 5,
            },
            {
                "description": "here's another description",
                "start_frame_index": 5,
                "end_frame_index": 12,
            } 
]

kdu4108 commented 3 months ago

Let's break down the steps here a little bit to get from start to finish.

Step 1: Download data in v2d format (https://github.com/swiss-ai/ml-4m/issues/7).
Step 2: Transform from v2d format into video_rgb format and save in the video_rgb/ directory (https://github.com/swiss-ai/ml-4m/issues/10).
Step 3: Transform from video_rgb format into video_tok_rgb format and save in the video_tok_rgb/ directory (https://github.com/swiss-ai/ml-4m/issues/9).
Step 4: Transform from video_rgb format into video_det format and save in the video_det/ directory (https://github.com/swiss-ai/ml-4m/issues/11).
Step 5: Transform from v2d format into video_transcript format and save in the video_transcript/ directory (https://github.com/swiss-ai/ml-4m/issues/12).
Step 6: Transform from v2d format into video_description format and save in the video_description/ directory (https://github.com/swiss-ai/ml-4m/issues/13).
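
To make the dependencies explicit in code, a hypothetical orchestration sketch; every function name is a placeholder for the per-step script tracked in the linked issues:

```python
# All functions are hypothetical placeholders for the scripts from the issues above.
from pipeline import (download_v2d, v2d_to_video_rgb, rgb_to_tok_rgb,
                      rgb_to_det, v2d_to_transcript, v2d_to_description)

ROOT = "root"  # placeholder for the actual data root

v2d_dir = download_v2d(out_dir=f"{ROOT}/v2d")               # Step 1
rgb_dir = v2d_to_video_rgb(v2d_dir, f"{ROOT}/video_rgb")    # Step 2
rgb_to_tok_rgb(rgb_dir, f"{ROOT}/video_tok_rgb")            # Step 3: depends on video_rgb
rgb_to_det(rgb_dir, f"{ROOT}/video_det")                    # Step 4: depends on video_rgb
v2d_to_transcript(v2d_dir, f"{ROOT}/video_transcript")      # Step 5: depends on v2d output
v2d_to_description(v2d_dir, f"{ROOT}/video_description")    # Step 6: depends on v2d output
```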

[image] This image summarizes the dependency graph of which data type is transformed into which (as well as the file representations within each of those data types/modalities).

kdu4108 commented 2 months ago

Let's use /store/swissai/a08/data/4m-data as the root dir for storing the data.

kdu4108 commented 2 months ago

Updated data design: [image]

kdu4108 commented 2 months ago

This change requires:

kdu4108 commented 2 months ago

Also, 4M allows for specifying multiple datasets, so we don't need to actually combine them into one big pool! See ml-4m/cfgs/default/4m/data/cc12m+coyo+c4/main/mix_mod21_all2allmix_rgb2all_capT5bias_C4.yaml for an example.