Now that video2dataset is in a decent state we have a few large video webdatasets, and we want clip-video-encode to be able to read those videos and encode them into CLIP embeddings. How do we do that? For now let's just implement two modes in the main code; we can optimize more later.
clip_video_encode.py changes:
Add a parameter for input_format which can be "table" for the current default or "webdataset" for video2dataset output
If input_format is "webdataset" we don't want to read regular input or write regular output, i.e. we skip all the parquet-specific logic. A simple if statement on input_format handles this; do all the common setup before that branch.
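The branch could look roughly like this. Everything here is a sketch: `load_model`, `process_table`, and `process_webdataset` are hypothetical placeholders for the common setup and the two code paths, not the real clip-video-encode API:

```python
def clip_video_encode(src, dest, input_format="table"):
    """Sketch of the two-mode entry point; helper names are placeholders."""
    model = load_model()  # common setup shared by both modes runs first

    if input_format == "table":
        # current default: parquet/table input, regular output
        process_table(src, dest, model)
    elif input_format == "webdataset":
        # video2dataset shards: skip all parquet-specific logic
        process_webdataset(src, dest, model)
    else:
        raise ValueError(f"unknown input_format: {input_format!r}")


# Trivial stubs so the sketch runs standalone.
def load_model():
    return "clip-model"

def process_table(src, dest, model):
    print(f"table mode: {src}")

def process_webdataset(src, dest, model):
    print(f"webdataset mode: {src}")
```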
Distribute should distribute shards over all workers instead of videos
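Distributing whole shards instead of individual videos can be as simple as striding the shard list by worker rank (a minimal sketch; the real distribution may go through whatever multiprocessing setup clip-video-encode already uses):

```python
def distribute_shards(shards, world_size):
    """Assign whole shards to workers so each worker owns complete tars."""
    return [shards[rank::world_size] for rank in range(world_size)]

# distribute_shards(["00000.tar", "00001.tar", "00002.tar"], 2)
# → [["00000.tar", "00002.tar"], ["00001.tar"]]
```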
For each worker, iterate over its shards, crack each one open, and read samples from it. You can probably write a custom WebDatasetReader, e.g. EmbeddingWebDatasetReader, that reads each shard separately, extracts video paths and metadata, feeds the video paths into FrameReader, and then runs the normal clip-video-encode loop.
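Such a reader would probably be built on the `webdataset` package, but the core of "crack a shard open" is just grouping tar members by sample key. A stdlib-only sketch, where the `mp4`/`json` extensions are assumptions about the video2dataset shard layout:

```python
import json
import tarfile
from collections import defaultdict

def read_shard(shard_path):
    """Yield (key, video_bytes, metadata) for each sample in one shard."""
    samples = defaultdict(dict)
    with tarfile.open(shard_path) as tf:
        for member in tf.getmembers():
            if not member.isfile():
                continue
            # webdataset convention: files of one sample share a key, e.g.
            # "000.mp4" and "000.json" both belong to sample "000"
            key, _, ext = member.name.rpartition(".")
            samples[key][ext] = tf.extractfile(member).read()
    for key, parts in samples.items():
        meta = json.loads(parts["json"]) if "json" in parts else {}
        yield key, parts.get("mp4"), meta
```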
Write output shards as we read the input shards, i.e. each clip-video-encode output shard should contain the same samples as the corresponding video2dataset input shard.
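The writer side of that is one sample in, one sample out per shard. A hypothetical sketch, assuming `embedding` is the already-serialized bytes of the embedding (e.g. the output of `np.save` into a buffer) and the output shard reuses the input shard's name:

```python
import io
import json
import os
import tarfile

def write_output_shard(out_dir, shard_name, samples):
    """Write one output shard containing exactly the samples read from the
    matching input shard.

    samples: iterable of (key, embedding_bytes, metadata_dict).
    """
    path = os.path.join(out_dir, shard_name)
    with tarfile.open(path, "w") as tf:
        for key, embedding, meta in samples:
            for ext, data in ((".npy", embedding),
                              (".json", json.dumps(meta).encode())):
                info = tarfile.TarInfo(key + ext)
                info.size = len(data)
                tf.addfile(info, io.BytesIO(data))
    return path
```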