iejMac / clip-video-encode

Easily compute clip embeddings from video frames
MIT License

implement reading video2dataset webdataset format #63

Closed iejMac closed 1 year ago

iejMac commented 1 year ago

Now that video2dataset is in a decent state, we have a few large video webdatasets, and we want clip-video-encode to be able to read those videos and encode them into CLIP embeddings. How do we do that? For now, let's just implement two modes in the main code; we can optimize it further later.

clip_video_encode.py changes:

  1. Add an input_format parameter, which can be "table" (the current default) or "webdataset" (video2dataset output).
  2. If input_format is "webdataset", we don't want to run the code that reads the regular input and writes the regular output, so basically skip all parquet-specific things. We can do this with an if statement on input_format. Do everything common to both modes before that branch.
  3. The distribution step should split shards, rather than individual videos, across all workers.
  4. Each worker iterates over its shards, opens each one, and reads the samples from it. You can probably write a custom WebDatasetReader, similar to EmbeddingWebDatasetReader, that reads each shard separately and extracts the video paths and metadata; then pass the video paths into FrameReader and reuse the normal clip-video-encode loop.
  5. Write shards as we read them, i.e. the clip-video-encode output shards should contain the same samples as the corresponding video2dataset shards.
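
A rough sketch of steps 3 and 4, to make the idea concrete. This is not the clip-video-encode implementation: distribute_shards, split_key and read_shard are hypothetical helper names, and the sketch assumes the usual webdataset tar layout, where files that share a basename (e.g. 0001.mp4 and 0001.json) form one sample and are stored next to each other. It uses only the standard library tarfile module rather than the webdataset package.

```python
import io
import os
import tarfile


def distribute_shards(shards, world_size):
    """Round-robin split so every worker receives whole shards (hypothetical helper)."""
    return [shards[rank::world_size] for rank in range(world_size)]


def split_key(name):
    """Split 'dir/0001.mp4' into the sample key 'dir/0001' and the extension 'mp4'."""
    dirname, basename = os.path.split(name)
    stem, _, ext = basename.partition(".")
    key = f"{dirname}/{stem}" if dirname else stem
    return key, ext


def read_shard(shard_path):
    """Yield (key, {extension: bytes}) for each sample in one tar shard.

    Assumption: all files of a sample are contiguous in the tar.
    """
    with tarfile.open(shard_path, "r") as tar:
        current_key, sample = None, {}
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            # A new key means the previous sample is complete.
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample[ext] = tar.extractfile(member).read()
        if sample:
            yield current_key, sample
```

Each worker would then loop over its own list from distribute_shards, call read_shard on every shard, hand the video bytes (or a temporary file path) to the frame reader, and write one output shard per input shard so the samples line up as described in step 5.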