iejMac / video2dataset

Easily create large video dataset from video urls
MIT License
548 stars 65 forks

Add CLIP embedding / CoCa captioning stage (first example of multistage) #112

Open iejMac opened 1 year ago

iejMac commented 1 year ago

Multistage

We want to be able to run video2dataset multiple times on the same dataset, keeping cheap operations separate from expensive ones. Example: downloading and subsampling/filtering vs. CLIP embedding of frames. The first is cheap and can be done on CPU, whereas the second is expensive and likely requires GPUs to finish in a timely manner. The solution is to generalize the notion of an input shard so that any shard, including one produced by an earlier run, can serve as input. The workflow would be:

  1. Run video2dataset on your parquet with links + metadata and save the downloaded and subsampled dataset
  2. Run video2dataset on the downloaded and subsampled dataset and add the information you want to those shards (initially by reading each shard, processing it, deleting the old shard, and saving a new one with the old + new data)
  3. Repeat step 2 for every costly step you want to run (dense optical flow, watermark removal, CLIP embedding, caption mining). Later we might want to fuse these steps if we can do so in a meaningful way, but this might be hard
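
A rough sketch of the read / process / replace loop from step 2, to make the idea concrete. This is not video2dataset code: it assumes shards are plain tar archives whose members are named `<key>.<ext>` (webdataset-style), and `read_shard`, `write_shard`, `run_stage` and `identity_stage` are hypothetical names made up for illustration. A stage returns only the new fields for a sample, and the runner merges them with the old data.

```python
import io
import os
import tarfile
from typing import Callable, Dict

# One sample: extension -> raw bytes (e.g. {"mp4": b"...", "json": b"..."})
Sample = Dict[str, bytes]


def read_shard(path: str) -> Dict[str, Sample]:
    """Group tar members by sample key (member name up to the first dot)."""
    samples: Dict[str, Sample] = {}
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


def write_shard(path: str, samples: Dict[str, Sample]) -> None:
    """Write samples back as one tar member per (key, extension) pair."""
    with tarfile.open(path, "w") as tar:
        for key in sorted(samples):
            for ext, data in sorted(samples[key].items()):
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))


def run_stage(shard_path: str, stage: Callable[[Sample], Sample]) -> None:
    """Read a shard, apply `stage` per sample, then replace the old shard."""
    samples = read_shard(shard_path)
    # Keep the old fields and add whatever new fields the stage returns.
    merged = {key: {**sample, **stage(sample)} for key, sample in samples.items()}
    # Write next to the original first, then swap, so a crash mid-write
    # does not destroy the old shard.
    tmp_path = shard_path + ".tmp"
    write_shard(tmp_path, merged)
    os.replace(tmp_path, shard_path)


def identity_stage(sample: Sample) -> Sample:
    """Adds nothing: the shard should round-trip with its contents unchanged."""
    return {}
```

An expensive stage (e.g. CLIP embedding) would plug in the same way as `identity_stage`, returning something like `{"npy": embedding_bytes}` for each sample, and step 3 is then just calling `run_stage` once per stage on every shard.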

TODOs

Implement Identity Stage (❌)

Implement open_clip Stage (❌)

Testing (❌)

iejMac commented 1 year ago

currently outsourced to clip-video-encode