Multistage

We want to be able to run video2dataset multiple times on the same dataset so that cheap and expensive operations can be separated. Example: downloading and subsampling/filtering vs. CLIP embedding of frames. The first is cheap and can be done on CPUs, whereas the second is expensive and likely requires GPUs to finish in a timely manner. The solution is to generalize the notion of an input shard to any shard. The workflow would be (a code sketch follows the list):
1. Run video2dataset on your parquet with links + metadata and save the downloaded and subsampled dataset.
2. Run video2dataset on that downloaded and subsampled dataset and add the information you want to those shards (initially by reading a shard, processing it, deleting the old one, and saving a new one with the old + new data).
3. Repeat step 2 for every costly operation you want to run (dense optical flow, watermark removal, CLIP embedding, caption mining). Later we might want to fuse these steps in a meaningful way, but that might be hard.
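A rough sketch of how the two passes could look from Python. This is illustrative only: the `stage` argument in pass 2 is a hypothetical knob for selecting a shard-processing worker, and the exact parameters are assumptions rather than the settled video2dataset API.

```python
from video2dataset import video2dataset

# Pass 1 (cheap, CPU-only): download + subsample/filter from the link parquet.
video2dataset(
    url_list="links.parquet",
    input_format="parquet",
    output_format="webdataset",
    output_folder="dataset",
)

# Pass 2 (expensive, GPU): take the shards produced above as input and
# add new data (e.g. CLIP frame embeddings) to each sample.
video2dataset(
    url_list="dataset/{00000..00099}.tar",
    input_format="webdataset",
    output_format="webdataset",
    output_folder="dataset_clip",
    stage="clip_embedding",  # hypothetical: selects the shard-processing worker
)
```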
TODOs
Implement Identity Stage (❌)
[x] Update dataloader, move from example -> package
[x] Implement an identity or dummy stage (just to set up the pipeline: reading, deleting, saving again). We do this by defining a specific worker for a given stage and executing that worker instead of just the Worker we have now (see the sketch below).
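A minimal sketch of such a stage-specific worker, assuming webdataset-style tar shards. The `IdentityWorker` name and its read/rewrite loop are illustrative; the real worker would plug into the existing distribution machinery.

```python
import os
import tarfile


class IdentityWorker:
    """Pass-through worker: read a shard, keep every sample unchanged,
    delete the old shard, and save the new one in its place. A real
    stage would transform samples or append new keys instead."""

    def __call__(self, shard_path: str) -> None:
        tmp_path = shard_path + ".tmp"
        with tarfile.open(shard_path, "r") as src, tarfile.open(tmp_path, "w") as dst:
            for member in src:
                if not member.isfile():
                    continue
                # Identity: copy each entry's bytes as-is.
                dst.addfile(member, src.extractfile(member))
        # Atomically replace the old shard with the rewritten one.
        os.replace(tmp_path, shard_path)
```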
Implement open_clip Stage (❌)
[ ] Implement an open_clip stage which allows you to pick a model and initialize it
[ ] Load the video and iterate over its frames
[ ] Apply an open_clip transform to each frame or to a specific one (e.g. embed every frame, or caption the center frame); see the sketch below
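A sketch of the frame-embedding part, assuming OpenCV for decoding; the actual stage may use a different decoder and would read samples out of shards rather than from a file path. `embed_frames` and its `every_n` parameter are hypothetical names.

```python
import cv2
import open_clip
import torch
from PIL import Image

# Pick an open_clip model and get its matching preprocessing transform.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()


def embed_frames(video_path: str, every_n: int = 10) -> torch.Tensor:
    """Decode a video and CLIP-embed every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    embeddings = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV yields BGR arrays; the transform expects a PIL RGB image.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                emb = model.encode_image(preprocess(image).unsqueeze(0))
            embeddings.append(emb.squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(embeddings)
```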
Testing (❌)