Multistage

We want to be able to run video2dataset multiple times on the same dataset so that cheap and expensive operations can be separated. Example: downloading and subsampling/filtering vs. CLIP embedding of frames. The first is cheap and can be done on CPUs, whereas the second is expensive and likely requires GPUs to finish in a timely manner. The solution is to generalize the notion of an input shard to any shard. The workflow would be (a code sketch follows the list):
1. Run video2dataset on your parquet with links + metadata and save the downloaded and subsampled dataset.
2. Run video2dataset on that downloaded and subsampled dataset and add the information you want to those shards (initially by reading a shard, processing it, deleting the old one, and saving a new one with the old + new data).
3. Repeat step 2 for every costly operation you want to run (dense optical flow, watermark removal, CLIP embedding, caption mining). Later we might want to fuse these steps in a meaningful way, but that might be hard.
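A rough sketch of how the two passes could look from Python. This is illustrative only: the `stage` argument in pass 2 is a hypothetical knob for selecting a shard-processing worker, and the exact parameters are assumptions rather than the settled video2dataset API.

```python
from video2dataset import video2dataset

# Pass 1 (cheap, CPU-only): download + subsample/filter from the link parquet.
video2dataset(
    url_list="links.parquet",
    input_format="parquet",
    output_format="webdataset",
    output_folder="dataset",
)

# Pass 2 (expensive, GPU): take the shards produced above as input and
# add new data (e.g. CLIP frame embeddings) to each sample.
video2dataset(
    url_list="dataset/{00000..00099}.tar",
    input_format="webdataset",
    output_format="webdataset",
    output_folder="dataset_clip",
    stage="clip_embedding",  # hypothetical: selects the shard-processing worker
)
```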
TODOs
Implement Identity Stage (❌)
[x] Update dataloader, move from example -> package
[x] Implement an identity or dummy stage (just to set up the pipeline: reading, deleting, saving again). We do this by defining a specific worker for a given stage and executing that worker instead of just the Worker we have now (see the sketch below).
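A minimal sketch of such a stage-specific worker, assuming webdataset-style tar shards. The `IdentityWorker` name and its read/rewrite loop are illustrative; the real worker would plug into the existing distribution machinery.

```python
import os
import tarfile


class IdentityWorker:
    """Pass-through worker: read a shard, keep every sample unchanged,
    delete the old shard, and save the new one in its place. A real
    stage would transform samples or append new keys instead."""

    def __call__(self, shard_path: str) -> None:
        tmp_path = shard_path + ".tmp"
        with tarfile.open(shard_path, "r") as src, tarfile.open(tmp_path, "w") as dst:
            for member in src:
                if not member.isfile():
                    continue
                # Identity: copy each entry's bytes as-is.
                dst.addfile(member, src.extractfile(member))
        # Atomically replace the old shard with the rewritten one.
        os.replace(tmp_path, shard_path)
```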
Implement open_clip Stage (❌)
[ ] Implement an open_clip stage which allows you to pick a model and initialize it
[ ] Load the video and iterate over its frames
[ ] Apply an open_clip transform to each frame or to a specific one (e.g. embed every frame, or caption the center frame); see the sketch below
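A sketch of the frame-embedding part, assuming OpenCV for decoding; the actual stage may use a different decoder and would read samples out of shards rather than from a file path. `embed_frames` and its `every_n` parameter are hypothetical names.

```python
import cv2
import open_clip
import torch
from PIL import Image

# Pick an open_clip model and get its matching preprocessing transform.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()


def embed_frames(video_path: str, every_n: int = 10) -> torch.Tensor:
    """Decode a video and CLIP-embed every n-th frame."""
    cap = cv2.VideoCapture(video_path)
    embeddings = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV yields BGR arrays; the transform expects a PIL RGB image.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            with torch.no_grad():
                emb = model.encode_image(preprocess(image).unsqueeze(0))
            embeddings.append(emb.squeeze(0))
        idx += 1
    cap.release()
    return torch.stack(embeddings)
```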
Testing (❌)