Since the text pipeline works as a proof of concept, the next step is adding video
Using OpenCLIP
TVQA clips are 60-90 seconds, so for now I'm sticking to the annotated start-end timestamps for each question, which brings clip size down to ~15 seconds. This can be increased later
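A minimal sketch of the trimming step, assuming the annotations give a start/end in seconds (the field names and paths here are placeholders, not the actual TVQA schema):

```python
# Trim a full 60-90s clip down to the annotated question window.
import subprocess

def trim_clip(src_path: str, dst_path: str, start_sec: float, end_sec: float) -> None:
    """Cut [start_sec, end_sec] out of the source clip with ffmpeg."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start_sec),              # seek to annotated start
            "-i", src_path,
            "-t", str(end_sec - start_sec),     # keep only the annotated window
            "-c", "copy",                       # no re-encode: fast, but cuts snap to keyframes
            dst_path,
        ],
        check=True,
    )

# e.g. trim_clip("clips/s01e01_seg02.mp4", "trimmed/q123.mp4", 42.0, 57.5)
```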
Not doing any facial recognition for now
Sampling at 1 FPS and running all frames through CLIP alongside the H0s; if confidence is above some threshold, the hypothesis counts as present in the video
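A rough sketch of the 1 FPS + threshold idea with OpenCLIP. The model name, pretrained tag, and the 0.25 cosine threshold are placeholders, not settled choices:

```python
import cv2
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def sample_frames(path: str, every_sec: float = 1.0):
    """Yield PIL frames sampled roughly every `every_sec` seconds."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps * every_sec)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield Image.fromarray(frame[:, :, ::-1])  # BGR -> RGB
        idx += 1
    cap.release()

@torch.no_grad()
def hypotheses_in_video(path: str, hypotheses: list[str], threshold: float = 0.25):
    """Return hypotheses whose best frame-level cosine similarity clears the threshold."""
    text_feats = model.encode_text(tokenizer(hypotheses))
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

    best = torch.full((len(hypotheses),), -1.0)
    for frame in sample_frames(path):
        img_feats = model.encode_image(preprocess(frame).unsqueeze(0))
        img_feats /= img_feats.norm(dim=-1, keepdim=True)
        sims = (img_feats @ text_feats.T).squeeze(0)  # cosine sim per hypothesis
        best = torch.maximum(best, sims)              # keep each hypothesis's best frame

    return [h for h, s in zip(hypotheses, best.tolist()) if s >= threshold]
```

Taking the max over frames means a single strong frame is enough to count a hypothesis as present; averaging over frames is the obvious alternative if that turns out too permissive.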
The problem with this setup is that it ignores motion. Video-language options include Vid2Seq and LaViLa, and it would be good to switch to one of them eventually, but I'd like to see how a SoTA image model does first; CLIP will also be much faster during the hack-and-build phase