iejMac / video2dataset

Easily create large video dataset from video urls
MIT License

Adding diarization to whisperX or do inference with other models #238

Closed. jun297 closed this issue 1 year ago

jun297 commented 1 year ago

Hi, thank you for sharing this nice work.

I am new to video2dataset, and I do not fully understand what a subsampler is in video2dataset. My current understanding is that a subsampler applies some processing step to a given video set (resolution, frame rate, WhisperX).

What I am trying to do is: given a large-scale video set, run inference with various models such as WhisperX (+ diarization) or other vision models (per frame or per video).

This may be a dumb question, but in this case, is adding a custom subsampler or modifying the whisper subsampler the right direction?
Something like caption_subsampler or whisper_subsampler?

iejMac commented 1 year ago

Hey, good question! I probably could've been clearer in the descriptions. A subsampler is anything that takes an input modality (video, audio, images) and transforms it into another form (usually with lower dimensionality, hence "subsampler").
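To make the idea concrete, here is a minimal, self-contained sketch of something "subsampler-shaped": a callable that takes one modality (a sequence of frames) and returns a reduced version of it. The class name `EveryNthFrameSubsampler` and its `step` parameter are made up for illustration and are not part of video2dataset; the real subsamplers work on encoded streams and have their own interface.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


class EveryNthFrameSubsampler:
    """Hypothetical example: keeps every `step`-th frame of a clip.

    This only illustrates the "input modality in, smaller form out" idea;
    it is not a class from video2dataset.
    """

    def __init__(self, step: int):
        if step < 1:
            raise ValueError("step must be >= 1")
        self.step = step

    def __call__(self, frames: Sequence[Frame]) -> List[Frame]:
        # Slicing with a stride drops frames, i.e. lowers the frame rate.
        return list(frames[:: self.step])


# Stand-in "frames": integers instead of real image arrays.
frames = list(range(10))
reduced = EveryNthFrameSubsampler(step=3)(frames)
```

Resolution, frame-rate, and transcription subsamplers all follow the same pattern: the output is derived from the input and is usually smaller.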

If your goal is to distribute model inference using video2dataset and you want to use a model which currently isn't supported, you will need to implement a new worker and a new subsampler. You can basically copy the existing integration of WhisperX or the Captioning Worker: I suggest checking out the commits that introduced them and adapting those changes for your desired model. I am working on making this easier.

As for Whisper: if you want to add diarization, I imagine this will be easier, since it would probably just require some edits in the subsampler itself and a few added args - https://github.com/iejMac/video2dataset/blob/main/video2dataset/subsamplers/whisper_subsampler.py

If you want to contribute this, I think it would be a great contribution. Maybe add a boolean argument to the class, like "diarization", and based on this argument have an if statement in the call method which takes the audio and the output subtitles and performs diarization. What do you think?
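A rough sketch of that suggestion, under stated assumptions: the class below is hypothetical and is not the actual `WhisperSubsampler`. The `(streams, metadata, error_message)` return shape, the `"audio"` key, and the `"whisper_transcript"` metadata field are assumptions made for this example, and the transcription and diarization steps are injected as plain callables (`transcribe_fn`, `diarize_fn`) so the snippet runs without WhisperX installed.

```python
from typing import Callable, Dict, List, Optional, Tuple

Segments = List[Dict]


class DiarizingWhisperSubsampler:
    """Hypothetical sketch of a transcription subsampler with a diarization flag."""

    def __init__(
        self,
        transcribe_fn: Callable[[bytes], Segments],
        diarize_fn: Optional[Callable[[bytes, Segments], Segments]] = None,
        diarization: bool = False,
    ):
        if diarization and diarize_fn is None:
            raise ValueError("diarization=True requires a diarize_fn")
        self.transcribe_fn = transcribe_fn
        self.diarize_fn = diarize_fn
        self.diarization = diarization

    def __call__(
        self, streams: Dict[str, List[bytes]], metadata: Optional[List[Dict]] = None
    ) -> Tuple[Dict[str, List[bytes]], List[Dict], Optional[str]]:
        audio_list = streams.get("audio", [])
        if metadata is None:
            metadata = [{} for _ in audio_list]
        try:
            for audio_bytes, meta in zip(audio_list, metadata):
                segments = self.transcribe_fn(audio_bytes)
                # The boolean flag gates the extra speaker-labelling step.
                if self.diarization:
                    segments = self.diarize_fn(audio_bytes, segments)
                meta["whisper_transcript"] = segments
        except Exception as err:
            # Report the failure to the caller instead of raising.
            return streams, metadata, str(err)
        return streams, metadata, None


# Stand-ins for real models, used only to exercise the control flow.
def fake_transcribe(audio_bytes: bytes) -> Segments:
    return [{"start": 0.0, "end": 1.0, "text": "hello"}]


def fake_diarize(audio_bytes: bytes, segments: Segments) -> Segments:
    return [dict(seg, speaker="SPEAKER_00") for seg in segments]


subsampler = DiarizingWhisperSubsampler(fake_transcribe, fake_diarize, diarization=True)
streams, metadata, error = subsampler({"audio": [b"\x00\x01"]})
```

With `diarization=False` the same call leaves the segments without speaker labels, so existing behaviour would stay the default.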

jun297 commented 1 year ago

Thank you for the kind and detailed reply. Now I understand.

I think the Whisper change is a good place to start. If I make something, I'll ask again.