huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
135.22k stars 27.06k forks source link

Add AudioQuestionAnswering pipeline #33782

Open cdreetz opened 1 month ago

cdreetz commented 1 month ago

Feature request

A new AudioQuestionAnswering pipeline, just like DQA but instead of providing a document, applying OCR, and doing QA over it, provide audio file, apply STT, and do QA over the transcript. Advanced version includes diarization+STT as speaker annotations provide important context and will improve QA/understanding.

Motivation

This kind of pipeline is one that I have had to build on multiple occasions for processing audio, specifically phone call recordings. Just like the other pipelines which provide accessibility to some applied ML based pipeline for those to use quickly and easily, this will provide the same thing just for a different modality than what is currently provided.

Your contribution

I plan to contribute the entire pipeline. My inspiration and what I plan to base a lot of the PR for this pipeline comes from #18414.

I'm mostly just posting this issue to get feedback from HF team. Tagging @Narsil @NielsRogge as they also provided feedback on the DQA PR.

LysandreJik commented 1 month ago

cc @ylacombe @eustlb @Rocketknight1

Rocketknight1 commented 1 month ago

I think this is quite an interesting idea, and I'd support it as a pipeline (even though we don't have a matching Hub spec for it yet). cc @sanchit-gandhi who I think worked on diarization as well.

Overall though, I'd be happy to accept and review the PR, unless anyone else has objections!

cdreetz commented 1 month ago

Hey @Rocketknight1, thanks for the willingness to help! I've implemented a working version and iterated on it a bunch, but am at a point I think it would be best to get the opinions of maintainers. A few things undecided I would love some input on:

Rocketknight1 commented 1 month ago

Hmm, I see! I didn't realize when you first proposed this that it combined two separate models that weren't trained together. That is unusual for pipelines - is there a reason to use a single pipeline for this task, instead of just calling a STT pipeline and then passing output to an Instruct?