huggingface / api-inference-community

Apache License 2.0

Audio-to-regions widget and community API for pyannote.audio #25

Open hbredin opened 2 years ago

hbredin commented 2 years ago

Opening an issue as per @osanseviero's suggestion on Twitter. Issue imported from https://github.com/pyannote/pyannote-audio/issues/835


pyannote.audio 2.0 will bring a unified pipeline API:

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
output = pipeline("audio.wav")   # or pipeline({"waveform": np.ndarray, "sample_rate": int})

where output is a pyannote.core.Annotation instance.
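To make the widget discussion concrete, here is a minimal sketch of how such an `Annotation`-style output (a set of labeled time segments) could be flattened into plain region dicts for a front-end. The helper name, field names, and segment values are all made up for illustration; they are not part of pyannote's API:

```python
# Hypothetical sketch: convert diarization output (labeled time segments,
# as a pyannote.core.Annotation would provide via itertracks) into plain
# region dicts that a widget could render. Values are illustrative only.

def annotation_to_regions(segments):
    """segments: iterable of (start_sec, end_sec, speaker_label) tuples."""
    return [
        {"start": round(start, 3), "stop": round(end, 3), "label": label}
        for start, end, label in segments
    ]

diarization = [(0.2, 1.5, "SPEAKER_00"), (1.8, 3.1, "SPEAKER_01")]
regions = annotation_to_regions(diarization)
# e.g. regions[0] == {"start": 0.2, "stop": 1.5, "label": "SPEAKER_00"}
```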

I just created a space that lets you test a bunch of pipelines shared on the Hugging Face Hub, but it would be nice if those were testable directly in their own model card.

My understanding is that two things need to happen

osanseviero commented 2 years ago

This is a cool proposal! On Twitter I mentioned we could model this task as audio-to-audio, and it would already work by outputting multiple audios. But having a nice custom widget specific to the task would be very cool!

cc @mishig25 @julien-c WDYT?

Narsil commented 2 years ago

This is very cool!

Definitely a good target for audio-to-audio as a starter (no widget needed). audio-segmentation seems like a good fit for what you're trying to do (it does not exist yet, but it should cover multiple use cases).

julien-c commented 2 years ago

audio-token-classification? 😱 audio-to-structured?

not sure of the best new task type to keep some generality

But yeah could be cool to have it

Narsil commented 2 years ago

> audio-token-classification? 😱

You're actually pretty spot on IMO, since token-classification is actually text-segmentation, I think. It's also aligned with image-segmentation.

Which basically should be a list of "objects" found in the text/audio/image, plus some descriptor of "where" those objects are in the original input. Audio and text are 1D and their objects are basically never non-contiguous, so start + stop are enough, IMO. In images, because they're 2D, a full mask is basically required even for contiguous objects (boxes are also a simplification).
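The shapes described above can be sketched as follows. These are hypothetical output schemas following the reasoning in this thread, not an existing Hub spec; all field names and values are assumptions:

```python
# Hypothetical output shapes for a generic "segmentation" task family:
# 1D inputs (text, audio) only need start/stop offsets, while 2D inputs
# (images) need a full per-pixel mask. Field names are assumptions.

audio_segmentation_output = [
    {"label": "SPEAKER_00", "start": 0.0, "stop": 2.4},  # seconds
    {"label": "SPEAKER_01", "start": 2.4, "stop": 5.1},
]

token_classification_output = [
    {"label": "PER", "start": 0, "stop": 11},  # character offsets
]

image_segmentation_output = [
    # 2D objects may be non-contiguous, so a per-pixel mask (here a tiny
    # 3x3 example) is required; a bounding box would be a simplification.
    {"label": "cat", "mask": [[0, 1, 1], [0, 1, 1], [0, 0, 0]]},
]
```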

osanseviero commented 2 years ago

Btw, didn't audio-segmentation (speech-segmentation) exist, and we deprecated it in favor of audio-to-audio, no @Narsil?

Narsil commented 2 years ago

speech-segmentation was never deprecated, but it also never had widget support afaik.

Its output is not audio, so I don't see how audio-to-audio could be used:

https://github.com/huggingface/huggingface_hub/blob/main/api-inference-community/docker_images/superb/app/pipelines/speech_segmentation.py

hbredin commented 2 years ago

Nice!

Would you recommend we update this PR to speech-segmentation then?

Narsil commented 2 years ago

I think we can keep the PR as is, merge it when ready, so things are functional (even though less than perfect).

And when support for audio-segmentation is ready (or even before), we can simply create a new PR.