Add audio-to-text pipeline

eliteprox commented 3 months ago

What does this pull request do? Explain your changes. (required)

Adds the new /audio-to-text pipeline to go-livepeer, supporting the openai/whisper-large-v3 model.

Supported file formats are mp3, m4a, mp4, webm, and flac.
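As an illustration only, a gateway could reject obviously unsupported uploads before probing them. The sketch below checks the file extension against the formats listed above. The helper name isSupportedAudioFile is hypothetical and not part of this PR, which relies on ffmpeg.GetCodecInfo rather than file names.

```go
package main

import (
	"path/filepath"
	"strings"
)

// supportedAudioExts mirrors the formats listed in the PR description.
var supportedAudioExts = map[string]bool{
	".mp3":  true,
	".m4a":  true,
	".mp4":  true,
	".webm": true,
	".flac": true,
}

// isSupportedAudioFile is a hypothetical pre-check (not from the PR).
// It only inspects the extension, so a renamed or corrupt file still
// has to be caught later by the duration probe.
func isSupportedAudioFile(name string) bool {
	ext := strings.ToLower(filepath.Ext(name))
	return supportedAudioExts[ext]
}
```

An extension check is cheap but weak, which is why it can only complement a real probe of the container.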

This change requires https://github.com/livepeer/ai-worker/pull/103 and https://github.com/livepeer/lpms/pull/407

Specific updates (required)

Refactors handleAIRequest and processAIRequest to support new response types like TextResponse
Adds /audio-to-text endpoint to ai_mediaserver.go
Pricing is fixed at one pixel per millisecond of audio
Uses ffmpeg.GetCodecInfo to calculate the audio duration, which requires the lpms pull request above

How did you test each of these updates (required)

Tested with rich vocal audio up to 4 hours long. Regression tested the other pipelines to ensure the refactoring did not cause any issues.
Tested with all supported file formats and unsupported ones

curl request example:

curl --request POST \
  --url http://dev.eliteencoder.net:8937/audio-to-text \
  --header 'Content-Type: multipart/form-data' \
  --form 'audio=@Recording.mp3' \
  --form 'model_id=openai/whisper-large-v3' \
  --form seed=123

Does this pull request close any open issues?

LIV-429 LIV-289

Checklist:

[x] Read the contribution guide
[x] make runs successfully
[x] All tests in ./test.sh pass
[x] README and other documentation updated
[ ] Pending changelog updated

eliteprox commented 3 months ago

Added error handlers that respond with "400 Bad Request" when the duration cannot be calculated because of an unsupported file format or file corruption. This prevents invalid jobs from being sent to the network.

emranemran commented 3 months ago

Overall LGTM. I would recommend using ffmpeg for audio track processing (such as calculating durations) and accepting only audio input rather than video. ffmpeg can help with that as well, and we could potentially share the Probe package from catalyst-api that I linked in the comments. We also need to understand what the file length limits are.

rickstaa commented 2 months ago

@eliteprox looks like our pipeline fails to detect the duration of the following file:

speech.zip

Do you maybe know why 🤔?

eliteprox commented 2 months ago

> @eliteprox looks like our pipeline fails to detect the duration of the following file:
>
> speech.zip
>
> Do you maybe know why 🤔?

This one appears to be a concatenated file and ffmpeg has issues calculating the duration in this case. The recommended solution is to re-encode the input to a consistent output format like flac. This can be combined with the effort to send audio-only to the ai-worker to optimize the pipeline.
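The re-encoding step suggested here could look roughly like the sketch below, which shells out to the ffmpeg CLI. This is an assumption made for illustration: go-livepeer normally talks to ffmpeg through lpms, and the function names flacReencodeArgs and reencodeToFLAC are hypothetical.

```go
package main

import "os/exec"

// flacReencodeArgs builds ffmpeg CLI arguments that re-encode the input
// to an audio-only FLAC file:
//   -y        overwrite the output file if it exists
//   -i in     input file
//   -vn       drop any video stream
//   -c:a flac encode audio with the FLAC codec
func flacReencodeArgs(in, out string) []string {
	return []string{"-y", "-i", in, "-vn", "-c:a", "flac", out}
}

// reencodeToFLAC runs ffmpeg with the arguments above. It assumes an
// ffmpeg binary is available on PATH.
func reencodeToFLAC(in, out string) error {
	return exec.Command("ffmpeg", flacReencodeArgs(in, out)...).Run()
}
```

Dropping the video stream with -vn also covers the suggestion to send audio-only data to the ai-worker.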

livepeer / go-livepeer

Add audio-to-text pipeline #3078