livepeer / go-livepeer

Official Go implementation of the Livepeer protocol
http://livepeer.org
MIT License

Add audio-to-text pipeline #3078

Closed eliteprox closed 2 months ago

eliteprox commented 3 months ago

What does this pull request do? Explain your changes. (required)

Adds the new /audio-to-text pipeline to go-livepeer, supporting the openai/whisper-large-v3 model.

Supported file formats are mp3, m4a, mp4, webm, and flac.

This change requires https://github.com/livepeer/ai-worker/pull/103 and https://github.com/livepeer/lpms/pull/407

Specific updates (required)

How did you test each of these updates (required)

curl request example:

curl --request POST \
  --url http://dev.eliteencoder.net:8937/audio-to-text \
  --header 'Content-Type: multipart/form-data' \
  --form 'audio=@Recording.mp3' \
  --form 'model_id=openai/whisper-large-v3' \
  --form seed=123

Does this pull request close any open issues?

LIV-429 LIV-289

Checklist:

eliteprox commented 3 months ago

Added error handlers that respond with "400 Bad Request" when the duration cannot be calculated because the file format is unsupported or the file is corrupt. This prevents invalid jobs from being sent to the network.

emranemran commented 3 months ago

Overall LGTM. I would recommend using ffmpeg for audio track processing (such as calculating durations) and accepting only audio input rather than video. ffmpeg can help with that as well, and we could potentially share the Probe package from catalyst-api that I linked in the comments. We also need to understand what the file length limits are.

rickstaa commented 2 months ago

@eliteprox looks like our pipeline fails to detect the duration of the following file:

speech.zip

Do you maybe know why 🤔?

eliteprox commented 2 months ago

This one appears to be a concatenated file, and ffmpeg has trouble calculating the duration in that case. The recommended solution is to re-encode the input to a consistent output format such as flac. This can be combined with the effort to send audio-only input to the ai-worker to optimize the pipeline.