PSchroedl/lipsync pipeline feature

This new route at ‘/lipsync’ takes either a text input or an audio file, along with a static image, and produces an mp4 of lip-synced audio and video.

When the optional parameter return_frames is set, the route returns individual frames following the schema used in the image-to-video pipeline.

If text is supplied instead of an audio file, FastSpeech2Conformer is used for TTS.
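As a rough illustration of the input rules above (one static image plus either text or audio, with optional return_frames), a client could assemble its request like this. This is only a sketch: the form field names "image", "audio", and "text_input" and the helper build_lipsync_request are hypothetical, so check the route's /docs page for the actual schema.

```python
from typing import Optional


def build_lipsync_request(
    image_path: str,
    text: Optional[str] = None,
    audio_path: Optional[str] = None,
    return_frames: bool = False,
) -> dict:
    """Describe a multipart request for the /lipsync route.

    The field names used here are assumptions, not the confirmed API schema.
    """
    # Exactly one of text or audio should drive the animation.
    if (text is None) == (audio_path is None):
        raise ValueError("supply exactly one of text or audio_path")

    data = {"return_frames": str(return_frames).lower()}
    files = {"image": image_path}  # field name -> local file path
    if text is not None:
        data["text_input"] = text  # text is converted to speech server-side
    else:
        files["audio"] = audio_path
    return {"path": "/lipsync", "data": data, "files": files}
```

The returned dictionary can then be handed to whichever HTTP client is in use, opening each entry in "files" as a binary upload.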

The text input and mp4 output options differ from the bounty requirements solely for ease of demo (and debugging) and can quickly be removed if desired.

At the time of writing, a demo server is running at http://204.12.245.134:8002/docs#/default/lipsync

(Disclaimer - long audio or text sequences will OOM on the GPU and may not gracefully recover.)

Real3DPortrait https://github.com/yerfor/Real3DPortrait is used for the audio-to-video synchronization pipeline, and a purpose-built Conda environment is configured on the host, isolating the majority of the requirements.
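To illustrate the isolation idea, the worker can call into a separate Conda environment by shelling out with `conda run`, so the pipeline's dependencies never touch the worker's own Python environment. The sketch below only builds the command line. The environment name, script name, and flags are hypothetical placeholders, not the ones used in this PR.

```python
from typing import Dict, List


def conda_run_command(env_name: str, script: str, options: Dict[str, str]) -> List[str]:
    """Build an argv list that runs `script` inside the named Conda environment."""
    cmd = ["conda", "run", "-n", env_name, "python", script]
    for flag, value in options.items():
        # Each option becomes a "--flag value" pair, in insertion order.
        cmd += [f"--{flag}", value]
    return cmd


# Hypothetical usage (all names are placeholders):
#   import subprocess
#   cmd = conda_run_command("lipsync-env", "infer.py",
#                           {"image": "face.png", "audio": "speech.wav"})
#   subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with file paths that contain spaces.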

Standing apart from this majority is one particular requirement that needed to be installed at the OS level. Instead of bumping the version of our Ubuntu base image 20.04 → 22.04, I’ve created a separate Dockerfile which builds the necessary version from source.

Instructions for running and debugging the lipsync pipeline can be found at cmd/lipsync/README.md

The approach taken here was a bit atypical (pulling in an entire repo to use as a pipeline), but a personal goal was to make some improvement to developer velocity on the AI Pipeline. The changes in this PR establish a pattern that lets devs prototype new pipelines and test existing open-source implementations without potentially conflicting or hard-to-resolve dependencies.

Further work would include implementing the lower-level inference logic from scratch to allow finer control over model selection and over loading, unloading, and caching.

livepeer / ai-worker

PSchroedl/lipsync pipeline feature #120