
Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text

Repo containing code for the txt2vid project. It provides a proof-of-concept for the following compression pipeline (for more details, read the paper on arXiv):

[Figure: motivation and overview of the txt2vid compression pipeline]

Though the pipeline is flexible and each component can be replaced by appropriate software performing the same function, the code currently uses Wav2Lip for lip-syncing, the Resemble and Google APIs for personalized and general text-to-speech (TTS) synthesis respectively, and the Google API for speech-to-text (STT). It uses ffmpeg-python to enable streaming.
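As an illustration of the streaming piece, here is a minimal sketch of how ffmpeg-python can push raw video frames to a port as an MPEG-TS stream. This is not the repo's actual code; the resolution, frame rate, codec, and port are placeholder values.

import ffmpeg

# placeholder parameters: 640x480 BGR frames at 25 fps, streamed as MPEG-TS over TCP
process = (
    ffmpeg
    .input('pipe:', format='rawvideo', pix_fmt='bgr24', s='640x480', framerate=25)
    .output('tcp://127.0.0.1:8080?listen', format='mpegts', vcodec='libx264')
    .run_async(pipe_stdin=True)
)
# each generated frame (an HxWx3 uint8 numpy array) would then be written with:
#   process.stdin.write(frame.tobytes())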

Table of Contents

  Demo Videos
  Installation Instructions
  Use Cases

Demo Videos

Streaming txt2vid video on a port using streaming text input from the terminal on the sender side

Click on the image below to play the demo video. [Demo1]

Streaming txt2vid video on a port using streaming text input from STT on the sender side

Click on the image below to play the demo video. [Demo2]

Installation Instructions

Notes

Environment Setup

Set up the requirements using the following steps on all machines (sender, server, receiver):
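A minimal sketch of a typical setup, assuming a conda environment and a requirements file at the repo root (the environment name and Python version are placeholders):

git clone https://github.com/tpulkit/txt2vid.git
cd txt2vid
conda create -n txt2vid python=3.7    # placeholder env name and Python version
conda activate txt2vid
pip install -r requirements.txt       # assumes a requirements.txt at the repo root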

Wav2Lip Setup

Make sure the model files from the Wav2Lip repo are downloaded and placed in the appropriate folder on the machine where the decoding code will run (the server).
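Following the conventions of the Wav2Lip repo, the expected layout is roughly as below; the exact paths and file names are assumptions and should be checked against the Wav2Lip instructions.

Wav2Lip/checkpoints/wav2lip_gan.pth              # lip-sync model weights
Wav2Lip/face_detection/detection/sfd/s3fd.pth    # face-detection model weights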

Google STT and TTS Setup

To use the Google API for TTS or STT, ensure the following steps are executed:
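A minimal sketch of the standard Google Cloud credential setup, assuming a service-account key with the Text-to-Speech and Speech-to-Text APIs enabled on the project:

pip install google-cloud-texttospeech google-cloud-speech
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"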

Resemble TTS Setup

To use the Resemble API, ensure the following steps are executed:
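A minimal sketch, assuming Resemble's Python SDK and an API token from the Resemble dashboard; the environment variable name below is a placeholder, and the actual mechanism (config file, flag, or variable) is whatever the repo's Resemble instructions specify.

pip install resemble                        # Resemble's Python SDK
export RESEMBLE_API_TOKEN="your-api-token"  # placeholder variable name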

Use Cases

Currently, the repo supports the following use cases:

  1. Storing txt2vid video as a file using text or an audio file available at the server
  2. Streaming txt2vid video on a port using text or an audio file available at the server
  3. Streaming txt2vid video on a port using streaming input from a sender

The main scripts are:

  1. Wav2Lip/inference_streaming_pipeline.py, which handles decoding by appropriately routing inputs through various pipes and queues in a multiprocess framework
  2. input_stream_socket.py, which handles streaming input on the sender side.

Below we describe a subset of these use cases, with an example for each store/stream modality, in increasing order of complexity. See all available argument flags by running:

cd Wav2Lip
python inference_streaming_pipeline.py -h

and

python input_stream_socket.py -h

Ensure the Google or Resemble TTS setup is done for all use cases involving text, as described in Google STT and TTS Setup and Resemble TTS Setup.

Storing txt2vid video as a file using text or an audio file available at the server

server (AV-synced streamed video)
^
|
pre-recorded audio/text + driving picture/video

On the server, launch the streaming inference script and save the generated video.

Example Code:
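A hypothetical invocation; every flag name below is illustrative rather than the script's actual interface, so consult python inference_streaming_pipeline.py -h for the real flags.

cd Wav2Lip
# illustrative flags only; check -h for the actual interface
python inference_streaming_pipeline.py \
    --face driving_video.mp4 \
    --text input.txt --tts resemble \
    --save_video --outfile results/txt2vid_out.mp4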

Streaming txt2vid video on a port using text or an audio file available at the server

server (AV-synced streamed video) -----> receiver (view AV stream)
^
|
pre-recorded audio/text + driving picture/video

On the server, launch the streaming inference script and port-forward to stream the generated txt2vid video.

Example Code:
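A hypothetical invocation; the flag names are illustrative and the stream protocol is an assumption (check -h for the actual interface).

# on the server: generate and stream the video to a port (illustrative flags only)
cd Wav2Lip
python inference_streaming_pipeline.py \
    --face driving_video.mp4 \
    --text input.txt --tts resemble \
    --stream --port 8080

# on the receiver: view the AV stream (assumes an MPEG-TS stream over TCP)
ffplay tcp://<server-ip>:8080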

Note: We show the use case with text as input and Resemble as the TTS. Other use cases, e.g., Google TTS or an audio file as input, can be generated by changing the appropriate flags.

Streaming txt2vid video on a port using streaming input from a sender

sender -----> server (AV-synced streamed video) -----> receiver (view AV stream)
^             ^
|             |
audio/video   driving picture/video

Example Code:
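A hypothetical end-to-end invocation; all flag names, ports, and the stream protocol are illustrative assumptions (check -h on both scripts for the actual interfaces).

# on the sender: stream text typed at the terminal to the server (illustrative flags)
python input_stream_socket.py --host <server-ip> --port 9000

# on the server: receive the streamed text, run TTS and lip-syncing, and stream out
cd Wav2Lip
python inference_streaming_pipeline.py \
    --face driving_video.mp4 \
    --tts resemble --input_port 9000 \
    --stream --port 8080

# on the receiver: view the AV stream
ffplay tcp://<server-ip>:8080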

Note: We show the use case with streaming text as input from the terminal and Google STT, with Resemble as the TTS. Other use cases, e.g., Google TTS or audio from a microphone, can be generated by changing the appropriate flags.