
LipReading

Main repository for LipReading with Deep Neural Networks

Introduction

The goal is to implement lip reading: just as end-to-end speech recognition systems map high-fidelity speech audio to sensible character- and word-level outputs, we will do the same for "speech visuals". In particular, we will take video frames as input, extract the relevant mouth/chin signals, and map them to characters and words.

Overview

TODO

A high-level overview of some TODO items. For more project details, please see the GitHub project.

Architecture

There are two primary interconnected pipelines: a "vision" pipeline that extracts face and lip features from video frames, and an "NLP-inspired" pipeline that temporally correlates the sequential lip features into the final output.

Here's a quick dive into the tensor dimensionalities at each stage:

Vision Pipeline

Video -> Frames       -> Face Bounding Box Detection      -> Face Landmarking    
Repr. -> (n, y, x, c) -> (n, (box=1, y_i, x_i, w_i, h_i)) -> (n, (idx=68, y, x))   
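
As a concrete illustration, here is a minimal sketch of the two vision stages, assuming dlib's frontal face detector and its standard 68-point landmark model (the actual detector and model in this repo may differ):

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# dlib's standard 68-point landmark model, downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_frames(frames):
    """(n, y, x, c) uint8 RGB frames -> (n, 68, 2) array of (y, x) landmarks."""
    out = []
    for frame in frames:
        boxes = detector(frame, 1)           # face bounding boxes
        if not boxes:
            out.append(np.zeros((68, 2)))    # no face found in this frame
            continue
        shape = predictor(frame, boxes[0])   # landmark the first detected face
        out.append(np.array([(p.y, p.x) for p in shape.parts()]))
    return np.stack(out)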

NLP Pipeline

 -> Letters  ->  Words    -> Language Model 
 -> (chars,) ->  (words,) -> (sentences,)
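
Since training uses CTCLoss (see Setup below), the letters stage amounts to CTC decoding of per-frame character distributions produced from the vision features. A minimal greedy-decoding sketch, with an illustrative alphabet and blank index rather than the repo's actual vocabulary:

import numpy as np

ALPHABET = list(" abcdefghijklmnopqrstuvwxyz")  # hypothetical charset
BLANK = len(ALPHABET)                           # CTC blank index

def greedy_ctc_decode(scores):
    """(time, len(ALPHABET) + 1) scores -> decoded character string."""
    chars, prev = [], BLANK
    for idx in scores.argmax(axis=-1):
        if idx != prev and idx != BLANK:        # collapse repeats, drop blanks
            chars.append(ALPHABET[idx])
        prev = idx
    return "".join(chars)

sentence = greedy_ctc_decode(np.random.randn(75, len(ALPHABET) + 1))
words = sentence.split()                        # characters -> words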

Datasets

Setup

  1. Clone this repository and install the requirements. We will be using Python 3.

Before running any Python scripts, make sure your PYTHONPATH includes the repository root (./) and that the workspace environment variable is set (see step 2).

git clone git@github.com:joseph-zhong/LipReading.git
# (optional) set up a venv: cd LipReading; python3 -m venv .
  2. Once the repository is cloned, set up the repository's PYTHONPATH and workspace environment variable to take advantage of the standardized directory utilities in ./src/utils/utility.py.

Copy the following into your ~/.bashrc

export PYTHONPATH="$PYTHONPATH:/path/to/LipReading/" 
export LIP_READING_WS_PATH="/path/to/LipReading/"
  3. Install requirements.txt, which includes PyTorch (with CTCLoss), SpaCy, and other dependencies. (A quick CTCLoss sanity check follows these steps.)

On macOS, for CPU-only support:

pip3 install -r requirements.macos.txt

On Ubuntu, for GPU support:

pip3 install -r requirements.ubuntu.txt
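
After installing, a quick sanity check that the CTC loss is available. This assumes a PyTorch version with the built-in torch.nn.CTCLoss (>= 1.0); the pinned requirements may instead use a warp-ctc binding. Shapes below are arbitrary:

import torch

ctc = torch.nn.CTCLoss(blank=0)
log_probs = torch.randn(50, 4, 28).log_softmax(2)  # (time, batch, classes)
targets = torch.randint(1, 28, (4, 10), dtype=torch.long)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 50, dtype=torch.long),
           target_lengths=torch.full((4,), 10, dtype=torch.long))
print(loss.item())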

SpaCy Setup

We need to install SpaCy's pre-built English model for some NLP capabilities:

python3 -m spacy download en
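
To verify the download ("en" is the shortcut name SpaCy 2.x registers for the model):

import spacy

nlp = spacy.load("en")                  # the shortcut installed above
doc = nlp("stephen colbert reads lips")
print([token.text for token in doc])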

Data Directories Structure

This allows us to have a simple standardized directory structure for all our datasets, raw data, model weights, logs, etc.

./data/
  --/datasets (numpy dataset files for dataloaders to load)
  --/raw      (raw caption/video files extracted from online sources)
  --/weights  (model weights, both for training/checkpointing/running)
  --/tb       (Tensorboard logging)
  --/...

See ./src/utils/utility.py for more.
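
The helpers in ./src/utils/utility.py build on the LIP_READING_WS_PATH variable set earlier. A hypothetical sketch of the kind of helper it standardizes (the names here are illustrative, not the actual API):

import os

WS = os.environ["LIP_READING_WS_PATH"]  # set in ~/.bashrc during setup

def data_path(*parts):
    """Join a path under ./data/, e.g. data_path("weights", "best.pth")."""
    return os.path.join(WS, "data", *parts)

print(data_path("raw", "StephenColbert"))  # .../LipReading/data/raw/StephenColbert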

Getting Started

Now that the dependencies are all set up, we can finally do stuff!

Configuration

Each of our "standard" scripts in ./src/scripts (i.e. not ./src/scripts/misc) takes standard argparse-style arguments. For each "standard" script, you can pass --help to see the expected arguments. To maintain reproducibility, command-line arguments can be written in a raw text file with one argument per line.

e.g. for ./config/gen_dataview/nano

--inp=StephenColbert/nano 

This represents the arguments to pass to ./src/scripts/generate_dataview.py, which can be passed automatically via

./src/scripts/generate_dataview.py $(cat ./config/gen_dataview/nano)

Arguments are consumed in left-to-right order, so if an argument is repeated, later settings override earlier ones. This allows for modularity when configuring hyperparameters.

(For demonstration purposes, not a working example)

./src/scripts/train.py \
    $(cat ./config/dataset/large) \
    $(cat ./config/train/model/small-model) \
    $(cat ./config/train/model/rnn/lstm) \
    ...
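
The override behavior comes directly from argparse: when the same flag appears more than once, the last occurrence wins. A standalone demonstration (not the repo's actual parser):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--inp")
args = parser.parse_args(["--inp=StephenColbert/nano", "--inp=StephenColbert/micro"])
print(args.inp)  # "StephenColbert/micro": the later setting wins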

Train Model

To train a model, run:

./src/scripts/train.py

Examples

Training on Micro

./src/scripts/train_model.py $(cat ./config/train/micro)

Tensorboard Visualization

See README_TENSORBOARD.md

Other Resources

This is a collection of external links, papers, projects, and other potentially helpful starting points for the project.

Other Projects

Other Academic Papers

Academic Datasets