Main repository for LipReading with Deep Neural Networks
The goal is to implement LipReading: similar to how end-to-end speech recognition systems map high-fidelity speech audio to sensible character- and word-level outputs, we will do the same for "speech visuals". In particular, we will take video frames as input, extract the relevant mouth/chin signals, and map them to characters and words.
A high-level overview of some TODO items follows; for more project details, please see the GitHub project.
There are two primary interconnected pipelines: a "vision" pipeline that extracts face and lip features from video frames, and an "NLP-inspired" pipeline that temporally correlates the sequential lip features into the final output.
Here's a quick dive into the tensor dimensionalities:

```
Video -> Frames       -> Face Bounding Box Detection      -> Face Landmarking
Repr. -> (n, y, x, c) -> (n, (box=1, y_i, x_i, w_i, h_i)) -> (n, (idx=68, y, x))
      -> Letters      -> Words      -> Language Model
      -> (chars,)     -> (words,)   -> (sentences,)
```
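To make those shapes concrete, here is a minimal NumPy sketch; the frame count, resolution, and dtypes are illustrative assumptions, not values from the project's actual loader.

```python
import numpy as np

# Illustrative shapes only; n, h, w, c are assumed values for demonstration.
n, h, w, c = 75, 256, 256, 3
frames = np.zeros((n, h, w, c), dtype=np.uint8)       # (n, y, x, c) video frames
boxes = np.zeros((n, 1, 4), dtype=np.float32)         # (n, box=1, (y_i, x_i, w_i, h_i))
landmarks = np.zeros((n, 68, 2), dtype=np.float32)    # (n, idx=68, (y, x)) face landmarks
print(frames.shape, boxes.shape, landmarks.shape)
```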
Dataset sizes:

- all: 926 videos (projected, not generated yet)
- large: 464 videos (failed at 35/464)
- medium: 104 videos (currently at 37/104)
- small: 23 videos
- micro: 6 videos
- nano: 1 video
Please make sure you run the Python scripts with `python3`, set up your `PYTHONPATH` to include `./` (the repository root), and set a workspace env variable.
```
git clone git@github.com:joseph-zhong/LipReading.git
# (optional, setup venv) cd LipReading; python3 -m venv .
```
Set your `PYTHONPATH` and workspace environment variable to take advantage of the standardized directory utilities in `./src/utils/utility.py`.
Copy the following into your `~/.bashrc`:

```
export PYTHONPATH="$PYTHONPATH:/path/to/LipReading/"
export LIP_READING_WS_PATH="/path/to/LipReading/"
```
Install the dependencies from `requirements.txt`, including PyTorch (with CTCLoss), SpaCy, and others.

On MacOS, for CPU capabilities only:

```
pip3 install -r requirements.macos.txt
```
On Ubuntu, for GPU support:

```
pip3 install -r requirements.ubuntu.txt
```
We also need to install a pre-built English model for SpaCy:

```
python3 -m spacy download en
```
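As a quick sanity check (a minimal sketch, assuming the `en` shortcut installed by the command above), you can verify the model loads:

```python
import spacy

# Load the pre-built English model installed via `python3 -m spacy download en`.
nlp = spacy.load("en")
doc = nlp("Hello from the LipReading setup check.")
print([token.text for token in doc])
```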
The workspace environment variable gives us a simple, standardized directory structure for all our datasets, raw data, model weights, logs, etc.:
```
./data/
--/datasets (numpy dataset files for dataloaders to load)
--/raw      (raw caption/video files extracted from online sources)
--/weights  (model weights, both for training/checkpointing/running)
--/tb       (Tensorboard logging)
--/...
```
See `./src/utils/utility.py` for more.
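As a rough illustration of why the workspace variable is useful, here is a hypothetical sketch of path helpers; the actual names in `./src/utils/utility.py` may differ.

```python
import os

# Hypothetical helpers; the real functions in ./src/utils/utility.py may differ.
WS_PATH = os.environ["LIP_READING_WS_PATH"]

def ws_join(*parts):
    """Join path components onto the workspace root."""
    return os.path.join(WS_PATH, *parts)

def data_dir(kind):
    """Resolve a subdirectory under ./data/, e.g. 'datasets', 'raw', 'weights', 'tb'."""
    return ws_join("data", kind)

print(data_dir("weights"))  # -> /path/to/LipReading/data/weights
```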
Now that the dependencies are all set up, we can finally do stuff!
Each of our "standard" scripts in ./src/scripts
(i.e. not ./src/scripts/misc
) take the standard argsparse
-style
arguments. For each of the "standard" scripts, you will be able to pass --help
to see the expected arguments.
To maintain reproducibility, cmdline arguments can be written in a raw text file with one argument per line.
e.g. the contents of `./config/gen_dataview/nano`,

```
--inp=StephenColbert/nano
```

represent the arguments to pass to `./src/scripts/generate_dataview.py`, automatically passable via

```
./src/scripts/generate_dataview.py $(cat ./config/gen_dataview/nano)
```
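For reference, a hedged sketch of what such a script's parser might look like; only the `--inp` flag comes from the config file above, everything else here is an assumption, so run the real script with `--help` for its actual arguments.

```python
import argparse

# Hypothetical parser sketch; the real flags live in ./src/scripts/generate_dataview.py.
def parse_args():
    parser = argparse.ArgumentParser(description="Generate a dataview from raw videos/captions.")
    parser.add_argument("--inp", type=str, required=True,
                        help="Input dataset name, e.g. StephenColbert/nano")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args.inp)
```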
The arguments are consumed in left-to-right order, so if an argument is repeated, it is overwritten by the later setting. This allows for modularity in configuring hyperparameters.
(For demonstration purposes, not a working example:)

```
./src/scripts/train.py \
  $(cat ./config/dataset/large) \
  $(cat ./config/train/model/small-model) \
  $(cat ./config/train/model/rnn/lstm) \
  ...
```
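This left-to-right override behavior matches how `argparse` treats repeated flags; here is a small standalone demonstration (not project code):

```python
import argparse

# When a flag appears more than once, argparse keeps the last value,
# so later config files override earlier ones.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int)
args = parser.parse_args(["--batch_size", "32", "--batch_size", "64"])
print(args.batch_size)  # 64
```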
To train a model, see `./src/scripts/train.py`; for example:

```
./src/scripts/train_model.py $(cat ./config/train/micro)
```
Below is a collection of external links, papers, projects, and other potentially helpful starting points for the project.