Pose segmentation model at both the sentence and sign levels
Code for the paper Linguistically Motivated Sign Language Segmentation.
pip install git+https://github.com/sign-language-processing/segmentation
To create an ELAN file with sign and sentence segments (to demo this on a longer file, you can download a large pose file from here):
pose_to_segments --pose="sign.pose" --elan="sign.eaf" [--video="sign.mp4"]
We tag pose sequences with BIO (beginning/inside/outside) tags and classify each frame. Because the model is intended to work on very long sequences (full videos), we do not use a transformer. The loss is heavily weighted in favor of "B", since it is a rare prediction compared to "I" and "O".
pose_embedding = embed_pose(pose)
pose_encoding = encoder(pose_embedding)
sign_bio = sign_bio_tagger(pose_encoding)
sentence_bio = sentence_bio_tagger(pose_encoding)
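
For orientation, here is a minimal PyTorch sketch of this architecture, assuming a BiLSTM encoder, linear BIO heads, and a weighted cross-entropy loss; layer names, sizes, and loss weights are illustrative rather than the repository's exact implementation:

import torch
import torch.nn as nn

class PoseTaggingModel(nn.Module):
    """Sketch: pose frames -> embedding -> BiLSTM -> per-frame BIO logits."""

    def __init__(self, pose_dim: int, hidden_dim: int = 256, num_layers: int = 1):
        super().__init__()
        self.embedding = nn.Linear(pose_dim, hidden_dim)   # embed_pose
        self.encoder = nn.LSTM(hidden_dim, hidden_dim, num_layers=num_layers,
                               batch_first=True, bidirectional=True)
        self.sign_head = nn.Linear(2 * hidden_dim, 3)      # B / I / O per frame
        self.sentence_head = nn.Linear(2 * hidden_dim, 3)  # B / I / O per frame

    def forward(self, pose: torch.Tensor):
        # pose: (batch, frames, pose_dim)
        pose_embedding = self.embedding(pose)
        pose_encoding, _ = self.encoder(pose_embedding)
        sign_bio = self.sign_head(pose_encoding)
        sentence_bio = self.sentence_head(pose_encoding)
        return sign_bio, sentence_bio

# The loss weights "B" heavily since it is rare relative to "I" and "O";
# the class order [B, I, O] and the weight values here are placeholders.
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([100.0, 1.0, 1.0]))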
Optical flow is highly correlated with phrase boundaries.
3D hand normalization may help the model learn hand shape changes.
Watch this video to see how it's done.
This experiment (E0) attempts to reproduce the methodology of Moryossef et al. (2020) on the DGS corpus. Since they used a different document split and did not filter out erroneous data, our results are not directly comparable. This model takes optical flow as input and outputs "I" (is signing) and "O" (not signing) tags.
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=64 --encoder_depth=1 --encoder_bidirectional=false --optical_flow=true --only_optical_flow=true --weighted_loss=false --classes=io
We replace the IO tagging heads in E0 with BIO heads to form our baseline (E1). Our preliminary experiments indicate that using only the 75 hand and body keypoints as input and making the LSTM layer bidirectional yields the best results.
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true
Or, for the mediapi-skel dataset (phrase segmentation only):
# FPS is not relevant for mediapi-skel
export MEDIAPI_PATH=/shares/volk.cl.uzh/amoryo/datasets/mediapi/mediapi-skel.zip
export MEDIAPI_POSE_PATH=/shares/volk.cl.uzh/amoryo/datasets/mediapi/mediapipe_zips.zip
python -m sign_language_segmentation.src.train --dataset=mediapi_skel --pose=holistic --fps=0 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true
Although the 75 hand and body keypoints serve as an efficient minimal set for sign language detection/segmentation models, we investigate the impact of other, non-manual sign language articulators, namely the face. We introduce a reduced set of 128 face keypoints that trace the contour of the signer's face.
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --pose_components POSE_LANDMARKS LEFT_HAND_LANDMARKS RIGHT_HAND_LANDMARKS FACE_LANDMARKS --pose_reduce_face=true
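
As a reference, pose components can be selected with the pose-format library as sketched below; the reduction to the 128 face-contour keypoints itself is applied by the --pose_reduce_face flag, and its exact index subset is not reproduced here:

# Sketch: selecting pose components with the pose-format library.
from pose_format import Pose

with open("sign.pose", "rb") as f:
    pose = Pose.read(f.read())

# Keep body, hands, and face; --pose_reduce_face then trims the face
# to the reduced contour subset internally.
pose = pose.get_components(["POSE_LANDMARKS", "LEFT_HAND_LANDMARKS",
                            "RIGHT_HAND_LANDMARKS", "FACE_LANDMARKS"])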
At every time step $t$, we append the optical flow between $t$ and $t-1$ to the current pose frame as an additional dimension after the $XYZ$ axes.
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --optical_flow=true
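
A minimal numpy sketch of this feature, assuming pose data shaped (frames, people, points, 3) and defining flow as the per-keypoint displacement norm between consecutive frames scaled by the frame rate (the repository's exact definition and masking may differ):

import numpy as np

def append_optical_flow(data: np.ndarray, fps: float = 25.0) -> np.ndarray:
    # data: (frames, people, points, 3) with XYZ coordinates
    displacement = data[1:] - data[:-1]                     # frame-to-frame motion
    flow = np.linalg.norm(displacement, axis=-1) * fps      # per-keypoint speed
    flow = np.concatenate([np.zeros_like(flow[:1]), flow])  # flow at t=0 is 0
    # Append flow as a fourth channel after the XYZ axes: (..., points, 4)
    return np.concatenate([data, flow[..., np.newaxis]], axis=-1)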
At every time step, we normalize the hand poses and concatenate them to the current pose frame.
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --optical_flow=true --hand_normalization=true
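
One plausible reading of 3D hand normalization is sketched below: translate the wrist to the origin, scale by a reference bone length, and rotate the palm to a canonical orientation. This is an illustrative assumption based on the MediaPipe hand keypoint layout; see the paper and the video above for the actual procedure.

import numpy as np

def normalize_hand(hand: np.ndarray) -> np.ndarray:
    """Sketch: move the wrist to the origin, unit-scale, rotate to a fixed frame.
    hand: (21, 3) MediaPipe keypoints; 0=wrist, 5=index MCP, 9=middle MCP, 17=pinky MCP.
    """
    hand = hand - hand[0]                            # wrist at the origin
    hand = hand / (np.linalg.norm(hand[9]) + 1e-8)   # scale by wrist->middle-MCP length
    y = hand[9]                                      # unit vector: wrist -> middle MCP
    x = hand[17] - hand[5]                           # across the palm
    x = x - np.dot(x, y) * y                         # make x orthogonal to y
    x = x / (np.linalg.norm(x) + 1e-8)
    z = np.cross(x, y)                               # palm normal
    rotation = np.stack([x, y, z])                   # rows form an orthonormal basis
    return hand @ rotation.T                         # coordinates in the canonical frame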
We add autoregressive connections between time steps to encourage consistent output labels. The logits at time step $t$ are concatenated to the input of the next time step, $t+1$. This modification is made bidirectional by stacking two autoregressive encoders and summing their outputs before the softmax operation. However, this approach is inherently slow, since we must wait for the predictions of the previous time step before we can process the next one.
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=4 --encoder_bidirectional=true --encoder_autoregressive=true --optical_flow=true --hand_normalization=true --epochs=50 --patience=10
CAUTION: this experiment does not improve the model as expected and runs very slowly.
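
For illustration only, one direction of such an autoregressive encoder could look like the following sketch, where an LSTM cell receives the pose features concatenated with the previous step's logits (names and sizes are illustrative, not the repository's implementation):

import torch
import torch.nn as nn

class AutoregressiveEncoder(nn.Module):
    """Sketch: feed the previous step's logits back into the next step's input."""

    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int = 3):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + num_classes, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)
        self.num_classes = num_classes

    def forward(self, x: torch.Tensor):  # x: (batch, frames, input_dim)
        batch, frames, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        logits = x.new_zeros(batch, self.num_classes)
        outputs = []
        for t in range(frames):  # sequential: each step waits for the previous one
            h, c = self.cell(torch.cat([x[:, t], logits], dim=-1), (h, c))
            logits = self.head(h)
            outputs.append(logits)
        return torch.stack(outputs, dim=1)  # (batch, frames, num_classes)

A second copy run over the reversed sequence would provide the backward direction, with the two outputs summed before the softmax; the per-frame Python loop is what makes this approach slow.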
To test and evaluate a model, add the --train=false and --checkpoint flags. Take E1 as an example:
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --train=false --checkpoint=./models/E1-1/best.ckpt
It's also possible to adjust the decoding algorithm by setting the b_threshold and o_threshold flags:
python -m sign_language_segmentation.src.train --dataset=dgs_corpus --pose=holistic --fps=25 --hidden_dim=256 --encoder_depth=1 --encoder_bidirectional=true --train=false --checkpoint=./models/E1-1/best.ckpt --b_threshold=50 --o_threshold=50
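
Assuming the thresholds are percentages applied to the per-frame probabilities, so that a segment opens when the "B" probability exceeds b_threshold and closes when the "O" probability exceeds o_threshold, a rough sketch of such a decoder (not the repository's exact implementation) is:

import numpy as np

def decode_segments(probs: np.ndarray, b_threshold: float = 50.0, o_threshold: float = 50.0):
    # probs: (frames, 3) per-frame [B, I, O] probabilities, expressed as percentages
    segments, start = [], None
    for t, (b, _, o) in enumerate(probs):
        if start is None and b > b_threshold:
            start = t                     # open a segment on a confident "B"
        elif start is not None and o > o_threshold:
            segments.append((start, t))   # close the segment on a confident "O"
            start = None
    if start is not None:                 # close a segment left open at the end
        segments.append((start, len(probs)))
    return segments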
To test on an external dataset, see evaluate_mediapi.py for an example.
@inproceedings{moryossef-etal-2023-linguistically,
title = "Linguistically Motivated Sign Language Segmentation",
author = {Moryossef, Amit and Jiang, Zifan and M{\"u}ller, Mathias and Ebling, Sarah and Goldberg, Yoav},
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.846",
doi = "10.18653/v1/2023.findings-emnlp.846",
pages = "12703--12724",
}