**Open** · AmitMY opened this issue 1 year ago
We are done with our segmentation model: https://arxiv.org/abs/2310.13960
We should integrate it by:
- Removing free camera support; the user should be able to stop and restart the camera.
- Performing pose estimation and segmentation. Segments are stored as an array of arrays: sentences, and within them, signs.
- Showing segments in multiple ways:
  - When hovering over a segment, the video plays in a loop of only that segment.
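To make the "array of arrays" shape concrete, here is a minimal sketch of what such a structure could look like. The field names (`start_frame`, `end_frame`) and the helper function are illustrative assumptions, not the actual data model:

```python
# Hypothetical sketch: segments as an array of sentences,
# each sentence an array of sign segments with frame ranges.
segments = [
    [  # sentence 1
        {"start_frame": 0, "end_frame": 30},   # sign 1
        {"start_frame": 31, "end_frame": 55},  # sign 2
    ],
    [  # sentence 2
        {"start_frame": 60, "end_frame": 90},  # sign 1
    ],
]

def sentence_span(sentence):
    """Frame span covering a whole sentence, e.g. for loop playback on hover."""
    return sentence[0]["start_frame"], sentence[-1]["end_frame"]

print(sentence_span(segments[0]))  # -> (0, 55)
```

With this shape, the hover-to-loop behavior only needs the span of the hovered segment, whether it is a single sign or a whole sentence.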
Problem
Given a pose sequence, we would like to perform two types of segmentation.
- Sentence segmentation: every sentence should then be translated independently.
- Sign segmentation: every sign in a sentence should be transcribed to SignWriting independently.
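Segmentation models of this kind often emit per-frame BIO tags (Begin / Inside / Outside) that are then collapsed into spans. Assuming such an output format (the actual model may differ), the conversion could be sketched as:

```python
def bio_to_segments(tags):
    """Collapse per-frame BIO tags into (start, end) frame spans.

    A sketch of one common decoding scheme; the real model's output
    format may differ.
    """
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:       # previous segment ends here
                segments.append((start, i - 1))
            start = i                   # new segment begins
        elif tag == "O" and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:               # segment runs to the last frame
        segments.append((start, len(tags) - 1))
    return segments

print(bio_to_segments(["O", "B", "I", "I", "O", "B", "I"]))  # -> [(1, 3), (5, 6)]
```

Running the same decoding at two granularities (sentence tags and sign tags) would yield the nested sentences-and-signs structure described above.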
Description
We currently have such a segmentation model: https://github.com/sign-language-processing/transcription/tree/main/pose_to_segments. It works reasonably well for sentences, but not well at all for signs.
We should perhaps look into developing an autoregressive model like https://arxiv.org/pdf/2301.02214.pdf. That way, we could also perform segmentation live.
Alternatives
Use the existing model, which is bidirectional and therefore requires re-running on the whole sequence every time new frames arrive.
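The compute tradeoff between the two approaches can be sketched as follows. Assuming a fixed per-frame cost, re-running a bidirectional model after every incoming frame does quadratic total work, while a causal (autoregressive) model processes each frame once:

```python
def bidirectional_rerun_cost(num_frames, per_frame_cost=1):
    """Re-running a bidirectional model after each new frame:
    frame t requires reprocessing all t frames, so total work is quadratic."""
    return sum(t * per_frame_cost for t in range(1, num_frames + 1))

def autoregressive_cost(num_frames, per_frame_cost=1):
    """A causal model consumes each incoming frame exactly once: linear work."""
    return num_frames * per_frame_cost

print(bidirectional_rerun_cost(100))  # -> 5050
print(autoregressive_cost(100))       # -> 100
```

This is an idealized model (it ignores batching and incremental caching tricks), but it illustrates why the bidirectional alternative becomes expensive for live, frame-by-frame use.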
Additional context
No response