sign-language-processing / transcription

Text to pose model for sign language pose generation from a text sequence

text-to-pose: Compose signs #4

Open AmitMY opened 2 years ago

AmitMY commented 2 years ago

Given a sequence of signs in FSW (Formal SignWriting) notation:

AS14c20S27106M518x529S14c20481x471S27106503x489 AS18701S1870aS2e734S20500M518x533S1870a489x515S18701482x490S20500508x496S2e734500x468 S38800464x496


We could split the sequence into the first sign and the second (FSW signs are whitespace-separated). There needs to be an inference function that can "animate" the first sign starting from a neutral position, then use the last frame of the first sign as the starting frame for the second (ideally it would condition on the entire first sign, but that much context is expensive in memory).
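The chained inference described above could be sketched roughly as follows. Here `generate_sign` is a hypothetical stand-in for the actual text-to-pose model, and the keypoint layout (137 points, 2D) is an assumption for illustration only:

```python
import numpy as np

NEUTRAL_POSE = np.zeros((137, 2))  # assumed keypoint layout, purely illustrative

def generate_sign(fsw: str, seed_frame: np.ndarray, num_frames: int = 10) -> np.ndarray:
    """Hypothetical stand-in for the model: returns a (num_frames, keypoints, 2)
    pose clip whose first frame equals seed_frame."""
    # Dummy deterministic motion so the sketch runs end to end.
    offset = len(fsw) * 0.001
    steps = np.linspace(0, 1, num_frames)[:, None, None]
    return seed_frame[None] + steps * offset

def animate_sentence(signs: list) -> np.ndarray:
    """Animate each sign in turn, seeding every sign with the last frame
    of the previous one (the first sign starts from the neutral pose)."""
    clips = []
    seed = NEUTRAL_POSE
    for fsw in signs:
        clip = generate_sign(fsw, seed)
        clips.append(clip)
        seed = clip[-1]  # last frame of this sign seeds the next sign
    return np.concatenate(clips)

fsw_sequence = ("AS14c20S27106M518x529S14c20481x471S27106503x489 "
                "AS18701S1870aS2e734S20500M518x533S1870a489x515S18701482x490S20500508x496S2e734500x468 "
                "S38800464x496")
signs = fsw_sequence.split()  # split the FSW string into individual signs
poses = animate_sentence(signs)
```

With this seeding scheme the first frame of each clip coincides with the last frame of the previous one, so the concatenated video has no jump at sign boundaries.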

This generation process should include a mechanism that prevents the iterative refinement from editing the first frame, so that the appearance of the person does not drift over successive iterations.
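One simple way to enforce this is to re-pin the seed frame after every refinement pass. This is a minimal sketch; `refinement_step` is a hypothetical placeholder for one pass of the actual model:

```python
import numpy as np

def refinement_step(poses: np.ndarray) -> np.ndarray:
    # Placeholder for one model refinement pass: here, neighbor smoothing.
    smoothed = poses.copy()
    smoothed[1:-1] = (poses[:-2] + poses[1:-1] + poses[2:]) / 3
    return smoothed

def refine(poses: np.ndarray, num_iterations: int = 5) -> np.ndarray:
    """Iteratively refine a (frames, keypoints, dims) pose sequence
    while keeping frame 0 frozen across all iterations."""
    anchor = poses[0].copy()  # the seed frame must not drift
    for _ in range(num_iterations):
        poses = refinement_step(poses)
        poses[0] = anchor  # restore the first frame after every edit
    return poses
```

Restoring the anchor after each pass (rather than only at the end) also keeps every intermediate iteration consistent with the fixed starting appearance.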


Notes:

  1. Training in a dictionary-like scenario is a bit awkward, because we would be teaching the model to go from a neutral position, to the sign, and back to a neutral position - so the skeleton would always return to neutral between signs. This could either be "cropped" out with a heuristic we already have, or it may not be a problem at all when using continuous data like the DGS Corpus. Another option is to seed not with the last frame but with the last N frames, reversed, so that each clip starts by moving the hands up to the previous position.
  2. The length of each sign should be modeled independently as a mean (mu) and standard deviation (std); then, when a target sentence length is provided (extremely important for subtitling), find the per-sign lengths that best match it - https://github.com/sign-language-processing/transcription/issues/1
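For note 2, one reasonable allocation scheme (an assumption here, not necessarily what issue #1 specifies) is the maximum-likelihood solution under independent Gaussians with a fixed total: each sign absorbs the length surplus in proportion to its variance, since the Lagrangian condition gives l_i = mu_i + lambda * sigma_i^2 with lambda = (T - sum mu) / sum sigma^2:

```python
import numpy as np

def allocate_lengths(mus, stds, total):
    """Distribute a target total length over signs whose lengths are
    modeled as independent Gaussians (mu_i, std_i), maximizing the
    joint likelihood subject to sum(lengths) == total."""
    mus = np.asarray(mus, dtype=float)
    var = np.asarray(stds, dtype=float) ** 2
    surplus = total - mus.sum()
    # Higher-variance signs stretch or shrink more than rigid ones.
    return mus + var * surplus / var.sum()

lengths = allocate_lengths([20, 35], [4, 8], total=70)
```

In this example the 15-frame surplus is split 16/80 vs 64/80 between the two signs, giving lengths 23 and 47; when no target length is provided, the allocation degenerates to the per-sign means.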