sign-language-processing / transcription

Text to pose model for sign language pose generation from a text sequence

text-to-pose: Scale up sequence size #2

Open AmitMY opened 2 years ago

AmitMY commented 2 years ago

Because our model is memory intensive (a transformer, so O(n²) in sequence length), we cap all training data at a maximum sequence length and batch size. (filter here)

Currently, if memory serves me right, out of the ~4,000 videos in the dicta_sign dataset, the model trains on only ~2,500 because of the 100-frame cap. (With more frames we hit an out-of-memory error, which may be related to more than just the transformer.)
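
A minimal sketch of the kind of length filter described above; the field name `"pose"`, the helper name, and the Hugging Face-style `filter` usage are assumptions for illustration, not the repo's actual code:

```python
MAX_FRAMES = 100  # current cap discussed in this issue

def keep_example(example: dict) -> bool:
    """Drop training examples whose pose sequence exceeds the frame cap."""
    return len(example["pose"]) <= MAX_FRAMES

# e.g. with a datasets-style Dataset object:
# train_dataset = train_dataset.filter(keep_example)
```

Any increase of `MAX_FRAMES` should recover more of the ~1,500 filtered-out videos, at the cost of the memory issues described above.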

The ideal backbone for the pose encoder, in my opinion, is an S4 model, while the text (usually much shorter) could still use a transformer.
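
A rough sketch of that split, assuming a standard transformer encoder for the short text and a pluggable long-sequence block for the poses; `pose_backbone` is a placeholder for an S4 block (e.g. from the state-spaces/s4 repo), and none of the module names mirror the repo's actual code:

```python
import torch
import torch.nn as nn

class HybridTextPoseEncoder(nn.Module):
    def __init__(self, vocab_size: int, pose_dim: int, d_model: int,
                 pose_backbone: nn.Module):
        super().__init__()
        # Text side: short sequences, so quadratic attention is affordable.
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Pose side: long sequences, handled by an S4-style state-space
        # block whose memory grows roughly linearly with sequence length.
        self.pose_projection = nn.Linear(pose_dim, d_model)
        self.pose_backbone = pose_backbone

    def forward(self, text_tokens: torch.Tensor, poses: torch.Tensor):
        text_repr = self.text_encoder(self.text_embedding(text_tokens))
        pose_repr = self.pose_backbone(self.pose_projection(poses))
        return text_repr, pose_repr
```

For a quick smoke test, any module that maps `(batch, time, d_model)` to the same shape (even a GRU wrapper) can stand in for the backbone until a real S4 block is wired in.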

We should experiment with how to increase the maximum input sequence length for the model.
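
One way to run that experiment is to measure peak GPU memory for a single training step at increasing sequence lengths; the `make_batch` factory and loss-returning forward below are hypothetical, only the `torch.cuda` calls are real:

```python
import torch

def peak_memory_for_length(model, make_batch, seq_len: int) -> float:
    """Run one forward/backward pass and return peak CUDA memory in GiB."""
    torch.cuda.reset_peak_memory_stats()
    batch = make_batch(seq_len)        # hypothetical batch factory
    loss = model(**batch)              # assumes the forward returns a scalar loss
    loss.backward()
    model.zero_grad(set_to_none=True)
    return torch.cuda.max_memory_allocated() / 2**30

# for length in (100, 200, 400, 800):
#     print(length, peak_memory_for_length(model, make_batch, length))
```

If memory grows much faster than quadratically with length, that would support the suspicion that something beyond the transformer attention is also to blame.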