feature: Quickstart with smaller dataset

sign-language-processing / sign-vq

Vector Quantizer for Sign Language MediaPipe Poses

feature: Quickstart with smaller dataset #1

Open cleong110 opened 7 months ago

cleong110 commented 7 months ago

It would be really nice to have a smaller or even toy dataset for "get it running and give it a go" purposes, so that you don't need to download 500GB of pose data.

Potentially DGS Corpus or RWTH Phoenix 2014 T?

cleong110 commented 7 months ago

After various issues (https://github.com/sign-language-processing/datasets/issues/65, https://github.com/sign-language-processing/datasets/issues/66, https://github.com/sign-language-processing/datasets/issues/67) I managed to get DGS Corpus running... until it crashed with a "Killed" message.

The default setting for DGS Corpus seems to download all the videos and load enough of them into memory that my system crashed.

cleong110 commented 7 months ago

https://stackoverflow.com/questions/65231843/is-it-possible-to-only-load-part-of-a-tensorflow-dataset suggests a potential solution. You may still have to download the whole dataset, but you might be able to load only a portion of it by specifying something like split="train[:5%]".
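A minimal sketch of what that could look like, untested against DGS Corpus. The helper names `sliced_split` and `load_subset` are made up for illustration; the only library call is `tfds.load` with a slicing-style `split` string. It assumes the dataset is already registered with TFDS under whatever name you pass in, and it does not reduce the download size, only how much of the split gets loaded.

```python
def sliced_split(split: str = "train", percent: int = 5) -> str:
    """Build a TFDS slicing-API split string, e.g. 'train[:5%]'."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{split}[:{percent}%]"


def load_subset(dataset_name: str, split: str = "train", percent: int = 5):
    """Load only the first `percent` percent of a split (hypothetical helper).

    Assumes `dataset_name` is already registered with TFDS. The full dataset
    may still be downloaded and prepared; only the loaded portion shrinks.
    """
    # Imported lazily so sliced_split() works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(dataset_name, split=sliced_split(split, percent))
```

Usage would be something like `load_subset("dgs_corpus", percent=5)`, where the dataset name is an assumption on my part. Whether this avoids the "Killed" crash depends on how much the builder loads into memory during preparation, so treat it as something to try rather than a confirmed fix.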

cleong110 commented 7 months ago

https://www.tensorflow.org/datasets/splits#slicing_api

cleong110 commented 7 months ago

OK, answered my own question here, but let me test it out at least.

cleong110 commented 7 months ago

Made an issue about it, https://github.com/sign-language-processing/datasets/issues/68