sign-language-processing / detection-train

Training a sign language detection model

Downloading DGS Corpus #3

Closed: hshreeshail closed this issue 2 years ago

hshreeshail commented 2 years ago

How large is the DGS Corpus when downloaded using the create_tfrecord_dgs_corpus.py script? When I run the script, I get the following progress bar: (screenshot: progress_bar) If the numbers here are to be believed, the download will take a very long time (6+ hours). Note that internet speed is not the bottleneck here, since I am on a 150 Mbps connection and am getting an 80 Mbps download speed.

AmitMY commented 2 years ago

If include_video is False, the script should not download the videos, which are probably the largest files. The next largest are the pose files, probably a few tens of GB, but I don't know the exact size.

AmitMY commented 2 years ago

In the desired configuration, after downloading and sharding, the final size of the dataset on disk is 146 GB. It could be reduced by storing float16 instead of float32, but I do not see it as an issue currently.
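
To illustrate the float16 suggestion, here is a minimal sketch using NumPy. The array shape and the assumption that pose coordinates are normalized to [0, 1) are hypothetical and not taken from the script. The actual saving on the 146 GB total would depend on how much of it is float pose data and on the TFRecord serialization.

```python
import numpy as np

# Hypothetical pose tensor: frames x keypoints x (x, y, z).
# The shape and the [0, 1) value range are illustrative assumptions.
rng = np.random.default_rng(0)
pose_f32 = rng.random((1000, 137, 3), dtype=np.float32)

# Cast to half precision: 2 bytes per value instead of 4.
pose_f16 = pose_f32.astype(np.float16)

# Fraction of the original in-memory size after the cast.
ratio = pose_f16.nbytes / pose_f32.nbytes

# Largest rounding error introduced by the cast.
max_abs_error = float(np.max(np.abs(pose_f32 - pose_f16.astype(np.float32))))

print(f"size ratio: {ratio}, max abs error: {max_abs_error:.6f}")
```

The raw array size halves exactly. The rounding error stays small for values in [0, 1) but grows with magnitude, so unnormalized pixel coordinates would lose more precision.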

hshreeshail commented 2 years ago

Got it. I did not find any ablation studies in your paper that compare results using smaller amounts of training data. Given that we are training a single-layer LSTM model with at most ~50k parameters, 146 GB of training data seems a bit excessive.
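
For a rough sense of scale, the parameter count of a single LSTM layer can be computed with the standard formula 4 * (h * (d + h) + h) for input size d and hidden size h. The sizes below are hypothetical and are not the paper's actual configuration.

```python
def lstm_param_count(input_dim: int, hidden_dim: int) -> int:
    """Parameters of one standard LSTM layer with a single bias vector per gate.

    Each of the 4 gates has an input kernel (input_dim x hidden_dim),
    a recurrent kernel (hidden_dim x hidden_dim), and a bias (hidden_dim).
    """
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


# Hypothetical sizes chosen only for illustration.
example_count = lstm_param_count(input_dim=25, hidden_dim=64)
print(example_count)
```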