sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

Add download size information to documentation #71

Open cleong110 opened 1 month ago

cleong110 commented 1 month ago

A la https://github.com/tensorflow/datasets/issues/120, it would be helpful to have an estimate of how large each dataset is before downloading it. Ideally there would also be a breakdown by feature.
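For reference, once a dataset has been built, tfds already tracks both numbers on its DatasetInfo. A minimal sketch of the lookup we would want the documentation to surface (assuming the dataset has already been downloaded and prepared locally):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # importing this registers the SL datasets with tfds

builder = tfds.builder("autsl")
builder.download_and_prepare()  # sizes are only known after this has run once

# download_size: size of the raw downloaded files
# dataset_size: size of the prepared dataset on disk
print(builder.info.download_size)
print(builder.info.dataset_size)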

Currently taking a crack at it.

cleong110 commented 3 weeks ago

Digging a bit deeper, tfds seems to have a script for generating documentation: https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/documentation/document_datasets.py

which is used by another script, build_catalog.py: https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/documentation/build_catalog.py

cleong110 commented 3 weeks ago

What I'm trying:

  1. clone the tensorflow_datasets repo
  2. activate an environment with sign_language_datasets installed
  3. run the documentation scripts.
cleong110 commented 3 weeks ago

Had to pip install pyyaml and pandas, then ran build_catalog.py, which complained about a missing "stable_versions.txt".

That seems to come from https://github.com/tensorflow/datasets/blob/8e64e46efe1fe2bc9488dbf266a4a5400c422c42/tensorflow_datasets/scripts/freeze_dataset_versions.py

cleong110 commented 3 weeks ago

When I run THAT, it writes 5812 dataset versions to a file in my conda env:

/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt
cleong110 commented 3 weeks ago

Of course I want it to also register the sign language datasets, right? So I edited the script to import them as well, like so:

from absl import app

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # added: importing this registers the SL datasets with tfds

def main(_):
  tfds.core.visibility.set_availables([
      tfds.core.visibility.DatasetType.TFDS_PUBLIC,
  ])

  registered_names = tfds.core.load.list_full_names()
  version_path = tfds.core.utils.tfds_write_path() / 'stable_versions.txt'
  version_path.write_text('\n'.join(registered_names))
  print(f'{len(registered_names)} datasets versions written to {version_path}.')

if __name__ == '__main__':
  app.run(main)

When I run it THEN, it writes 5858 dataset versions instead. Opening up stable_versions.txt, I see a few SL datasets, including autsl.

cleong110 commented 3 weeks ago

Attached: tfds_stable_versions_no_sl.txt and tfds_stable_versions_sl.txt, the two versions of the file, copied and renamed.

Apparently the comm utility lets you find diffs easily.

output:

comm -23 tfds_stable_versions_sl.txt tfds_stable_versions_no_sl.txt > sl_stable_versions.txt
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
comm: input is not in sorted order

OK, let's sort then.

cleong110 commented 3 weeks ago

List the filenames, pipe to GNU parallel (yes, I will cite it, don't worry), and sort each file, writing the result to a _sorted.txt sibling ({.} is parallel's placeholder for the input filename minus its extension):

ls tfds_stable_versions* | parallel sort --output {.}_sorted.txt {}
cleong110 commented 3 weeks ago

NOW:

comm -23 tfds_stable_versions_sl_sorted.txt tfds_stable_versions_no_sl_sorted.txt > tfds_stable_versions_sl_only.txt

Which gives us just the sign-language entries, since -23 suppresses the lines unique to the second file and the lines common to both:

asl_citizen/default/1.0.0
aslg_pc12/0.0.1
asl_lex/annotations/2.0.0
asl_lex/default/2.0.0
asl_signs/default/1.0.0
autsl/default/1.0.0
autsl/holistic/1.0.0
autsl/openpose/1.0.0
bsl_corpus/annotations/1.0.0
bsl_corpus/default/1.0.0
chicago_fs_wild/default/2.0.0
dgs_corpus/annotations/3.0.0
dgs_corpus/default/3.0.0
dgs_corpus/holistic/3.0.0
dgs_corpus/openpose/3.0.0
dgs_corpus/sentences/3.0.0
dgs_corpus/videos/3.0.0
dgs_types/annotations/3.0.0
dgs_types/default/3.0.0
dgs_types/holistic/3.0.0
dicta_sign/annotations/1.0.0
dicta_sign/default/1.0.0
dicta_sign/poses/1.0.0
how2_sign/default/1.0.0
mediapi_skel/default/1.0.0
ngt_corpus/annotations/3.0.0
ngt_corpus/default/3.0.0
ngt_corpus/videos/3.0.0
rwth_phoenix2014_t/annotations/3.0.0
rwth_phoenix2014_t/default/3.0.0
rwth_phoenix2014_t/poses/3.0.0
rwth_phoenix2014_t/videos/3.0.0
sem_lex/default/1.0.0
sign2_mint/annotations/1.0.0
sign2_mint/default/1.0.0
sign_bank/default/1.0.0
sign_suisse/default/1.0.0
sign_suisse/holistic/1.0.0
sign_typ/default/1.0.0
sign_wordnet/default/0.2.0
spread_the_sign/default/1.0.0
swojs_glossario/annotations/1.0.0
swojs_glossario/default/1.0.0
wlasl/default/0.3.0
wmtslt/annotations/1.2.0
wmtslt/default/1.2.0
cleong110 commented 3 weeks ago

Which, I'm just gonna overwrite the stable_versions.txt with that...


cat tfds_stable_versions_sl_only.txt > /home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/stable_versions.txt
cleong110 commented 3 weeks ago

Sigh: [screenshot: the catalog build fails on an assertion]

Offending assertion: [screenshot of the assert in the tfds documentation scripts]

cleong110 commented 3 weeks ago

Note also that it's using the document_datasets.py from site-packages, not the one in the cloned repo: /home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/document_datasets.py

Just gonna comment that bit out and try again...

FileNotFoundError: Error for asl_citizen: [Errno 2] No such file or directory: '/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages/tensorflow_datasets/scripts/documentation/tfds_to_pwc_links.json'

Digging into the code, that file is described as the "# Filepath for mapping between TFDS datasets and PapersWithCode entries."
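One way past this might be to hand the script an empty mapping, since we don't expect PapersWithCode entries for these datasets anyway. An untested sketch, using the path from the error above:

from pathlib import Path

# Path copied from the FileNotFoundError above.
pwc_path = Path(
    "/home/vlab/miniconda3/envs/tfds_sl/lib/python3.10/site-packages"
    "/tensorflow_datasets/scripts/documentation/tfds_to_pwc_links.json"
)
pwc_path.write_text("{}")  # empty JSON object: no TFDS-to-PapersWithCode links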

OK, so dataset_markdown_builder has a bunch of sections we don't care about. What if we comment those out?


cleong110 commented 3 weeks ago

Still no luck. Getting weird auth token errors. Tried a few datasets.


I give up. This seems like a dead end.

cleong110 commented 1 week ago

Set up a script to simply loop through the available datasets and tfds.load every builder config. Then I can read the download and dataset sizes from the returned ds_info.
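The core of it looks roughly like this (a simplified sketch; the dataset names here are just a sample, and the real script loops over everything registered):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # importing this registers the SL datasets with tfds

names = ["autsl", "dicta_sign", "ngt_corpus"]  # sample; the real loop covers all SL datasets

sizes = {}
for name in names:
    for config in tfds.builder_cls(name).BUILDER_CONFIGS:
        key = f"{name}/{config.name}"
        try:
            # tfds.load runs download_and_prepare under the hood and
            # returns the DatasetInfo alongside the dataset itself.
            _, ds_info = tfds.load(key, with_info=True)
            sizes[key] = {
                "download_size": ds_info.download_size,
                "dataset_size": ds_info.dataset_size,
            }
        except Exception as e:  # record the failure and keep sweeping
            sizes[key] = {"download_result": e}

for key, result in sizes.items():
    print(key, result)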

DGS Corpus is the one holdout: the download process crashes very consistently. Even when passing process_video=False, I have not figured out any way to download any config other than "annotations". Spent two hours trying. And tfds has no method to download only, without preparing.
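The closest thing to a download-only step that I can see is driving the builder's private API by hand. A rough, untested sketch (not a supported entry point, and beam-based builders take extra arguments):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # importing this registers the SL datasets with tfds

builder = tfds.builder("dgs_corpus/holistic")

# _split_generators is where a builder declares and fetches its source files,
# so calling it directly downloads them without the prepare/serialize step.
dl_manager = tfds.download.DownloadManager(
    download_dir="/tmp/tfds_downloads",  # hypothetical scratch directory
    dataset_name=builder.name,
)
builder._split_generators(dl_manager)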

Who decided that download_and_prepare was a good idea for a function? Functions should do one thing!

cleong110 commented 4 days ago

Managed to download many of the datasets and check the sizes, or log the error:

Cleaned up (in the raw dump each config appeared twice, once under its builder class name and once under its registered snake_case name, with identical values):

dataset/config                   download_size   dataset_size
asl_lex/default                  1.92 MiB        14.79 MiB
asl_lex/annotations              1.92 MiB        14.79 MiB
autsl/default                    22.66 GiB       577.97 GiB
autsl/holistic                   13.80 GiB       22.40 GiB
autsl/openpose                   1.03 GiB        3.35 GiB
dgs_corpus/openpose              46.23 GiB       27.56 GiB
dgs_types/annotations            336.11 MiB      1.72 MiB
dicta_sign/default               2.58 GiB        3.34 GiB
dicta_sign/poses                 2.18 GiB        3.34 GiB
dicta_sign/annotations           7.09 MiB        1.15 MiB
ngt_corpus/default               185.58 GiB      1.43 MiB
ngt_corpus/videos                185.58 GiB      1.43 MiB
ngt_corpus/annotations           76.40 MiB       389.65 KiB
rwth_phoenix2014_t/poses         5.14 GiB        7.67 GiB
rwth_phoenix2014_t/annotations   806.71 KiB      1.90 MiB
sign_bank/default                113.86 MiB      140.10 MiB
sign_suisse/default              2.77 MiB        4.97 MiB
sign_suisse/holistic             33.57 GiB       9.96 GiB
swojs_glossario/default          352.28 KiB      79.99 KiB
swojs_glossario/annotations      352.28 KiB      79.99 KiB

And the failures:

dataset/config                   error
chicago_fs_wild/default          ExtractError while extracting the ChicagoFSWildPlus .tgz
dgs_corpus/default               'DGS CORPUS IS GARBAGE' (the sentinel my script records when the download crashes)
dgs_corpus/videos                'DGS CORPUS IS GARBAGE'
dgs_corpus/holistic              'DGS CORPUS IS GARBAGE'
dgs_corpus/annotations           'DGS CORPUS IS GARBAGE'
dgs_corpus/sentences             'DGS CORPUS IS GARBAGE'
dgs_types/default                DownloadError: HTTP 404 for https://www.sign-lang.uni-hamburg.de/korpusdict/clips/3252569_1.mp4
dgs_types/holistic               TypeError serializing `views/pose/data` (TensorInfo(shape=(None, None, 1, 576, 3), dtype=float32)): 'NoneType' object cannot be interpreted as an integer
how2_sign/default                DownloadError: HTTP 404 for https://drive.usercontent.google.com/download?id=1dYey1F_SeHets-UO8F9cE3VMhRBO-6e0&export=download
rwth_phoenix2014_t/default       SSLError: CERTIFICATE_VERIFY_FAILED for www-i6.informatik.rwth-aachen.de (phoenix-2014-T.v3.tar.gz)
rwth_phoenix2014_t/videos        SSLError: CERTIFICATE_VERIFY_FAILED for www-i6.informatik.rwth-aachen.de (phoenix-2014-T.v3.tar.gz)
sign2_mint/default               JSONDecodeError: Expecting value: line 1 column 1 (char 0)
sign2_mint/annotations           JSONDecodeError: Expecting value: line 1 column 1 (char 0)
sign_typ/default                 ConnectionError: failed to resolve signtyp.uconn.edu
sign_wordnet/default             ImportError: Please install nltk with: pip install nltk
wlasl/default                    Exception('die')