sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

AssertionError: Unsynchronized Poses in dgs_corpus Dataset (Document 1177918) #79

Open RongLirr opened 1 week ago

RongLirr commented 1 week ago

When loading the dgs_corpus dataset, an AssertionError occurs due to unsynchronized pose shapes within one of the documents.

Here is the error message: AssertionError: Document 1177918: The poses are not synchronized ([(28254, 1, 543, 3), (14127, 1, 543, 3)])

Assertion code (sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py):

    assert all(
        p.body.data.shape == first_pose.body.data.shape for p in poses_values
    ), f"Document {document_id}: The poses are not synchronized ({[p.body.data.shape for p in poses_values]})"
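For reference, a minimal sketch of how this check could be relaxed into a warning instead of a hard failure (the function name `check_pose_sync` is hypothetical; it assumes pose objects expose their frame data as `p.body.data`, as in the loader):

```python
import warnings

def check_pose_sync(document_id, poses_values):
    """Warn, rather than crash, when pose arrays in one document differ in shape.

    `poses_values` is assumed to be a list of pose objects whose frame data
    lives in `p.body.data` (a NumPy-like array with a `.shape` attribute).
    Returns True when all shapes match, False otherwise.
    """
    shapes = [p.body.data.shape for p in poses_values]
    if len(set(shapes)) > 1:
        warnings.warn(
            f"Document {document_id}: The poses are not synchronized ({shapes})"
        )
        return False
    return True
```

The generator could then skip (or attempt to repair) unsynchronized documents instead of aborting the whole download.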

AmitMY commented 1 week ago

Thank you for making this a warning. What I suspect: one of the videos is at 50fps and the other at 25fps, so they have differing numbers of frames. This should be handled properly in the future; for now we have a warning :)
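Notably, 28254 is exactly 2 × 14127, which fits the 50fps/25fps theory. A hypothetical normalization (not part of the loader) could downsample the faster stream whenever the frame counts differ by an exact integer factor:

```python
def align_frame_counts(a, b):
    """Downsample the longer of two frame sequences when its length is an
    exact integer multiple of the shorter one (e.g. a 50fps recording paired
    with a 25fps one).

    `a` and `b` are sequences of frames; returns a pair of equal length,
    or raises ValueError when the lengths are not compatible.
    """
    if len(a) == len(b):
        return a, b
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    if len(longer) % len(shorter) != 0:
        raise ValueError(f"Incompatible frame counts: {len(a)} vs {len(b)}")
    step = len(longer) // len(shorter)
    longer = longer[::step]  # keep every `step`-th frame
    return (longer, shorter) if len(a) > len(b) else (shorter, longer)
```

For document 1177918 this would reduce the 28254-frame pose to 14127 frames by keeping every second frame; whether that is the right fix depends on how the two recordings are actually aligned.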

cleong110 commented 1 week ago

This happened for me as well, same Document number, 1177918.

It then tried to remove the incomplete dir and failed, with OSError: [Errno 39] Directory not empty

cleong110 commented 1 week ago

The OSError is then not caught, and the entire load operation fails as a result.
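One possible mitigation, sketched here as a hypothetical retry wrapper (not part of TFDS): "Directory not empty" can be transient, for example when NFS temp files linger briefly after the writing process closes them, so retrying the delete a few times often succeeds.

```python
import shutil
import time

def rmtree_with_retry(path, attempts=3, delay=1.0):
    """Retry `shutil.rmtree` a few times before giving up.

    'Directory not empty' (errno 39) can be transient on network filesystems,
    where temp files may linger briefly after close. Re-raises the last
    OSError if all attempts fail.
    """
    for i in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```

In this case the cleanup lives inside TFDS's `incomplete_dir` context manager, so a wrapper like this would have to be applied upstream; removing the leftover `incomplete.*` directory by hand before re-running also works around it.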

cleong110 commented 2 days ago

@AmitMY Here's a stack trace. The failed assertion triggers cleanup of the incomplete directory, the cleanup raises an OSError, and the OSError crashes the whole thing.


Traceback (most recent call last):
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/utils/file_utils.py", line 125, in incomplete_dir
    yield tmp_dir
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_builder.py", line 756, in download_and_prepare
    self._download_and_prepare(
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1752, in _download_and_prepare
    split_infos = self._generate_splits(dl_manager, download_config)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1727, in _generate_splits
    future = split_builder.submit_split_generation(
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/split_builder.py", line 436, in submit_split_generation
    return self._build_from_generator(**build_kwargs)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/split_builder.py", line 496, in _build_from_generator
    for key, example in utils.tqdm(
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py", line 388, in _generate_examples
    assert all(
AssertionError: Document 1177918: The poses are not synchronized ([(28254, 1, 543, 3), (14127, 1, 543, 3)])

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/home/cleong/sldata_download.py", line 22, in <module>
    dataset, info = tfds.load(
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 176, in __call__
    return function(*args, **kwargs)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 661, in load
    _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/load.py", line 517, in _download_and_prepare_builder
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/logging/__init__.py", line 176, in __call__
    return function(*args, **kwargs)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/dataset_builder.py", line 737, in download_and_prepare
    with utils.incomplete_dir(
  File "/opt/home/cleong/envs/sldata/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/tensorflow_datasets/core/utils/file_utils.py", line 131, in incomplete_dir
    tmp_path.rmtree()
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/etils/epath/gpath.py", line 220, in rmtree
    self._backend.rmtree(self._path_str)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/site-packages/etils/epath/backend.py", line 193, in rmtree
    shutil.rmtree(path)
  File "/opt/home/cleong/envs/sldata/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/opt/home/cleong/envs/sldata/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/data/petabyte/cleong/data/tfds_sign_language_datasets/dgs_corpus/holistic/incomplete.9K8YL7_3.0.0'
AmitMY commented 2 days ago

@cleong110 can you share your exact command? It looks like maybe one of the poses is 25fps and the other is 50fps, so I want to see how you run it.

cleong110 commented 2 days ago

I am just calling tfds.load with "dgs_corpus/holistic" as the name. I'm trying to download some of the datasets locally, using the following script, which takes a dataset name and downloads it.

Full script:

# https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/autsl/autsl.py
# /opt/home/cleong/envs/sldata/lib/python3.10/site-packages/sign_language_datasets/datasets/autsl/autsl.py

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (registers the datasets with TFDS)
from sign_language_datasets.datasets.config import SignDatasetConfig
from pathlib import Path
import argparse
import itertools

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="attempt to download a dataset from sign-language-datasets, e.g. 'dgs_corpus/holistic'")
    parser.add_argument("dataset_name", help="something like 'dgs_corpus'")
    parser.add_argument("--data_dir", type=Path, default=Path("/data/petabyte/cleong/data/tfds_sign_language_datasets"))
    args = parser.parse_args()

    # config = SignDatasetConfig(name="only-annotations", version="1.0.0", include_video=False)
    # config = SignDatasetConfig(name="poses-please", include_pose="holistic")
    # autsl = tfds.load(name='autsl', data_dir=data_dir, builder_kwargs={"config": config})
    # autsl = tfds.load(name='autsl/holistic', data_dir=data_dir)
    dataset, info = tfds.load(
        name=str(args.dataset_name),
        # builder_kwargs={"config": config}, 
        data_dir=args.data_dir, 
        with_info=True)

    for datum in itertools.islice(dataset["train"], 0, 2):
        print("datum:")
        print(datum)

    print(info)

I called that script thus:

python sldata_download.py "dgs_corpus/holistic" 2>&1|tee dgs_corpus_fails.txt