sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

SignBank loading: SignWriting: "AttributeError: 'numpy.ndarray' object has no attribute 'decode'" #70

Open cleong110 opened 1 month ago

cleong110 commented 1 month ago

This snippet from the example Colab notebook causes an AttributeError.

signbank = tfds.load(name='sign_bank')

for datum in itertools.islice(signbank["train"], 0, 10):
  print(datum['id'].numpy().decode('utf-8'), datum['sign_writing'].numpy().decode('utf-8'), [f.decode('utf-8') for f in datum['terms'].numpy()])

Rewriting it as three separate print statements localizes the error to sign_writing.


This seems to be because datum['sign_writing'].numpy() is actually an array of shape (1,) rather than a bytes object. Taking the first element and then calling decode works.
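A minimal sketch of the failure and the workaround, using plain NumPy so it runs without downloading the dataset. The array below is only a stand-in for what `datum['sign_writing'].numpy()` appears to return, and its byte payload is a made-up placeholder, not real SignBank data.

```python
import numpy as np

# Stand-in for datum['sign_writing'].numpy(): an object array of shape (1,)
# holding bytes. The payload is a hypothetical placeholder.
sign_writing = np.array([b"placeholder-fsw"], dtype=object)

# Calling .decode on the array itself fails, because ndarray has no such method.
try:
    sign_writing.decode("utf-8")
    error_message = None
except AttributeError as err:
    error_message = str(err)

# Indexing first yields a bytes object, which can be decoded.
first_item = sign_writing[0].decode("utf-8")

print(error_message)
print(first_item)
```

Indexing with `[0]` only works when the array is non-empty, so it is a diagnostic rather than a general fix.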

Compare with rwth-phoenix-weather-2014t.


cleong110 commented 1 month ago

Checking the first 5,000 examples in the dataset, it seems sign_writing can contain 0, 1, or 2 items.
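A check like this could be sketched roughly as follows. The helper name `count_sequence_lengths` is hypothetical, and the sample arrays are invented stand-ins for the per-example `sign_writing` arrays, not actual dataset contents.

```python
from collections import Counter

import numpy as np


def count_sequence_lengths(arrays):
    """Tally how many items each per-example array contains."""
    return Counter(len(a) for a in arrays)


# Invented stand-ins for datum['sign_writing'].numpy() across a few examples.
samples = [
    np.array([], dtype=object),
    np.array([b"a"], dtype=object),
    np.array([b"b"], dtype=object),
    np.array([b"c", b"d"], dtype=object),
]

length_counts = count_sequence_lengths(samples)
print(sorted(length_counts.items()))
```

On the real data, the same helper would presumably be fed `datum['sign_writing'].numpy()` for each datum in an `itertools.islice` over the train split.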


Looking at the source code, we also see that the feature is declared as a Sequence: https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/signbank/signbank.py#L198

cleong110 commented 1 month ago

I wonder if we can make it so that the library just automatically detects internally if it's a Sequence and prints accordingly?
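One way this could look on the user side is a small helper that accepts either form. This is only a sketch: `decode_feature` is a hypothetical name, not something the library provides, and it assumes the value is what `.numpy()` returns, meaning either a single bytes object or an array of bytes.

```python
import numpy as np


def decode_feature(value, encoding="utf-8"):
    """Decode a scalar bytes value to str, or an array of bytes to a list of str."""
    if isinstance(value, bytes):
        # Scalar string features come back from .numpy() as plain bytes.
        return value.decode(encoding)
    # Sequence features come back as an array with zero or more bytes items.
    return [item.decode(encoding) for item in np.asarray(value).ravel()]


scalar_result = decode_feature(b"some-id")
sequence_result = decode_feature(np.array([b"first", b"second"], dtype=object))
empty_result = decode_feature(np.array([], dtype=object))

print(scalar_result, sequence_result, empty_result)
```

The helper returns a list for Sequence features even when there is exactly one item, so callers always know which shape to expect from the feature type.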

cleong110 commented 1 month ago

The quick fix for this issue would be to simply edit the example notebook with a note, maybe something like:

import itertools

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the sign language datasets with TFDS

signbank = tfds.load(name='sign_bank')

for datum in itertools.islice(signbank["train"], 0, 10):
  print(datum['id'].numpy().decode('utf-8'))
  for signwriting_item in datum["sign_writing"]:  # This feature is a Sequence of strings
    print(signwriting_item.numpy().decode('utf-8'))
  print([f.decode('utf-8') for f in datum['terms'].numpy()])

cleong110 commented 1 month ago

My notebook where I test downloading SignBank: https://colab.research.google.com/drive/1hs_UjwKv_mMxZvtittI4AD--SA6cpT5k?usp=sharing