Open cleong110 opened 1 month ago
Checking the first 5k examples in the dataset, it seems the sign_writing feature can contain 0, 1, or 2 items.
Looking at the source code, we also see that the Feature is a Sequence https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/signbank/signbank.py#L198
I wonder if we can make it so that the library just automatically detects internally if it's a Sequence and prints accordingly?
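A rough sketch of what that kind of detection could look like on the user side. The helper name decode_feature is hypothetical and not part of the library; it assumes the value passed in is what .numpy() returns, i.e. either a single bytes object or an iterable of bytes. The sample values are made-up stand-ins, not real SignBank entries.

```python
def decode_feature(value, encoding="utf-8"):
    """Hypothetical helper: decode a scalar bytes value or a sequence of bytes."""
    if isinstance(value, bytes):
        # Scalar feature (e.g. an id): decode directly
        return value.decode(encoding)
    # Sequence feature: decode each item; an empty sequence gives an empty list
    return [item.decode(encoding) for item in value]


scalar_result = decode_feature(b"some-id")
sequence_result = decode_feature([b"first", b"second"])
empty_result = decode_feature([])
print(scalar_result, sequence_result, empty_result)
```

Checking for bytes first keeps the scalar case unchanged, and the same call then covers entries with 0, 1, or 2 items.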
The quick fix for this issue would be to simply edit the example notebook with a note, maybe something like:
import itertools
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the datasets with tfds

signbank = tfds.load(name='sign_bank')
for datum in itertools.islice(signbank["train"], 0, 10):
    print(datum['id'].numpy().decode('utf-8'))
    for signwriting_item in datum["sign_writing"]:  # This feature is a Sequence of strings
        print(signwriting_item.numpy().decode('utf-8'))
    print([f.decode('utf-8') for f in datum['terms'].numpy()])
My notebook where I test downloading SignBank: https://colab.research.google.com/drive/1hs_UjwKv_mMxZvtittI4AD--SA6cpT5k?usp=sharing
This snippet from the example Colab notebook causes an AttributeError.
Rewriting it as three print statements localizes the error to sign_writing.
It seems this is because sign_writing is actually an array of shape (1,) rather than bytes. Taking the first element and then calling decode works.
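A minimal reproduction of that behaviour using only NumPy, so it runs without downloading the dataset. The array here is a made-up stand-in for what .numpy() returns for sign_writing, not a real entry.

```python
import numpy as np

# Made-up stand-in for the sign_writing value: a length-1 array of bytes
sign_writing = np.array([b"example"], dtype=object)

try:
    sign_writing.decode("utf-8")  # an ndarray has no decode method
except AttributeError as err:
    error_message = str(err)

# Index first, then decode the bytes element
first_item = sign_writing[0].decode("utf-8")
print(sign_writing.shape, first_item)
```

Note that indexing with [0] would fail on entries where the sequence is empty, so looping over the items is probably the safer pattern.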
Compare
rwth-phoenix-weather-2014t