Closed (AmitMY closed 2 years ago)
Thank you Amit! Will test this.
Note that the way tfds stores arrays of objects is as objects of arrays.
I don't understand that yet, could you elaborate? (In the example I can't see an "object of arrays", and I don't understand why it matters.)
It means that while in the code there is a definition:
"glosses": tfds.features.Sequence(
    {
        "start": tf.int32,
        "end": tf.int32,
        "gloss": tfds.features.Text(),
        "hand": tfds.features.Text(),
        "Lexeme_Sign": tfds.features.Text(),
        "Gebärde": tfds.features.Text(),
        "Sign": tfds.features.Text(),
    }
),
TFDS reverses the order: instead of a sequence of objects, what you get back is an object of sequences, with one array per field:
'glosses': {
    'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*', b'$INDEX1^', b'WISSEN2B^', b'NEIN3A^'], dtype=object)>,
    'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B', b'NOT3A'], dtype=object)>,
    'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^', b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>,
    'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>,
    'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1', b'GLAUBEN2B', b'NICHT3A'], dtype=object)>,
    'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>,
    'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
},
Same data, represented differently.
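If you would rather work with one object per gloss, the "object of sequences" can be turned back into a "sequence of objects". Here is a minimal sketch in plain Python; the helper name `unzip_sequence` is made up for illustration, and the values are copied from a subset of the fields in the output above. With real TFDS output you would first need to convert the tensors to NumPy arrays (for example with `tfds.as_numpy`), which this sketch skips.

```python
def unzip_sequence(columns):
    """Turn a dict of equal-length sequences into a list of per-item dicts.

    `unzip_sequence` is a hypothetical helper, not part of tfds.
    """
    keys = list(columns.keys())
    lengths = {len(columns[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError("all fields must have the same length")
    # zip(*...) walks the fields in parallel, yielding one tuple per gloss
    return [dict(zip(keys, row)) for row in zip(*(columns[key] for key in keys))]


# A subset of the fields shown above, as plain Python lists
glosses = {
    "start": [382820, 383820, 384120, 384480, 384820],
    "end": [383120, 383860, 384320, 384600, 384960],
    "gloss": [b"$GEST-NM-KOPFSCH\xc3\x9cTTELN1^", b"ICH2*", b"$INDEX1", b"GLAUBEN2B", b"NICHT3A"],
    "hand": [b"r", b"r", b"r", b"r", b"r"],
}

rows = unzip_sequence(glosses)
for row in rows:
    print(row["start"], row["end"], row["gloss"].decode("utf-8"), row["hand"].decode("utf-8"))
```

Each element of `rows` is then a single gloss with its own `start`, `end`, `gloss` and `hand`, which is the shape the feature definition suggests.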
Other than that, I think the examples Colab could be extended with a sentence-level loading example, such as:
import itertools

import tensorflow_datasets as tfds
from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

# Annotations only (no video, no pose), split at sentence level
config = DgsCorpusConfig(name="only-annotations-sentence-level", version="1.0.0", include_video=False, include_pose=None, data_type="sentence")
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))
for datum in itertools.islice(dgs_corpus["train"], 0, 5):
    print(datum)
and that perhaps, with some clever __init__.py, importing this:
from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig
could be
from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig
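To illustrate what "some clever `__init__.py`" could look like: the usual trick is a one-line re-export in the package's `__init__.py`. The sketch below builds a throwaway package called `demo_datasets` (a made-up name, as is the empty `DgsCorpusConfig` stand-in class) that mirrors the `dgs_corpus/dgs_corpus.py` layout, and checks that the short and long import paths resolve to the same class. It is not the actual repository code, just a demonstration of the mechanism.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package mirroring the layout discussed above:
#   demo_datasets/dgs_corpus/dgs_corpus.py  (defines the config class)
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_datasets" / "dgs_corpus"
package_dir.mkdir(parents=True)
(root / "demo_datasets" / "__init__.py").write_text("")
(package_dir / "dgs_corpus.py").write_text("class DgsCorpusConfig:\n    pass\n")

# The re-export: one relative import in the package's __init__.py
(package_dir / "__init__.py").write_text("from .dgs_corpus import DgsCorpusConfig\n")

# Make the throwaway package importable
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_datasets.dgs_corpus import DgsCorpusConfig as short_path
from demo_datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig as long_path

print(short_path is long_path)
```

Since the re-export only adds a name, the existing long import path keeps working, so it would not break code that already uses it.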
Note that the way tfds stores arrays of objects is as objects of arrays. No big deal, that's just how it is.