sign-language-processing / datasets

TFDS data loaders for sign language datasets.
https://sign-language-processing.github.io/#existing-datasets

feat(dgs_corpus): add sentence level loading #19

Closed AmitMY closed 2 years ago

AmitMY commented 2 years ago

Note that the way tfds stores arrays of objects is as objects of arrays. No big deal, that's just how it is:

{
'id': <tf.Tensor: shape=(), dtype=string, numpy=b'1183203'>, 

'paths': {
    'cmdi': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_cmdi_11832Uu-LBUD5Ry8Msgq1i-CX5Qqjt_ylVHwxENi1ZzzXibc.cmdi'>, 
    'eaf': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambur.de_meined_eaf_1183205qfufbL-ISImHk7v4fYT7bDsx-ZSKSXXUmhbb5mlp3s.eaf'>, 
    'ilex': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_ilex_11832JtU1YIDsu6SFkRBXU5DHJjeFWUFhkgEqfbazJGRkP7E.ilex'>,
    'srt': <tf.Tensor: shape=(), dtype=string, numpy=b'/home/nlp/amit/tensorflow_datasets/downloads/sign-lang.uni-hambu.de_meine_srt_11832_en9iAPAnPuZQv1uhtNQoV61EfCI4ozBY95ECDFElBGa-0.srt'>
}, 

'sentence': {
    'end': <tf.Tensor: shape=(), dtype=int32, numpy=385360>, 
    'english': <tf.Tensor: shape=(), dtype=string, numpy=b'Well, I would assume not.'>, 
    'german': <tf.Tensor: shape=(), dtype=string, numpy=b'Hm, ich glaube nicht.'>, 

    'glosses': {
        'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*',b'$INDEX1^',b'WISSEN2B^', b'NEIN3A^'], dtype=object)>, 
        'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B',b'NOT3A'], dtype=object)>, 
        'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^',b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>, 
        'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>,
        'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1',b'GLAUBEN2B', b'NICHT3A'], dtype=object)>,
        'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>, 
        'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
    }, 

    'id': <tf.Tensor: shape=(), dtype=string, numpy=b'a3127087'>, 
    'mouthings': {
        'end': <tf.Tensor: shape=(4,), dtype=int32, numpy=array([383120, 384320, 384600, 384960], dtype=int32)>, 
        'mouthing': <tf.Tensor: shape=(4,), dtype=string, numpy=array([b'[MG]', b'[MG]', b'glaub', b'nich{t}'], dtype=object)>, 
        'start': <tf.Tensor: shape=(4,), dtype=int32, numpy=array([382820, 383820, 384480, 384820], dtype=int32)>
    },
    'participant': <tf.Tensor: shape=(), dtype=string, numpy=b'A'>,
    'start': <tf.Tensor: shape=(), dtype=int32, numpy=382820>
}
}
bricksdont commented 2 years ago

Thank you Amit! Will test this.

> Note that the way tfds stores arrays of objects is as objects of arrays.

I don't understand that yet, could you elaborate? (in the example I can't see an "object of an array", and don't understand why it matters)

AmitMY commented 2 years ago

It means that while in the code there is a definition:

                "glosses": tfds.features.Sequence(
                    {
                        "start": tf.int32,
                        "end": tf.int32,
                        "gloss": tfds.features.Text(),
                        "hand": tfds.features.Text(),
                        "Lexeme_Sign": tfds.features.Text(),
                        "Gebärde": tfds.features.Text(),
                        "Sign": tfds.features.Text(),
                    }
                ),

TFDS reverses the nesting: instead of a sequence of objects, you get an object of sequences:

    'glosses': {
        'Gebärde': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2^*',b'$INDEX1^',b'WISSEN2B^', b'NEIN3A^'], dtype=object)>, 
        'Lexeme_Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2*', b'$INDEX1', b'TO-BELIEVE2B',b'NOT3A'], dtype=object)>, 
        'Sign': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-SHAKE-HEAD1^', b'I2^*', b'$INDEX1^',b'TO-KNOW-OR-KNOWLEDGE2B^', b'NO3A^'], dtype=object)>, 
        'end': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([383120, 383860, 384320, 384600, 384960], dtype=int32)>,
        'gloss': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'$GEST-NM-KOPFSCH\xc3\x9cTTELN1^', b'ICH2*', b'$INDEX1',b'GLAUBEN2B', b'NICHT3A'], dtype=object)>,
        'hand': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'r', b'r', b'r', b'r', b'r'], dtype=object)>, 
        'start': <tf.Tensor: shape=(5,), dtype=int32, numpy=array([382820, 383820, 384120, 384480, 384820], dtype=int32)>
    }, 

Same data, represented differently.
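If you need the per-gloss view back, the two layouts convert into each other with a plain zip. A minimal sketch, assuming the tensors have already been turned into ordinary Python lists (for example via `.numpy()` and `.decode()`); the helper name `to_sequence_of_objects` is made up for illustration and is not part of TFDS or this repo:

```python
# Hypothetical stand-in for one decoded 'glosses' feature: plain Python lists
# instead of tf.Tensor values. Values are copied from the example above.
glosses = {
    "start": [382820, 383820, 384120, 384480, 384820],
    "end": [383120, 383860, 384320, 384600, 384960],
    "gloss": ["$GEST-NM-KOPFSCHÜTTELN1^", "ICH2*", "$INDEX1", "GLAUBEN2B", "NICHT3A"],
    "hand": ["r", "r", "r", "r", "r"],
}


def to_sequence_of_objects(columns):
    """Turn a dict of equal-length lists into a list of per-item dicts."""
    keys = list(columns)
    rows = zip(*(columns[key] for key in keys))
    return [dict(zip(keys, row)) for row in rows]


gloss_items = to_sequence_of_objects(glosses)
print(gloss_items[3])
```

Each element of `gloss_items` is then one gloss with its own `start`, `end`, `gloss` and `hand`, which is the shape the `Sequence({...})` definition suggests at first glance.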

bricksdont commented 2 years ago

Other than that, I think the examples Colab could be extended with a sentence-level loading example, such as

import itertools

import tensorflow_datasets as tfds
from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

# Annotations only (no video, no pose), loaded per sentence
config = DgsCorpusConfig(name="only-annotations-sentence-level", version="1.0.0", include_video=False, include_pose=None, data_type="sentence")
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))

# Print the first five sentence-level examples
for datum in itertools.islice(dgs_corpus["train"], 0, 5):
  print(datum)

and that, perhaps with some clever __init__.py importing, this:

from sign_language_datasets.datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig

could be

from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig
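For reference, the shorter import usually only needs a one-line re-export in the subpackage's `__init__.py`. A self-contained sketch of that idea, using a throwaway package built in a temp directory; `demo_datasets` is a hypothetical name and the class body is a placeholder, so this only mirrors the layout and is not the repo's actual code:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package that mirrors the layout discussed above.
# "demo_datasets" is a hypothetical name, not the real package.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_datasets" / "dgs_corpus"
pkg.mkdir(parents=True)
(root / "demo_datasets" / "__init__.py").write_text("")
(pkg / "dgs_corpus.py").write_text("class DgsCorpusConfig:\n    pass\n")

# The one-line re-export that enables the shorter import path.
(pkg / "__init__.py").write_text("from .dgs_corpus import DgsCorpusConfig\n")

# Make the temporary package importable.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_datasets.dgs_corpus import DgsCorpusConfig as short_path
from demo_datasets.dgs_corpus.dgs_corpus import DgsCorpusConfig as long_path

print(short_path is long_path)  # both names refer to the same class
```

The long import path keeps working, so existing code would not break if the re-export were added.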