sign-language-processing / sign-language-processing.github.io

Documentation and background of sign language processing

Update "Corpus NGT": Features and broken link #58

Closed cleong110 closed 3 weeks ago

cleong110 commented 3 weeks ago


TODO:

Related:

cleong110 commented 3 weeks ago

One issue: it seems the NGT Corpus may be superseded by a newer dataset?

https://signbank.cls.ru.nl/ says:

(screenshot of the notice on signbank.cls.ru.nl)

cleong110 commented 3 weeks ago

Second issue: which link?

The original NGT Corpus is still available at:

cleong110 commented 3 weeks ago

Third issue: there's RGB video, and it has multiple views/angles of multiple signers. The fact that these are conversations between two signers seems relevant. Is there a way to capture all of this, or do we just say video:RGB?

cleong110 commented 3 weeks ago

I wasn't sure what the gloss vocabulary size was, so I just counted it. It turns out to be 3185.

cleong110 commented 3 weeks ago

Here's the dump of gloss counts:

ngt_gloss_counts.json
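
For reference, a rough sketch of how a count like this could be produced, assuming the Corpus NGT ELAN (.eaf) annotation files are available locally and that the gloss tiers are the ones whose names start with "Gloss" (the directory name and the tier-name filter below are assumptions, not the exact script used):

```python
# Sketch: tally gloss occurrences across the Corpus NGT ELAN files and dump
# them to JSON. The directory name and the "Gloss*" tier filter are assumptions.
import json
from collections import Counter
from pathlib import Path

import pympi  # pympi-ling, for reading ELAN (.eaf) files

counts = Counter()
for eaf_path in Path("CNGT_annotations").glob("*.eaf"):  # hypothetical path
    eaf = pympi.Elan.Eaf(str(eaf_path))
    for tier in eaf.get_tier_names():
        if not tier.startswith("Gloss"):
            continue
        for ann in eaf.get_annotation_data_for_tier(tier):
            gloss = ann[2].strip()  # annotations are (start_ms, end_ms, value, ...)
            if gloss:
                counts[gloss] += 1

print(f"{len(counts)} distinct glosses")  # the count reported above: 3185
with open("ngt_gloss_counts.json", "w", encoding="utf-8") as f:
    json.dump(dict(counts.most_common()), f, ensure_ascii=False, indent=2)
```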

cleong110 commented 3 weeks ago

Here they are sorted: ngt_gloss_counts_sorted.csv. Looks like only about 800 glosses have 10 or more examples, about 2300 have more than one, and the remaining ~800 are one-offs.
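
A quick sketch of that frequency breakdown, reading the ngt_gloss_counts.json dump attached above (the thresholds match the rough numbers quoted in this comment):

```python
# Sketch: frequency breakdown of the gloss counts dumped earlier in this thread.
import json

with open("ngt_gloss_counts.json", encoding="utf-8") as f:
    counts = json.load(f)

print("total glosses:  ", len(counts))                                 # ~3185
print(">= 10 examples: ", sum(1 for c in counts.values() if c >= 10))  # ~800
print("> 1 example:    ", sum(1 for c in counts.values() if c > 1))    # ~2300
print("one-offs:       ", sum(1 for c in counts.values() if c == 1))   # ~800
```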

cleong110 commented 3 weeks ago

Regardless, "3185" is the total gloss count I suppose.

cleong110 commented 3 weeks ago


As for the number of conversations: at the moment the official website lists 2278, not 2375, and the dataloader lists 2280.
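
For the dataloader figure, a minimal sketch of how that count could be reproduced, assuming the ngt_corpus loader registered by the sign-language-datasets package (the dataset name and split layout here are assumptions):

```python
# Sketch: count the examples exposed by the dataloader, to compare against the
# session count on the official website. The dataset name "ngt_corpus" is an assumption.
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (registers the SLP datasets with TFDS)

ngt = tfds.load("ngt_corpus")
total = 0
for split_name, split in ngt.items():
    n = sum(1 for _ in split)  # iterate once per split to count examples
    total += n
    print(split_name, n)
print("total examples:", total)
```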

cleong110 commented 3 weeks ago

Official citation/PDF: https://www.semanticscholar.org/paper/The-Corpus-NGT%3A-An-online-corpus-for-professionals/1a9e263920532e96f956c11aa70605e4488c9c6e

cleong110 commented 3 weeks ago

Not sure where "15 hours" comes from; the official citation says 12 hours at the time.

Similarly, the number of signers is all over the place depending on the source.

cleong110 commented 3 weeks ago

Gonna go with this for "#samples":

"#samples": "~2375 multi-cam, multi-signer sessions",

which is within the length limits and shorter than other entries in the list of datasets.