sign-language-processing / sign-language-processing.github.io

Documentation and background of sign language processing

Update "Corpus NGT": Features and broken link #58

Closed cleong110 closed 3 weeks ago

cleong110 commented 3 weeks ago


TODO:

Related:

cleong110 commented 3 weeks ago

One issue: it seems the NGT Corpus may be superseded by a newer dataset?

https://signbank.cls.ru.nl/ says:

(screenshot of the notice on signbank.cls.ru.nl)

cleong110 commented 3 weeks ago

Second issue: which link?

The original NGT Corpus is still available at:

cleong110 commented 3 weeks ago

Third issue: there's RGB video, and it has multiple views/angles of multiple signers. The fact that these are conversations between two signers seems relevant. Is there a way to capture all of this, or do we just say video:RGB?

cleong110 commented 3 weeks ago

I wasn't sure what the gloss vocabulary size was, so I just counted it. It turns out to be 3185.

cleong110 commented 3 weeks ago

Here's the dump of gloss counts:

ngt_gloss_counts.json
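
For reference, a rough sketch of how a count like this could be produced, assuming the Corpus NGT ELAN (.eaf) annotation files are available locally and that the gloss tiers are the ones whose names start with "Gloss" (the directory name and the tier-name filter below are assumptions, not the exact script used):

```python
# Sketch: tally gloss occurrences across the Corpus NGT ELAN files and dump
# them to JSON. The directory name and the "Gloss*" tier filter are assumptions.
import json
from collections import Counter
from pathlib import Path

import pympi  # pympi-ling, for reading ELAN (.eaf) files

counts = Counter()
for eaf_path in Path("CNGT_annotations").glob("*.eaf"):  # hypothetical path
    eaf = pympi.Elan.Eaf(str(eaf_path))
    for tier in eaf.get_tier_names():
        if not tier.startswith("Gloss"):
            continue
        for ann in eaf.get_annotation_data_for_tier(tier):
            gloss = ann[2].strip()  # annotations are (start_ms, end_ms, value, ...)
            if gloss:
                counts[gloss] += 1

print(f"{len(counts)} distinct glosses")  # the count reported above: 3185
with open("ngt_gloss_counts.json", "w", encoding="utf-8") as f:
    json.dump(dict(counts.most_common()), f, ensure_ascii=False, indent=2)
```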

cleong110 commented 3 weeks ago

Here they are sorted: ngt_gloss_counts_sorted.csv. Looks like only about 800 glosses have 10 or more examples, about 2300 have more than one, and the remaining ~800 are one-offs.
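
A quick sketch of that frequency breakdown, reading the ngt_gloss_counts.json dump attached above (the thresholds match the rough numbers quoted in this comment):

```python
# Sketch: frequency breakdown of the gloss counts dumped earlier in this thread.
import json

with open("ngt_gloss_counts.json", encoding="utf-8") as f:
    counts = json.load(f)

print("total glosses:  ", len(counts))                                 # ~3185
print(">= 10 examples: ", sum(1 for c in counts.values() if c >= 10))  # ~800
print("> 1 example:    ", sum(1 for c in counts.values() if c > 1))    # ~2300
print("one-offs:       ", sum(1 for c in counts.values() if c == 1))   # ~800
```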

cleong110 commented 3 weeks ago

Regardless, "3185" is the total gloss count I suppose.

cleong110 commented 3 weeks ago


As for the number of conversations: at the moment the official website lists 2278, not 2375, and the dataloader lists 2280.
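
For the dataloader figure, a minimal sketch of how that count could be reproduced, assuming the ngt_corpus loader registered by the sign-language-datasets package (the dataset name and split layout here are assumptions):

```python
# Sketch: count the examples exposed by the dataloader, to compare against the
# session count on the official website. The dataset name "ngt_corpus" is an assumption.
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (registers the SLP datasets with TFDS)

ngt = tfds.load("ngt_corpus")
total = 0
for split_name, split in ngt.items():
    n = sum(1 for _ in split)  # iterate once per split to count examples
    total += n
    print(split_name, n)
print("total examples:", total)
```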

cleong110 commented 3 weeks ago

Official citation/PDF: https://www.semanticscholar.org/paper/The-Corpus-NGT%3A-An-online-corpus-for-professionals/1a9e263920532e96f956c11aa70605e4488c9c6e

cleong110 commented 3 weeks ago

Not sure where "15 hours" comes from; the official citation says 12 hours at the time.

Similarly, the number of signers is all over the place depending on the source.

cleong110 commented 3 weeks ago

Gonna go with this for "#samples":

"#samples": "~2375 multi-cam, multi-signer sessions",

which is within the length limits and shorter than other entries in the list of datasets.