cleong110 closed this 3 weeks ago
One issue: it seems the NGT Corpus may be superseded by a newer dataset?
Second issue: which link?
The original NGT Corpus is still available at:
Third issue: there's RGB video... and it's got multiple views/angles of multiple speakers. The fact that it's conversations between two speakers seems relevant. Is there a way to capture all this? Or do we just say `video:RGB`?
I wasn't sure what the vocabulary of glosses was, so I just... counted it. Turns out it's 3185.
Here's the dump of gloss counts:
ngt_gloss_counts_sorted.csv
Here they are sorted. Looks like only about 800 glosses have 10 or more examples, and about 2300 have more than 1. The rest, about 800ish, are one-offs.
Regardless, "3185" is the vocabulary size (the number of distinct glosses), I suppose.
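For anyone who wants to redo the count, here is a rough sketch of the kind of tally described above. It is not the script that produced the CSV; the function name `summarize_gloss_counts` and the toy gloss list are made up for illustration, and it assumes the glosses have already been pulled out of the annotation files into a flat list of strings.

```python
from collections import Counter


def summarize_gloss_counts(glosses):
    """Tally distinct glosses and bucket them by how often they occur.

    `glosses` is any iterable of gloss strings (hypothetical input format;
    extracting them from the corpus annotations is not shown here).
    """
    counts = Counter(glosses)
    return {
        # Number of distinct glosses, i.e. the vocabulary size
        "vocabulary_size": len(counts),
        # Glosses with 10 or more examples
        "at_least_10": sum(1 for c in counts.values() if c >= 10),
        # Glosses that occur more than once
        "more_than_1": sum(1 for c in counts.values() if c > 1),
        # One-offs
        "singletons": sum(1 for c in counts.values() if c == 1),
    }


# Toy input for illustration only, not real NGT data
toy_glosses = ["HOUSE"] * 12 + ["GO"] * 3 + ["PT-1"]
summary = summarize_gloss_counts(toy_glosses)
print(summary)
```

On the real corpus, `vocabulary_size` would be the figure to report as the vocabulary, and the other buckets correspond to the rough breakdown above.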
As for number of conversations, at the moment the official website lists 2278, not 2375. And then the dataloader lists 2280.
Not sure where "15 hours" comes from; the official citation said 12 at the time.
Similarly, the number of signers is all over the place depending on source.
Gonna go with this for "samples":
"#samples": "~2375 multi-cam, multi-signer sessions",
which is within the length limits, shorter than other entries in the list of datasets
TODO:
Related:
45