sign-language-processing / sign-language-processing.github.io

Documentation and background of sign language processing
Update NCSLGR #79

Open cleong110 opened 3 months ago

cleong110 commented 3 months ago

NCSLGR is used by SignBLEU (see https://github.com/cleong110/sign-language-processing.github.io/issues/21). They say:

We use the ELAN version of Boston University’s The National Center for Sign Language and Gesture Resources corpus (NCSLGR) (Neidle and Sclaroff, 2012)

Carol Neidle and Stan Sclaroff. 2012. National Center for Sign Language and Gesture Resources (NCSLGR) corpus. Boston University. ISLRN, American Sign Language Linguistic Research Project (ASLLRP), ISLRN 833-505-711564-4.

Which links to https://www.islrn.org/resources/833-505-711-564-4/, which links to https://www.bu.edu/asllrp/ncslgr.html as the source.
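Side note: the ISLRN in the quoted reference is hyphenated differently from the one in the islrn.org URL. A minimal sketch for checking that the two are the same identifier, assuming an ISLRN is 13 digits grouped 3-3-3-3-1 (inferred from the URL above; `normalize_islrn` is a hypothetical helper, not part of any ISLRN tooling):

```python
import re


def normalize_islrn(raw: str) -> str:
    """Regroup the digits of an ISLRN as XXX-XXX-XXX-XXX-X.

    Assumption: an ISLRN is exactly 13 digits, as in the islrn.org URL
    quoted in this thread. Anything that is not a digit is ignored.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 13:
        raise ValueError(f"expected 13 digits, got {len(digits)}: {raw!r}")
    groups = [digits[0:3], digits[3:6], digits[6:9], digits[9:12], digits[12:]]
    return "-".join(groups)


# Form printed in the quoted reference vs. form used in the resource URL.
paper_form = normalize_islrn("ISLRN 833-505-711564-4")
url_form = normalize_islrn("833-505-711-564-4")
same_resource = paper_form == url_form

# URL pattern copied from the link quoted above.
resource_url = f"https://www.islrn.org/resources/{paper_form}/"
```

Both spellings normalize to the same string, so the reference and the URL point at the same resource.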

Currently we have an entry for NCSLGR; it cites dataset:databases2007volumes, aka

Databases, NCSLGR. 2007. “Volumes 2–7.” American Sign Language Linguistic Research Project (Distributed on CD-ROM ….

and it's got some TODOs.

https://www.bu.edu/asllrp/ncslgr-for-download/download-info.html

cleong110 commented 3 months ago

http://asl.cs.depaul.edu/corpus/index.html actually might be the "ELAN Version" they mention in SignBLEU

cleong110 commented 3 months ago

Aha!

https://www.bu.edu/asllrp/data-credits.html

cleong110 commented 3 months ago

But I still don't know the precise citation for the Corpus itself? It says cite the corpus AND this publication. ???

cleong110 commented 3 months ago

https://www.bu.edu/asllrp/publications.html doesn't have a paper called "The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus".

cleong110 commented 3 months ago

I think I'll just... cite this:

@inproceedings{Vogler2012ANW,
  title={A new web interface to facilitate access to corpora: development of the ASLLRP data access interface},
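  author={Christian Vogler and Carol Neidle},
  booktitle={Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC 2012},
  address={Istanbul, Turkey},
  year={2012},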
  url={https://api.semanticscholar.org/CorpusID:58305327}
}

cleong110 commented 3 months ago

And maybe add a custom citation like this:

@misc{dataset:Neidle_2020_NCSLGR_ISLRN,
  type = {Languageresource},
  title = {National Center for Sign Language and Gesture Resources (NCSLGR) corpus. ISLRN 833-505-711-564-4},
  author = {Carol Neidle and Stan Sclaroff},
  year = {2012},
  publisher = {Boston University},
  url = {https://www.islrn.org/resources/833-505-711-564-4/}
}

cleong110 commented 3 months ago

Previously the JSON pointed to

databases2007volumes

cleong110 commented 3 months ago

In index.md that is cited only here:

###### Continuous sign corpora {-}
contain parallel sequences of signs and spoken language.
Available continuous sign corpora are extremely limited, containing 4-6 orders of magnitude fewer sentence pairs than similar corpora for spoken language machine translation [@arivazhagan2019massively].
Moreover, while automatic speech recognition (ASR) datasets contain up to 50,000 hours of recordings [@pratap2020mls], the most extensive continuous sign language corpus contains only 1,150 hours, and only 50 of them are publicly available [@dataset:hanke-etal-2020-extending].
These datasets are usually synthesized [@dataset:databases2007volumes;@dataset:Crasborn2008TheCN;@dataset:ko2019neural;@dataset:hanke-etal-2020-extending] or recorded in studio conditions [@dataset:forster2014extensions;@cihan2018neural], which does not account for noise in real-life conditions. Moreover, some contain signed interpretations of spoken language rather than naturally-produced signs, which may not accurately represent native signing since translation is now a part of the discourse event.

cleong110 commented 3 months ago

As for JSON updates: going off of https://www.bu.edu/asllrp/ncslgr-for-download/download-info.html, it seems there is:

Also

    Most of these data are from four native signers of ASL.

    This dataset includes 1,866 distinct canonical signs (i.e., grouping together very slight variants in production). The total number of sign tokens is 11,854.

    Restricting consideration to signs other than gestures and classifiers, there are 1,278 distinct canonical signs, and a total of 10,719 tokens.

    1,002 of the utterances in this collection are part of short spontaneous narratives (19). The remaining 885 utterances were elicited to illustrate a variety of constructions and sentence types.
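
For filling in the JSON, a few totals can be derived from the figures quoted above. This is only arithmetic on those numbers; the variable names are my own, and the total assumes the narrative and elicited utterances are disjoint and together make up the whole collection, which is how I read "the remaining":

```python
# Figures copied from the download-info page quoted above.
distinct_signs_all = 1866       # distinct canonical signs, everything included
tokens_all = 11854              # sign tokens, everything included
distinct_signs_lexical = 1278   # excluding gestures and classifiers
tokens_lexical = 10719          # excluding gestures and classifiers
narrative_utterances = 1002     # from short spontaneous narratives
elicited_utterances = 885       # elicited constructions and sentence types

# Total utterances, assuming the two groups do not overlap.
total_utterances = narrative_utterances + elicited_utterances

# What the "other than gestures and classifiers" restriction leaves out.
gesture_classifier_signs = distinct_signs_all - distinct_signs_lexical
gesture_classifier_tokens = tokens_all - tokens_lexical

# Average number of tokens per distinct canonical sign.
tokens_per_sign_all = tokens_all / distinct_signs_all
tokens_per_sign_lexical = tokens_lexical / distinct_signs_lexical

print(f"utterances: {total_utterances}")
print(f"gesture/classifier signs: {gesture_classifier_signs}")
print(f"gesture/classifier tokens: {gesture_classifier_tokens}")
print(f"tokens per sign (all): {tokens_per_sign_all:.2f}")
print(f"tokens per sign (lexical): {tokens_per_sign_lexical:.2f}")
```

That gives 1,887 utterances in total, with 588 distinct signs and 1,135 tokens falling under gestures and classifiers.
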
cleong110 commented 3 months ago

Licensing is the big one: https://www.bu.edu/asllrp/data-credits.html

The data available from these pages can be used for research and education purposes, but cannot be redistributed without permission.

Commercial use, without explicit permission, is not allowed, nor are any patents and copyrights based on this material.

Those making use of these data must, in resulting publications or presentations, cite: The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus and this publication:

    Carol Neidle and Christian Vogler [2012] "A New Web Interface to Facilitate Access to Corpora: Development of the ASLLRP Data Access Interface," Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC 2012, Istanbul, Turkey.

and also include the following URL's: http://www.bu.edu/asllrp// and http://secrets.rutgers.edu/dai/queryPages/.
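
A rough draft of how the licensing and citation requirements could be recorded for the JSON update. The key names here are placeholders I made up for illustration, not the schema this repo actually uses, so they would need to be mapped onto the real fields:

```python
import json

# Hypothetical draft entry: every key name below is a placeholder,
# not the dataset schema actually used by this repo.
ncslgr_entry = {
    "name": "NCSLGR",
    "language": "ASL",
    "license_summary": (
        "Research and education use only; no redistribution or "
        "commercial use without permission"
    ),
    "license_url": "https://www.bu.edu/asllrp/data-credits.html",
    # Items the data-credits page asks users to cite.
    "must_cite": [
        "The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus",
        "Neidle and Vogler (2012), A New Web Interface to Facilitate Access to Corpora",
    ],
    # URLs the data-credits page asks users to include, copied as written there.
    "must_link": [
        "http://www.bu.edu/asllrp//",
        "http://secrets.rutgers.edu/dai/queryPages/",
    ],
}

draft_json = json.dumps(ncslgr_entry, indent=2)
print(draft_json)
```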