Open cleong110 opened 3 months ago
http://asl.cs.depaul.edu/corpus/index.html actually might be the "ELAN Version" they mention in SignBLEU
But I still don't know the precise citation for the corpus itself. The credits page says to cite the corpus AND this publication, but which citation goes with the corpus?
https://www.bu.edu/asllrp/publications.html doesn't have a paper called "The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus".
I think I'll just... cite this:
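@inproceedings{Vogler2012ANW,
title={A New Web Interface to Facilitate Access to Corpora: Development of the ASLLRP Data Access Interface},
author={Carol Neidle and Christian Vogler},
booktitle={Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC 2012},
year={2012},
url={https://api.semanticscholar.org/CorpusID:58305327}
}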
And maybe add a custom citation like this:
@misc{dataset:Neidle_2020_NCSLGR_ISLRN,
type = {Languageresource},
title = {National Center for Sign Language and Gesture Resources (NCSLGR) corpus. ISLRN 833-505-711-564-4},
author = {Carol Neidle and Stan Sclaroff},
year = {2012},
publisher = {Boston University},
url = {https://www.islrn.org/resources/833-505-711-564-4/}
}
Previously the JSON pointed to
databases2007volumes
In index.md that is cited only here:
###### Continuous sign corpora {-}
contain parallel sequences of signs and spoken language.
Available continuous sign corpora are extremely limited, containing 4-6 orders of magnitude fewer sentence pairs than similar corpora for spoken language machine translation [@arivazhagan2019massively].
Moreover, while automatic speech recognition (ASR) datasets contain up to 50,000 hours of recordings [@pratap2020mls], the most extensive continuous sign language corpus contains only 1,150 hours, and only 50 of them are publicly available [@dataset:hanke-etal-2020-extending].
These datasets are usually synthesized [@dataset:databases2007volumes;@dataset:Crasborn2008TheCN;@dataset:ko2019neural;@dataset:hanke-etal-2020-extending] or recorded in studio conditions [@dataset:forster2014extensions;@cihan2018neural], which does not account for noise in real-life conditions. Moreover, some contain signed interpretations of spoken language rather than naturally-produced signs, which may not accurately represent native signing since translation is now a part of the discourse event.
As for JSON updates: going off of https://www.bu.edu/asllrp/ncslgr-for-download/download-info.html, it seems there is:
Also
Most of these data are from four native signers of ASL.
This dataset includes 1,866 distinct canonical signs (i.e., grouping together very slight variants in production). The total number of sign tokens is 11,854.
Restricting consideration to signs other than gestures and classifiers, there are 1,278 distinct canonical signs, and a total of 10,719 tokens.
1,002 of the utterances in this collection are part of short spontaneous narratives (19). The remaining 885 utterances were elicited to illustrate a variety of constructions and sentence types.
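As a rough aid for the JSON update, the figures quoted above can be collected into a small summary. This is only a sketch: the key names (`main_signers`, `utterances`, `tokens_per_sign`, and so on) are hypothetical and are not the actual schema of the dataset JSON files, so they would need to be mapped onto whatever fields the repo really uses. The only derived values are the utterance total (1,002 + 885) and the token-to-sign ratios.

```python
import json

# Figures copied from the ASLLRP download-info page quoted above.
# All key names here are hypothetical, chosen only for illustration.
quoted = {
    "main_signers": 4,            # "most of these data are from four native signers"
    "canonical_signs": 1866,      # all signs, including gestures and classifiers
    "sign_tokens": 11854,
    "canonical_signs_lexical": 1278,  # excluding gestures and classifiers
    "sign_tokens_lexical": 10719,
    "narrative_utterances": 1002,
    "elicited_utterances": 885,
}


def summarize(stats):
    """Derive totals and ratios from the quoted counts."""
    return {
        "main_signers": stats["main_signers"],
        # Narrative and elicited utterances together make up the collection.
        "utterances": stats["narrative_utterances"] + stats["elicited_utterances"],
        "vocabulary": stats["canonical_signs"],
        "tokens": stats["sign_tokens"],
        # Average number of tokens per distinct canonical sign.
        "tokens_per_sign": round(stats["sign_tokens"] / stats["canonical_signs"], 2),
        "tokens_per_sign_lexical": round(
            stats["sign_tokens_lexical"] / stats["canonical_signs_lexical"], 2
        ),
    }


summary = summarize(quoted)
print(json.dumps(summary, indent=2))
```

The utterance total comes out to 1,887, which could go into a samples-style field if the JSON has one.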
Licensing is the big one: https://www.bu.edu/asllrp/data-credits.html
The data available from these pages can be used for research and education purposes, but cannot be redistributed without permission.
Commercial use, without explicit permission, is not allowed, nor are any patents and copyrights based on this material.
Those making use of these data must, in resulting publications or presentations, cite: The National Center for Sign Language and Gesture Resources (NCSLGR) Corpus and this publication:
Carol Neidle and Christian Vogler [2012] "A New Web Interface to Facilitate Access to Corpora: Development of the ASLLRP Data Access Interface," Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC 2012, Istanbul, Turkey.
and also include the following URL's: http://www.bu.edu/asllrp// and http://secrets.rutgers.edu/dai/queryPages/.
https://github.com/cleong110/sign-language-processing.github.io/issues/21 used by SignBLEU. They say
Carol Neidle and Stan Sclaroff. 2012. National Center for Sign Language and Gesture Resources (NCSLGR) corpus. Boston University. ISLRN, American Sign Language Linguistic Research Project (ASLLRP), ISLRN 833-505-711-564-4.
Which links to https://www.islrn.org/resources/833-505-711-564-4/, which links to https://www.bu.edu/asllrp/ncslgr.html as the source.
Currently we have an entry for NCSLGR. It points to dataset:databases2007volumes, and it has some TODOs.
https://www.bu.edu/asllrp/ncslgr-for-download/download-info.html