cgel

This repo contains CGELBank, a human-annotated treebank of English using the syntactic formalism of the Cambridge Grammar of the English Language (CGEL). The treebank is described in Reynolds et al. (2023), published at the Linguistic Annotation Workshop (LAW).

This work is licensed under a Creative Commons Attribution 4.0 International License.

Datasets

We annotated data from Twitter and the English Web Treebank (EWT).

To load the CGEL trees for scripting, use the cgel.py library.

Summary information is available in:

Gold Data

Corresponding .conllu files are also available alongside the datasets/*.cgel and datasets/trial/*.cgel files. EWT .conllu files are gold trees; other .conllu files are manual corrections of Stanza output.
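For scripting against the .conllu files, a minimal reader can be sketched in plain Python. This is an illustration based on the general CoNLL-U format (ten tab-separated columns per token, `#` comment lines, a blank line between sentences), not code from this repo. The names `read_conllu`, `CONLLU_FIELDS`, and `SAMPLE` are assumptions introduced here, and the sample sentence is made up rather than taken from the datasets.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (e.g. "# text = ...") are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Made-up two-token sentence, used only to exercise the reader.
SAMPLE = (
    "# text = It works\n"
    "1\tIt\tit\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tworks\twork\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["id"], token["form"], token["head"], token["deprel"])
```

Because `read_conllu` accepts any iterable of lines, an open file handle can be passed in directly in place of `SAMPLE.splitlines()`.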

All data was revised with the aid of consistency-checking scripts.

Other subdirectories contain older/silver versions of the trees.

Interannotator Data

Under datasets/iaa/:

Structure

Folders

Tests

To run tests locally:

$ python -m pytest

This validates the trees and tests the distance metrics (Levenshtein distance and tree edit distance, TED).
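For readers unfamiliar with the first metric, Levenshtein distance is the minimum number of insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is a generic textbook dynamic-programming version, not the implementation tested in this repo. The function name `levenshtein` and the example label sequences are assumptions chosen for illustration.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (strings or lists of labels)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


# Works on strings and on lists of hypothetical category labels alike.
print(levenshtein("kitten", "sitting"))
print(levenshtein(["NP", "VP"], ["NP", "VP", "PP"]))
```

Keeping only two rows of the table keeps memory linear in the length of the second sequence, which is enough when only the distance, not the alignment, is needed.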

History

Resources

Overview of the project:

Brett Reynolds, Aryaman Arora, and Nathan Schneider (2023). Unified Syntactic Annotation of English in the CGEL Framework. Proc. of the 17th Linguistic Annotation Workshop (LAW-XVII), Toronto, Canada.

@inproceedings{cgelbank-law,
    address = {Toronto, Canada},
    title = {Unified Syntactic Annotation of {E}nglish in the {CGEL} Framework},
    author = {Reynolds, Brett and Arora, Aryaman and Schneider, Nathan},
    year = {2023},
    month = jul,
    url = {https://people.cs.georgetown.edu/nschneid/p/cgeltrees.pdf},
    booktitle = {Proc. of the 17th Linguistic Annotation Workshop (LAW-XVII)}
}

Annotation manual:

Brett Reynolds, Nathan Schneider, and Aryaman Arora (2023). CGELBank Annotation Manual v1.0. arXiv.

Further analysis:

Brett Reynolds, Aryaman Arora, and Nathan Schneider (2022). CGELBank: CGEL as a Framework for English Syntax Annotation. arXiv.

Aryaman Arora, Nathan Schneider, and Brett Reynolds (2022). A CGEL-formalism English treebank. MASC-SLL (poster), Philadelphia, USA.

Source data:

Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Chris Manning (2014). A Gold Standard Dependency Corpus for English. Proc. of the Ninth International Conference on Language Resources and Evaluation (LREC '14).

Ann Bies, Justin Mott, Colin Warner, and Seth Kulick (2012). English Web Treebank. LDC.