PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English
Creative Commons Attribution Share Alike 4.0 International

PASTRIE

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.


Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s: English, French, German, and Spanish.

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.
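
The SceneRole↝Function notation can also be handled programmatically. The following is a minimal sketch and not part of the PASTRIE release: the names `Construal`, `parse_construal`, and `format_construal` are hypothetical, the labels used are illustrative rather than taken from a corpus sentence, and treating a label without an arrow as "scene role equals function" is an assumption.

```python
from typing import NamedTuple

ARROW = "\u219d"  # the "↝" separator in the SceneRole↝Function notation


class Construal(NamedTuple):
    """A scene role paired with a function (hypothetical helper type)."""
    scene_role: str
    function: str


def parse_construal(label: str) -> Construal:
    """Split a 'SceneRole↝Function' label into its two parts.

    Assumption: a label with no arrow means the scene role and the
    function are the same supersense.
    """
    scene_role, sep, function = label.partition(ARROW)
    if not sep:
        function = scene_role
    return Construal(scene_role.strip(), function.strip())


def format_construal(construal: Construal) -> str:
    """Render a construal back into the arrow notation."""
    return f"{construal.scene_role}{ARROW}{construal.function}"
```

For instance, `parse_construal("Recipient↝Goal")` yields a scene role of `Recipient` and a function of `Goal`, and `format_construal` turns that pair back into the original string.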


Data Formats

PASTRIE is released in the following formats: pastrie.conllulex, pastrie.json, and pastrie.govobj.json. We expect that most projects will be best served by one of the JSON formats.
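
As a rough illustration of working with a JSON release, the sketch below counts preposition construals. It assumes a STREUSLE-style layout that this README does not spell out: a top-level list of sentences, each with `swes` and `smwes` dictionaries whose entries carry `ss` (scene role) and `ss2` (function) fields, with preposition supersenses prefixed by `p.`. The function names and the sample data are made up for the example; check the field names against the actual file before relying on them.

```python
import json
from collections import Counter


def load_corpus(path):
    """Load a JSON release; assumes the file is a list of sentence objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def construal_counts(sentences):
    """Count (scene role, function) pairs for preposition supersenses.

    Assumed layout: each sentence has "swes" and "smwes" dicts whose
    values hold "ss" and "ss2"; preposition labels start with "p.".
    """
    counts = Counter()
    for sent in sentences:
        for field in ("swes", "smwes"):
            for expr in sent.get(field, {}).values():
                ss = expr.get("ss")
                if ss and ss.startswith("p."):
                    # Fall back to the scene role if no function is given.
                    ss2 = expr.get("ss2") or ss
                    counts[(ss, ss2)] += 1
    return counts


# Tiny hand-made sample in the assumed layout (not real corpus data).
SAMPLE = [
    {
        "sent_id": "example-1",
        "swes": {
            "1": {"lexlemma": "give", "ss": None, "ss2": None},
            "3": {"lexlemma": "to", "ss": "p.Recipient", "ss2": "p.Goal"},
        },
        "smwes": {},
    },
    {
        "sent_id": "example-2",
        "swes": {"2": {"lexlemma": "in", "ss": "p.Locus", "ss2": "p.Locus"}},
        "smwes": {"1": {"lexlemma": "in front of", "ss": "p.Locus", "ss2": "p.Locus"}},
    },
]
```

With a real file, `construal_counts(load_corpus("pastrie.json")).most_common(10)` would list the ten most frequent construals, provided the layout assumption holds.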

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

Development

pastrie.conllulex is the source of truth, and the two .json files are derived from it using the conllulex tools. Usage:

conllulex2json -c pastrie pastrie.conllulex pastrie.json
conllulex-govobj --no-edeps pastrie.json pastrie.govobj.json

(Note that the current pastrie.conllulex was created from an earlier version that lacked some information, such as LEXCAT. Cf. this commit.)
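
The two derivation commands above can be wrapped in a small script so that both JSON files are always regenerated together. This is only a sketch: `build_commands` and `regenerate` are hypothetical names, the script assumes the conllulex tools are installed and on the PATH, and it uses only the flags shown in the usage lines.

```python
import shutil
import subprocess


def build_commands(source="pastrie.conllulex"):
    """Return the two argv lists that derive the JSON files from `source`."""
    suffix = ".conllulex"
    stem = source[: -len(suffix)] if source.endswith(suffix) else source
    plain = f"{stem}.json"
    govobj = f"{stem}.govobj.json"
    return [
        ["conllulex2json", "-c", "pastrie", source, plain],
        ["conllulex-govobj", "--no-edeps", plain, govobj],
    ]


def regenerate(source="pastrie.conllulex"):
    """Run both commands in order, stopping at the first failure."""
    for cmd in build_commands(source):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling `regenerate()` from the repository root would rebuild pastrie.json first and then pastrie.govobj.json from it, mirroring the order of the commands above.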