Official release of the corpus described in the paper:
Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.
PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.
While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:
For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).
Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.
PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.
.conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
.json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
.govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:
SpaceAfter=No to indicate alignment between the tokens and the raw text.
V.

pastrie.conllulex is the source of truth, and the two .json files are derived from it using the conllulex tools. Usage:
conllulex2json -c pastrie pastrie.conllulex pastrie.json
conllulex-govobj --no-edeps pastrie.json pastrie.govobj.json
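Once pastrie.json has been generated, tallying construals takes only a few lines of Python. The sketch below is illustrative and not part of the release. It assumes the JSON follows the STREUSLE layout: a list of sentence objects whose "swes" and "smwes" entries map IDs to lexical expressions with "lexlemma", "ss" (scene role), and "ss2" (function) fields, and supersense labels prefixed with "p.". Those field names, the helpers load_sentences and count_construals, and the inline sample sentence are assumptions made for the demonstration, so check them against the actual file.

```python
import json
from collections import Counter


def load_sentences(path):
    """Read a JSON file that is assumed to hold a list of sentence objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_construals(sentences):
    """Count (scene role, function) pairs over single- and multiword expressions."""
    counts = Counter()
    for sent in sentences:
        # Assumed layout: "swes" and "smwes" are dicts of lexical expressions.
        for group in ("swes", "smwes"):
            for expr in sent.get(group, {}).values():
                ss, ss2 = expr.get("ss"), expr.get("ss2")
                # Keep only expressions whose label looks like a SNACS supersense.
                if ss and ss.startswith("p."):
                    counts[(ss, ss2 or ss)] += 1
    return counts


# Hypothetical miniature input, made up for illustration (not taken from PASTRIE).
sample = [
    {
        "sent_id": "example-1",
        "swes": {
            "1": {"lexlemma": "in", "ss": "p.Locus", "ss2": "p.Locus"},
            "2": {"lexlemma": "house", "ss": None, "ss2": None},
        },
        "smwes": {},
    }
]

construal_counts = count_construals(sample)
for (role, function), n in construal_counts.most_common():
    # "~>" stands in for the SceneRole↝Function arrow.
    print(f"{role}~>{function}\t{n}")
```

To run this on the real data, replace the sample with count_construals(load_sentences("pastrie.json")), using whatever output path was passed to conllulex2json.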
(Note that the current pastrie.conllulex was created from an earlier version of it that did not contain some information such as LEXCAT. Cf. this commit.)
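To inspect pastrie.conllulex directly, a token line can be split on tabs into its 19 columns. The sketch below assumes the column order documented for STREUSLE (the ten CoNLL-U columns followed by SMWE, LEXCAT, LEXLEMMA, SS, SS2, WMWE, WCAT, WLEMMA, and LEXTAG). The names parse_token_line and detokenize and the sample rows are hypothetical, so verify the column order against the file before relying on it. The sketch also shows how SpaceAfter=No in the MISC column allows the raw text to be rebuilt from the tokens.

```python
# Assumed column order: 10 CoNLL-U columns plus 9 lexical-semantic columns.
COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
    "SMWE", "LEXCAT", "LEXLEMMA", "SS", "SS2", "WMWE", "WCAT", "WLEMMA", "LEXTAG",
]


def parse_token_line(line):
    """Split one tab-separated token line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def detokenize(tokens):
    """Rebuild raw text: add a space after a token unless MISC has SpaceAfter=No."""
    pieces = []
    for tok in tokens:
        pieces.append(tok["FORM"])
        if "SpaceAfter=No" not in tok["MISC"].split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Made-up rows for illustration; only the first 10 columns are filled in,
# and the remaining 9 are padded with "_" placeholders.
rows = [
    ("1", "I", "I", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "agree", "agree", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
lines = ["\t".join(row + ("_",) * 9) for row in rows]

tokens = [parse_token_line(line) for line in lines]
text = detokenize(tokens)
print(text)  # I agree.
```

Comment lines (starting with "#") and the blank lines separating sentences would need to be skipped before calling parse_token_line on a real file.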