
CHILDES-SRL

A corpus of semantic role labels auto-generated for 5M words of American-English child-directed speech.

Purpose

The purpose of this repository is to provide semantic role labels, both auto-generated and human-annotated, for American-English child-directed speech, together with the code used to generate them.

Inspiration and code for the BERT-based semantic role labeler come from the AllenNLP toolkit, which also provides an interactive SRL demo.

The code is for research purposes only.

Data

There are two manually annotated ("human-based") datasets, each named after the year of its release. The more recent dataset is an extended version of the earlier one and additionally includes SRL annotation for prepositions.

Further, this repository contains SRL labels produced by an automatic SRL tagger applied to a custom corpus of approximately 5M words of American-English child-directed language. The plain utterances are in data/pre_processed/childes-20191206_mlm.txt, and the file containing both utterances and SRL annotation is data/pre_processed/childes-20191206_srl.txt.
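
As a quick sanity check, the plain-utterance file can be inspected with a few lines of Python. This is only a sketch: it assumes one utterance per line and whitespace tokenization, which may not match the file's exact layout.

    # Minimal sketch: load the plain-utterance corpus and report its size.
    # Assumes one utterance per line and whitespace tokenization.
    from pathlib import Path

    corpus_path = Path("data/pre_processed/childes-20191206_mlm.txt")

    utterances = corpus_path.read_text(encoding="utf-8").splitlines()
    num_tokens = sum(len(line.split()) for line in utterances)

    print(f"{len(utterances):,} utterances, ~{num_tokens:,} tokens")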

History

Generating the CHILDES-SRL corpus

To annotate 5M words of child-directed speech with a semantic role tagger trained with AllenNLP, execute data_tools/make_srl_training_data_from_model.py.
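
For orientation, tagging a single utterance with an AllenNLP SRL predictor looks roughly like the sketch below. The model archive path is a placeholder, and the exact predictor and model version used by the script above may differ.

    # Sketch of tagging one utterance with an AllenNLP SRL predictor.
    # The archive path is a placeholder; substitute the BERT-based SRL model
    # actually used by data_tools/make_srl_training_data_from_model.py.
    from allennlp.predictors.predictor import Predictor

    predictor = Predictor.from_path("path/to/bert-base-srl-model.tar.gz")

    result = predictor.predict(sentence="the dog chased the ball into the yard")

    # Each predicted verb comes with BIO-style SRL tags over the words.
    for verb in result["verbs"]:
        print(verb["verb"], list(zip(result["words"], verb["tags"])))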

To generate a corpus of human-annotated semantic role labels for a small section of CHILDES, execute data_tools/make_srl_training_data_from_human.py.

Quality of auto-generated tags

How well does the AllenNLP SRL tagger perform on the CHILDES 2008 SRL data? Below are per-label F1 scores comparing its output with the annotations of trained human annotators.

      ARG-A1 f1= 0.00
      ARG-A4 f1= 0.00
     ARG-LOC f1= 0.00
        ARG0 f1= 0.95
        ARG1 f1= 0.93
        ARG2 f1= 0.79
        ARG3 f1= 0.44
        ARG4 f1= 0.80
    ARGM-ADV f1= 0.70
    ARGM-CAU f1= 0.84
    ARGM-COM f1= 0.00
    ARGM-DIR f1= 0.48
    ARGM-DIS f1= 0.68
    ARGM-EXT f1= 0.38
    ARGM-GOL f1= 0.00
    ARGM-LOC f1= 0.68
    ARGM-MNR f1= 0.68
    ARGM-MOD f1= 0.78
    ARGM-NEG f1= 0.99
    ARGM-PNC f1= 0.03
    ARGM-PPR f1= 0.00
    ARGM-PRD f1= 0.15
    ARGM-PRP f1= 0.39
    ARGM-RCL f1= 0.00
    ARGM-REC f1= 0.00
    ARGM-TMP f1= 0.84
      ARGRG1 f1= 0.00
      R-ARG0 f1= 0.00
      R-ARG1 f1= 0.00
  R-ARGM-CAU f1= 0.00
  R-ARGM-LOC f1= 0.00
  R-ARGM-TMP f1= 0.00
     overall f1= 0.88
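
For context, a simplified token-level per-label F1 over BIO tag sequences can be computed as in the sketch below. The repository's own evaluation may score spans rather than tokens, so treat this as illustrative only.

    # Illustrative token-level F1 per argument label from parallel BIO tag
    # sequences; the repository's own evaluation may differ.
    from collections import Counter

    def per_label_f1(gold_tags, pred_tags):
        tp, fp, fn = Counter(), Counter(), Counter()
        for gold, pred in zip(gold_tags, pred_tags):
            gold_label = gold.split("-", 1)[1] if "-" in gold else None
            pred_label = pred.split("-", 1)[1] if "-" in pred else None
            if pred_label is not None and pred_label == gold_label:
                tp[pred_label] += 1
            else:
                if pred_label is not None:
                    fp[pred_label] += 1
                if gold_label is not None:
                    fn[gold_label] += 1
        scores = {}
        for label in set(tp) | set(fp) | set(fn):
            p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
            r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
            scores[label] = 2 * p * r / (p + r) if p + r else 0.0
        return scores

    gold = ["B-ARG0", "B-V", "B-ARG1", "I-ARG1", "O"]
    pred = ["B-ARG0", "B-V", "B-ARG1", "O", "O"]
    print(per_label_f1(gold, pred))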

Compatibility

Tested on Ubuntu 16.04, Python 3.6, and torch==1.2.0