[Inter-annotator alignment] Labeling pronouns and other anaphoric expressions

izaskr commented 4 years ago

Bars are often referred to with pronouns, but labeling pronouns in the same way as the full names of bars would require extra manual effort. We hope the language model will learn to generate anaphoric expressions anyway. For example:

Next is Spain <x_axis_label_highest_value>. It has a gender pay gap of 15 %.

Here "It" is left unlabeled.

izaskr commented 4 years ago

Update: we decided to be more precise and run an off-the-shelf coreference resolution tool. This can help us keep track of entities and their respective labels.

I briefly tried two tools, both pre-trained neural systems:

neuralcoref, a spaCy extension implemented by Hugging Face. I used it together with the "en_core_web_sm" spaCy model.
SpanBERT, available in AllenNLP. To install it correctly, see the instructions here.

Both are relatively easy to use, but on a handful of examples from our dataset they differ in performance.

Example sentence:

"This chart shows that the majority of people in Zarqa prefer reading a book in an evening. Although this would suggest they may prefer solitude"

Output from tools:

neuralcoref: no coreferences found (probably because it could not disambiguate)
SpanBERT: reference cluster (the majority of people in Zarqa, they)

Another example:

In Europe spending was £270 million, in Asia it was £180 million and Africa was the highest with £290 million.

neuralcoref: no coreferences found
SpanBERT: reference cluster (spending, it)
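To compare tools on more examples, it helps to print clusters in the readable form used above. The following is a minimal sketch, assuming the resolver returns a dict with a "document" token list and "clusters" given as inclusive [start, end] token spans. That layout is my assumption about the tool output; the `example` dict is written by hand for illustration and was not produced by running a model.

```python
from typing import Dict, List


def cluster_mentions(output: Dict) -> List[List[str]]:
    """Turn token-index clusters into readable mention strings."""
    tokens = output["document"]
    readable = []
    for cluster in output["clusters"]:
        # Each span is [start, end] with an inclusive end index (assumed).
        readable.append(
            [" ".join(tokens[start:end + 1]) for start, end in cluster]
        )
    return readable


# Hand-written stand-in for a resolver's output on the second example.
example = {
    "document": ["In", "Europe", "spending", "was", "£270", "million", ",",
                 "in", "Asia", "it", "was", "£180", "million"],
    "clusters": [[[2, 2], [9, 9]]],
}

for mentions in cluster_mentions(example):
    print("reference cluster (" + ", ".join(mentions) + ")")
```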

Conclusion: based on these examples, I suggest using SpanBERT to find coreferences and labeling the anaphoric mentions in every cluster whose head reference already has a label.
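The suggestion could look roughly like the sketch below. It assumes labels are stored as a list parallel to the tokens (None for unlabeled tokens), that clusters are inclusive (start, end) token spans, and that the first mention of a cluster is its head. The function name `propagate_labels` and this data layout are hypothetical and not taken from the repository.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # inclusive token indices (assumed convention)


def propagate_labels(
    labels: List[Optional[str]],
    clusters: List[List[Span]],
) -> List[Optional[str]]:
    """Copy the head mention's label onto unlabeled coreferent mentions."""
    new_labels = list(labels)
    for cluster in clusters:
        head_start, head_end = cluster[0]  # first mention treated as head
        head_labels = [
            label for label in labels[head_start:head_end + 1]
            if label is not None
        ]
        if not head_labels:
            continue  # head has no label, so the cluster stays untouched
        for start, end in cluster[1:]:
            for i in range(start, end + 1):
                if new_labels[i] is None:  # never overwrite manual labels
                    new_labels[i] = head_labels[0]
    return new_labels


tokens = ["Next", "is", "Spain", ".", "It", "has", "a", "gender",
          "pay", "gap", "of", "15", "%", "."]
labels: List[Optional[str]] = [None] * len(tokens)
labels[2] = "x_axis_label_highest_value"  # label on "Spain"

# Hand-written cluster (Spain, It); not the output of a real model run.
updated = propagate_labels(labels, [[(2, 2), (4, 4)]])
print(list(zip(tokens, updated))[2:5])
```

Existing manual labels are never overwritten, so the step can be rerun safely after annotation changes.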

izaskr commented 4 years ago

Examples to discuss:

"Africa spends the least amount at 50 M dollars. This is exactly half of what is spend by Europe."
"this" refers to "50". SpanBERT identifies the coreference cluster (50 M dollars, this). We will have to decide which label "this" should carry, as each of the 3 tokens has its own label. I assume the best option is the label of "50".
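One possible rule for this case, sketched under assumptions: take the label of the first numeric token in the antecedent, and fall back to the first labeled token otherwise. The label names below (`LABEL_VALUE`, `LABEL_SCALE`, `LABEL_UNIT`) are hypothetical placeholders, not the labels used in the dataset, and the rule itself is only a proposal for discussion.

```python
import re
from typing import List, Optional

# Matches plain numbers such as "50", "2.5" or "1,000".
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")


def pick_antecedent_label(
    tokens: List[str],
    labels: List[Optional[str]],
) -> Optional[str]:
    """Choose one label for an anaphor from a multi-token antecedent."""
    # Prefer the label on a numeric token, e.g. the "50" in "50 M dollars".
    for token, label in zip(tokens, labels):
        if label is not None and NUMBER.match(token):
            return label
    # Otherwise fall back to the first labeled token, if there is one.
    return next((label for label in labels if label is not None), None)


chosen = pick_antecedent_label(
    ["50", "M", "dollars"],
    ["LABEL_VALUE", "LABEL_SCALE", "LABEL_UNIT"],  # placeholder label names
)
print(chosen)
```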

izaskr commented 4 years ago

The entire dataset (all summaries) has now been processed for coreferences. Some questions to discuss:

Do we want to label non-entities and their references? Entities are bars, axis names and bar heights. Example: This chart shows the average amount of minutes spent on social media daily , by age group . It shows the 15-24 year age group spent the most time on social media. Coreferences: this chart, it
Do we want to label bar height references? Example: 15 to 24 year olds spend 180 minutes, this more than halves to 70 minutes for the 55 to 64 year olds. Coreferences: 180 minutes, this

Some more examples to talk about:

The graphics shows a trend where the older the group the less time they spend daily on social media , as we can see the decline after the group 15-24. Coreferences: the group, they
This graph shows the average time spent on social media in maputo , by age group . The highest being 15-24 year olds , who spent around 175 minutes a day. This is followed by 25-34 at 160. Coreferences: 15-24 year olds , who spent around 175 minutes a day, This

izaskr / chart_descriptions
