Closed: ahalterman closed this issue 4 years ago
Hi,
Thank you for these great questions.
We merge the annotations in the following way. For slot-filling questions with text spans, if a chunk is chosen by 3 workers, it becomes the consensus annotation. However, we do notice cases like this: 2 workers choose chunk A and 1 worker chooses chunk B, where chunk A and chunk B overlap. In that case, we check whether the shortest common text span (the overlap) meets our cutoff of 3 workers. If it does, we mark both A and B as correct responses. We do not take the shortest common span as the single merged annotation, because in our inspection the longer span often looked better.
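In code, the rule looks roughly like this (a simplified sketch of the idea, not our actual merging script; spans are character offsets):

```python
from collections import Counter

def merge_spans(worker_spans, min_workers=3):
    # worker_spans: one (start, end) character span per worker vote for a slot.
    counts = Counter(tuple(s) for s in worker_spans)

    # A span chosen by >= min_workers workers is a consensus annotation.
    accepted = {span for span, n in counts.items() if n >= min_workers}

    # Partially overlapping votes (e.g. 2 votes for A, 1 for B, A overlaps B):
    # check whether the shortest common span (the intersection) is supported
    # by at least min_workers workers; if so, keep both A and B as correct.
    spans = list(counts)
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            lo, hi = max(a[0], b[0]), min(a[1], b[1])
            if lo < hi:  # a and b overlap
                support = sum(n for s, n in counts.items()
                              if s[0] <= lo and s[1] >= hi)
                if support >= min_workers:
                    accepted.update({a, b})
    return accepted
```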
To make the annotation feasible on a crowdsourcing platform, we directly provide annotators with choices to pick from, rather than asking them to select text spans themselves. The choices are extracted automatically with a Twitter tagging tool and mainly consist of noun phrase chunks. We do notice some errors made by the chunker, e.g., chunks containing extra tokens. During annotation, annotators are told it is OK to choose a chunk that contains 2-3 extra tokens if there is no better fit.
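For illustration, the candidate choices look roughly like the output of a noun chunker; here is a sketch using spaCy purely as a stand-in (the dataset itself was built with the Twitter tagging tool, so its chunk boundaries and errors will differ):

```python
import spacy  # stand-in chunker, not the Twitter tagging tool used for the dataset

nlp = spacy.load("en_core_web_sm")

def candidate_choices(tweet_text):
    """Return noun-phrase chunks with character offsets, roughly the kind of
    candidate choices annotators pick from."""
    doc = nlp(tweet_text)
    return [(chunk.text, chunk.start_char, chunk.end_char)
            for chunk in doc.noun_chunks]
```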
Not sure if this answers your concerns? Happy to discuss more here.
Thanks,
Thanks! That helps a lot.
To clarify (1), how does the evaluation script handle the duplicates when it iterates over the gold spans? [here] If it's checking for an exact match, at least one of the overlapping spans will be a false negative.
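Concretely, what I mean is that a strict exact-match loop (hypothetical scorer, not the actual script) would count the duplicated gold span as a miss:

```python
def exact_match_counts(gold_spans, predicted_spans):
    # Hypothetical strict scorer: every gold span must be predicted exactly.
    pred = {tuple(s) for s in predicted_spans}
    tp = fn = 0
    for span in map(tuple, gold_spans):
        if span in pred:
            tp += 1
        else:
            fn += 1
    return tp, fn

# With gold [[200, 218], [189, 218]] and a prediction of (189, 218),
# exact_match_counts(...) gives tp=1, fn=1: the overlapping variant is
# counted as a false negative even though the answer was found.
```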
Re (2), that makes a lot of sense for annotation. We had started building a token-level classifier so we could use some token-level grammatical features, but I think we'll switch to classifying the provided noun chunks.
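i.e., framing it roughly like this (our own formulation, just to make sure I understand the setup; the chunk tuples are assumed to be (text, start, end) character spans):

```python
def chunk_classification_examples(tweet_text, chunks, gold_spans, slot):
    # Each candidate chunk becomes one binary example for the given slot,
    # labeled positive if its character span is among the merged gold spans.
    gold = {tuple(s) for s in gold_spans}
    return [
        {"tweet": tweet_text, "slot": slot, "chunk": text,
         "label": int((start, end) in gold)}
        for text, start, end in chunks
    ]
```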
Yes, I am working on updating the evaluation script to deal with this. I also plan to manually go through all tweets in the test set to fix the issue, i.e., keep only one of the overlapping spans in the candidate choices.
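Something along these lines, i.e., keep a single span per answer (just a sketch of the clean-up I have in mind, not the final fix):

```python
def keep_one_of_overlapping(spans):
    # Sort longest-first and drop any span that overlaps one we already kept,
    # so each answer is represented by a single (here: the longer) span.
    spans = sorted((tuple(s) for s in spans),
                   key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for start, end in spans:
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return kept

# keep_one_of_overlapping([[200, 218], [189, 218]]) -> [(189, 218)]
```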
I'm working on the W-NUT challenge and I have some questions about the data annotation process that have come up while looking through a few randomly selected tweets.
How were the annotations from different annotators combined in the baseline model? In tweet 1237171610253053952, there are overlapping spans for one of the slots:
'part2-name.Response': [[200, 218], [189, 218]]
The same thing happens in 1238163898324398087: 'part2-what_cure.Response': [[45, 54], [27, 54]]
. I've taken the union of the tokens (see the small sketch at the end of this comment), but it made me wonder about how annotations were merged in the paper.
Can you provide some more information on how the slots are defined and how annotators were trained? There seem to be lots of inconsistencies in how tokens are labeled. For instance, in tweet 1238406951232589824, "Coronavirus: Arsenal Coach Mikel Arteta And Chelsea’s Callum Hudson-Odoi Test Positive https://t.co/9MRwGENNhp https://t.co/5Uy5ojAY5M", "Arsenal" is in the "employer" slot, but "Chelsea" is inside one long "name" slot that contains two separate people: "Mikel Arteta And Chelsea’s Callum Hudson-Odoi".
In a different tweet (1238241958490931208), "BREAKING NEWS! Arsenal Head Coach Tests Positive For Coronavirus (Read Details) https://t.co/8CnQr7Ua8H", "Arsenal" is part of "name", not "employer".
Another example where the "name" slot seems to have too many words is tweet 1237171610253053952, "@ABC @morningmika @realDonaldTrump @morningMika Where are OUR tests? So South Korea has drive in tests and Germany has drive in tests and the rest of the world has tests and we can't get a test for my very sick child?". Here, the "name" slot includes the entire phrase 'a test for my very sick child'.
Are these artifacts of the noun chunker that was used during annotation?
I know that this is messy text, and I know from experience how difficult annotation projects are. Any guidance you can give us on how the annotators were trained, how each slot was defined, measures of coder agreement, etc., would be really helpful as we try to build a model!
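For reference, the union I mentioned above is computed roughly like this (a hypothetical helper over the character offsets, not code from the repo):

```python
def span_union(spans):
    # Merge overlapping (start, end) character spans into their union.
    spans = sorted(tuple(s) for s in spans)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start <= merged[-1][1]:   # overlaps (or touches) the previous span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(s) for s in merged]

# span_union([[200, 218], [189, 218]]) -> [(189, 218)]
```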