ipavlopoulos / toxic_spans

Detect toxic spans in toxic texts
Creative Commons Zero v1.0 Universal

Clarify how invalid annotations will be handled #7

Open GillesJ opened 3 years ago

GillesJ commented 3 years ago

Now that multiple erroneous spans have been found, it remains unclear how the task organizers will handle distribution of corrected data.

I wouldn't mind some clarification on these topics.

ipavlopoulos commented 3 years ago

Thank you for bringing this up @GillesJ. No action will be taken for the time being, because preliminary experiments showed that ignoring the invalid annotations leaves the evaluation practically unchanged. We are validating this result with further experiments, but unless stated otherwise, the dataset will remain as it is.
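
For concreteness, "ignoring invalid annotations" here roughly means dropping character offsets that fall outside the post text before scoring. A minimal sketch (the `drop_invalid_offsets` helper is illustrative only, not the official evaluation code):

```python
def drop_invalid_offsets(span, text):
    """Keep only character offsets that fall inside the post text.

    `span` is assumed to be a list of integer character offsets and
    `text` the post they annotate; this is a sketch, not the official
    evaluation code.
    """
    return [i for i in span if 0 <= i < len(text)]


# Example: an annotation that runs past the end of the text.
text = "you are a fool"
gold = [10, 11, 12, 13, 14, 15]          # 14 and 15 are out of range
print(drop_invalid_offsets(gold, text))  # [10, 11, 12, 13]
```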

GillesJ commented 3 years ago

Thanks for the fast response @ipavlopoulos; it is good to hear that the invalid annotations do not affect test scores!

When you have more validation on this, perhaps it could be communicated via the Google group. Until then, I will hold off on closing this issue so your response stays visible for others to read.

sorensenjs commented 3 years ago

Gilles - in this pull request https://github.com/ipavlopoulos/toxic_spans/pull/6 there is a modified version of the spaCy-based baseline. Running it 5 times gives the following comparison of training and scoring on the trial data, both with and without the fix_spans correction.

The results of five runs, with each row pairing the scores from the two conditions:

- Run 1: avg F1 0.565999 / 0.566208
- Run 2: avg F1 0.589714 / 0.589923
- Run 3: avg F1 0.582272 / 0.58248
- Run 4: avg F1 0.596648 / 0.596857
- Run 5: avg F1 0.581425 / 0.581634

These show that the difference attributable to these edits is close to negligible.
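
For reference, the averages above are per-post character-offset F1 scores. A minimal sketch of that metric (my own reimplementation for illustration, not necessarily identical to the repository's evaluation script):

```python
def span_f1(predicted, gold):
    """Character-offset F1 between one predicted and one gold span set.

    Both arguments are iterables of integer character offsets. This is a
    sketch of the metric as I understand it, not the official script.
    """
    pred, gold = set(predicted), set(gold)
    if not gold:
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(predictions, golds):
    """Mean per-post F1, as reported in the run comparison above."""
    scores = [span_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)
```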

We discussed incorporating this into the scoring script and perhaps recommending it, but we think that not making this change keeps the playing field level for the participants.

That said, as we intend to release all of the data, including the test set annotations, at the end of the competition period, we agree that providing the most accurate annotations possible will be of ongoing value to the research community. Therefore, we intend to version the annotations and update them based on feedback.

Unanswered is the question of whether using fix_spans or other hand edits is beneficial to the machine learning task. I think this choice is best left to the competitors themselves, but there is still time before the competition starts if there is a compelling reason for us as organizers to change our minds. Once the leaderboard is active, it is unlikely that we would consider changes to the test set. But, as of this moment, we do not intend to make changes.
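
If a team does want to apply fix_spans or similar cleaning as a preprocessing choice, a minimal sketch could look like the following; the file name `tsd_train.csv`, the column layout (`spans` as a stringified list of offsets, `text` for the post), and the `fix_spans` import path are assumptions to adjust for your own checkout:

```python
import ast

import pandas as pd

# Assumed import path; point this at wherever the span-cleaning helper
# from PR #6 lives in your checkout of the repository.
from fix_spans import fix_spans

# Assumed file name and column layout: "spans" holds a stringified list
# of character offsets, "text" holds the post itself.
train = pd.read_csv("tsd_train.csv")
train["spans"] = train["spans"].apply(ast.literal_eval)

# Optional preprocessing, left to each team: clean the gold spans before
# training on them.
train["spans"] = [fix_spans(spans, text)
                  for spans, text in zip(train["spans"], train["text"])]
```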

GillesJ commented 3 years ago

The main risk, competition-wise, is that some teams will not correct errors, assuming a hand-made gold standard has no annotation artifacts. This could lead to performance differences that are not attributable to their modeling approach. In my opinion, annotation correction should not be regarded as part of the modeling in the way data augmentation is. I would think the playing field would be more level with annotations that are as correct as possible.

Unless it is too much work organization-wise, I would personally err on the side of releasing corrected data up until leaderboard activation. That is what our research group has typically done in the shared tasks we have organized.

GillesJ commented 3 years ago

In short: not distributing a fix duplicates effort for all participants, however easy the fix is for each of us to apply ourselves.