sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
https://nlpprogress.com/
MIT License

a relationship extraction issue #220

Open karlhugle opened 5 years ago

karlhugle commented 5 years ago

This concerns the list at https://github.com/sebastianruder/NLP-progress/blob/master/english/relationship_extraction.md

I would like to point out a data issue.

A new model for distantly supervised relationship extraction that uses the same training dataset (522611) can be compared directly with the results of the models (PCNN+ATT, PCNN+ONE, etc.) reported in Lin's paper (Lin et al., 2016).
(The cleaned dataset was updated by Lin and can be downloaded from https://github.com/thunlp/NRE)

The problem is that some new papers (e.g. two at EMNLP 2018 and one at AAAI 2019) used the unprocessed data (570088), which contains duplicated instances in the test set. The unclean data gives higher but unreliable results.

This has already been discussed in https://github.com/thunlp/NRE/issues/16 and https://github.com/thunlp/OpenNRE/issues/27

The unclean data has been tested, and it does affect the results.
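
For anyone who wants to check their own copy of the data, a minimal sketch of counting and dropping duplicated instances might look like the following. The field names (`head`, `tail`, `relation`, `sentence`) and the toy records are assumptions for illustration only; they are not the actual format of the NRE files, so the key function would need adapting.

```python
from collections import Counter


def instance_key(inst):
    # Hypothetical fields; adapt to the real file format.
    return (
        inst["head"],
        inst["tail"],
        inst["relation"],
        inst["sentence"].strip(),
    )


def count_duplicates(instances):
    """Number of instances that repeat an earlier, identical instance."""
    counts = Counter(instance_key(i) for i in instances)
    return sum(c - 1 for c in counts.values() if c > 1)


def deduplicate(instances):
    """Keep the first occurrence of each instance, preserving order."""
    seen = set()
    unique = []
    for inst in instances:
        key = instance_key(inst)
        if key not in seen:
            seen.add(key)
            unique.append(inst)
    return unique


# Toy data (made up) with one exact duplicate.
toy_test_set = [
    {"head": "A", "tail": "B", "relation": "r1", "sentence": "A met B."},
    {"head": "A", "tail": "B", "relation": "r1", "sentence": "A met B."},
    {"head": "C", "tail": "D", "relation": "r2", "sentence": "C left D."},
]

n_dup = count_duplicates(toy_test_set)
clean = deduplicate(toy_test_set)
print(f"{n_dup} duplicate(s), {len(clean)} unique instance(s)")
```

Running the same check on both versions of the test set would show whether a reported result was computed on data that still contains duplicates.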

sebastianruder commented 5 years ago

Thanks for this note! Could you add a note to the relevant section and indicate with a symbol the methods that use a different setup?

weilonghu commented 5 years ago

I also noticed this problem and emailed some authors, but they have different opinions on this issue.