hadyelsahar / RE-NLG-Dataset

T-Rex : A Large Scale Alignment of Natural Language with Knowledge Base Triples
MIT License

Wrong entity boundary ".303␠" #1

Closed by ejls 5 years ago

ejls commented 5 years ago

I feel like the boundaries of some entities are wrong. The errors I found were on gun caliber entities: the end of the entity is set to the beginning of the following word.

For example, in the file re-nlg_1120000-1130000.json, the document with "docid": "http://www.wikidata.org/entity/Q3783769" contains the following sentence:

The armament of the aircraft was one fixed forward firing .303 in (7.7 mm) Vickers gun [...]

The span "firing .303 in" is cut into the following 4 words:

       .↓   in↓
       ↓↓   ↓ ↓
firing .303 in
↑     ↑ ↑  ↑
firing↑ 303↑

However, the right boundary of the entity ".303" falls on the i instead of on the whitespace:

firing .303 in
       ↑    ↑
       .303?↑
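
To make the off-by-one concrete, here is a minimal sketch on the phrase alone. The offsets below are relative to this short string, not to the document, and the end offset is assumed to be exclusive; `surface` is a hypothetical helper, not part of the dataset tooling.

```python
text = "firing .303 in"

expected = (7, 11)  # end offset sits on the whitespace after ".303"
reported = (7, 12)  # end offset sits on the "i" of "in"


def surface(text, span):
    """Return the substring covered by a (start, end) span, end exclusive."""
    start, end = span
    return text[start:end]


expected_surface = surface(text, expected)
reported_surface = surface(text, reported)

print(repr(expected_surface))  # '.303'
print(repr(reported_surface))  # '.303 '
```

The reported span swallows the space that separates ".303" from "in", which is the extra character referred to in the issue title.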

You can find similar errors in the following documents:

Apart from this small mistake, it seems the dataset is rather clean, thank you for your work!

hadyelsahar commented 5 years ago

Thanks for reporting these bugs, @ejls. How often do you see them? I have used the dataset for several papers before and found only very minor problems with the offsets.

If some offsets are correct, then I assume this might be a problem of the tagger itself returning the wrong offset. I found similar issues with DBpedia Spotlight, for example.

ejls commented 5 years ago

I'm extracting triples where the subject/object annotators are Wikidata_Spotlight_Entity_Linker and the predicate/triple annotators are not NoSubject-Triple-aligner. Under these constraints, 15 entities appearing in 24 triples end at the start of a word. They are:

docid    entity start  surface form with extra character
45217    11            .303
172509   68            .38
438613   206           .303
795282   733           .303
880735   658           .303
1067915  154           .303
1861222  47            .303
2295297  1194          .357
3783769  845           .303
3783769  882           .303
4346348  90            .303
4545268  56            .32
4871714  1868          .303
5188080  103           .177
7293234  551           .303
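
A scan like the one described above could be sketched as follows. This is not the script that produced the table. The field names ("text", "triples", "subject", "object", "predicate", "annotator", "boundaries") are assumptions about the JSON layout, the end offset is assumed to be exclusive, and the filter takes one reading of the constraint: both subject and object must come from the Spotlight linker. The aligner names in the sample document are placeholders.

```python
SPOTLIGHT = "Wikidata_Spotlight_Entity_Linker"
NO_SUBJECT = "NoSubject-Triple-aligner"


def ends_at_word_start(text, end):
    """True if an exclusive end offset sits on the first character of a word."""
    return 0 < end < len(text) and text[end - 1].isspace() and not text[end].isspace()


def suspicious_entities(doc):
    """Yield (docid, start, surface) for entities whose span swallows a space.

    All dictionary keys except "docid" are assumed, not taken from the issue.
    """
    text = doc["text"]
    seen = set()
    for triple in doc["triples"]:
        # Skip triples produced by the excluded aligner.
        if NO_SUBJECT in (triple["annotator"], triple["predicate"]["annotator"]):
            continue
        entities = (triple["subject"], triple["object"])
        if any(e["annotator"] != SPOTLIGHT for e in entities):
            continue
        for entity in entities:
            start, end = entity["boundaries"]
            if (start, end) not in seen and ends_at_word_start(text, end):
                seen.add((start, end))  # report each entity once per document
                yield doc["docid"], start, text[start:end]


# Synthetic document for illustration only.
doc = {
    "docid": "example",
    "text": "firing .303 in",
    "triples": [
        {
            "annotator": "placeholder-aligner",
            "predicate": {"annotator": "placeholder-property-linker"},
            "subject": {"annotator": SPOTLIGHT, "boundaries": [0, 6]},
            "object": {"annotator": SPOTLIGHT, "boundaries": [7, 12]},
        }
    ],
}

found = list(suspicious_entities(doc))
print(found)  # [('example', 7, '.303 ')]
```

Deduplicating on the span reflects the distinction in the report between entities (15) and the triples they appear in (24).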

It's easy to fix downstream once you know about it, so if it's too bothersome to fix it on your side you can close the issue. :)
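
One way to apply such a downstream fix is to shrink the end offset until the span no longer ends in whitespace. This is a minimal sketch, assuming exclusive end offsets; `trim_trailing_whitespace` is a hypothetical helper name.

```python
def trim_trailing_whitespace(text, start, end):
    """Shrink an exclusive end offset so the span does not end in whitespace."""
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "firing .303 in"
fixed = trim_trailing_whitespace(text, 7, 12)
print(fixed)                      # (7, 11)
print(repr(text[fixed[0]:fixed[1]]))  # '.303'
```

Spans that are already correct pass through unchanged, so the helper can be applied to every entity without first detecting the faulty ones.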

hadyelsahar commented 5 years ago

Fair enough. Wikidata_spotlight is DBpedia Spotlight with a sameAs mapper. Unfortunately, we cannot do anything about it; it is a problem with DBpedia Spotlight. Thanks for reporting that :)