Closed ejls closed 5 years ago
Thanks for reporting these bugs, @ejls. How often do you see them? I have used the dataset for several papers and found only very minor problems with the offsets.
If some offsets are correct, then I assume this is a problem of the tagger itself returning the wrong offset. I found similar issues with DBpedia Spotlight, for example.
I'm extracting triples where the subject/object annotators are `Wikidata_Spotlight_Entity_Linker` and the predicate/triple annotators are not `NoSubject-Triple-aligner`.
Under these constraints, 15 entities appearing in 24 triples end at the start of the following word. They are:
docid | entity start | surface form with extra character |
---|---|---|
45217 | 11 | .303 |
172509 | 68 | .38 |
438613 | 206 | .303 |
795282 | 733 | .303 |
880735 | 658 | .303 |
1067915 | 154 | .303 |
1861222 | 47 | .303 |
2295297 | 1194 | .357 |
3783769 | 845 | .303 |
3783769 | 882 | .303 |
4346348 | 90 | .303 |
4545268 | 56 | .32 |
4871714 | 1868 | .303 |
5188080 | 103 | .177 |
7293234 | 551 | .303 |
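For anyone who wants to run a similar check, here is a minimal sketch in Python. It is not the script used to build the table above. It assumes entity boundaries are character offsets `[start, end)` with an exclusive end, and the helper name `ends_on_next_word` is hypothetical.

```python
def ends_on_next_word(text, start, end):
    """Return True if the span [start, end) ends at the start of the next word.

    Assumption: `end` is exclusive, so a correct span never ends in
    whitespace. A span is flagged when its last character is whitespace
    and the character at `end` begins a new word.
    """
    span = text[start:end]
    if not span or not span[-1].isspace():
        return False
    return end < len(text) and not text[end].isspace()


# Illustration on the fragment quoted in this issue.
fragment = "firing .303 in"
print(ends_on_next_word(fragment, 7, 12))  # end lands on the "i" of "in"
print(ends_on_next_word(fragment, 7, 11))  # end lands on the whitespace
```

The offsets 7, 11 and 12 are relative to the short fragment only, not to the original document.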
It's easy to fix downstream once you know about it, so if it's too bothersome to fix it on your side you can close the issue. :)
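The downstream fix can be sketched roughly like this. It is only a sketch under assumptions: entities are taken to be dicts with a `boundaries` pair `[start, end)` (exclusive end) and a `surfaceform` string, and the helper names `trim_boundaries` and `fix_entity` are hypothetical.

```python
def trim_boundaries(text, start, end):
    """Shrink the span [start, end) until it no longer ends in whitespace."""
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def fix_entity(text, entity):
    """Return a copy of `entity` with trimmed boundaries and surface form.

    Assumed fields: entity["boundaries"] == [start, end] (end exclusive)
    and entity["surfaceform"]. The input dict is left unchanged.
    """
    start, end = trim_boundaries(text, *entity["boundaries"])
    fixed = dict(entity)
    fixed["boundaries"] = [start, end]
    fixed["surfaceform"] = text[start:end]
    return fixed


# Illustration on the fragment quoted in this issue (offsets are local to it).
fragment = "firing .303 in"
entity = {"boundaries": [7, 12], "surfaceform": ".303 "}
print(fix_entity(fragment, entity))
```

Trimming from the right only touches spans that end in whitespace, so entities whose offsets are already correct pass through unchanged.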
Fair enough. Wikidata_spotlight is DBpedia Spotlight with a sameAs mapper. Unfortunately we cannot do anything about it, since it is a problem in DBpedia Spotlight. Thanks for reporting that :)
I feel like the boundaries of some entities are wrong. The errors I found were on gun caliber entities: the end of the entity is set to the beginning of the following word.
For example, in the file `re-nlg_1120000-1130000.json`, document `"docid": "http://www.wikidata.org/entity/Q3783769"`, there is the following sentence:
The `firing .303 in` is cut into the following 4 words:
However, the right boundary of entity ".303" is on the `i` instead of the whitespace:
You can find similar errors in the following documents:
Apart from this small mistake, the dataset seems rather clean. Thank you for your work!