linkedtv / wp2


THD: wrong disambiguation of the entity "Martena Museum" #1

Closed lyndonnixon closed 10 years ago

lyndonnixon commented 10 years ago

"Martena Museum" in the transcript was labeled with http://nl.dbpedia.org/resource/Museum (by THD) even though http://nl.dbpedia.org/resource/Museum_Martena exists and "Martena Museum" redirects to it.

rtroncy commented 10 years ago

I'm not sure I understand the issue: http://nl.dbpedia.org/resource/Museum_Martena and http://nl.dbpedia.org/resource/Museum are two different resources, and I don't see any redirect between those two resources.

It seems to me that the issue is mislabeled, and that you are basically disappointed that THD did not provide a good enough disambiguation URI for the entity it detected. Please comment, @lyndonnixon.

kliegr commented 10 years ago

After the last THD update, "Museum Martena" is now correctly disambiguated as http://nl.dbpedia.org/resource/Museum_Martena and assigned the type http://dbpedia.org/ontology/Museum. You can verify this at http://ner.vse.cz/thd/ by providing the input "Museum Martena". Lyndon, can you check with your input and close the issue if the result is O.K.?

lyndonnixon commented 10 years ago

@rtroncy Sorry if my issue was unclear. I assumed THD matches strings against terms in DBpedia and YAGO, so it could detect "Museum" as http://nl.dbpedia.org/resource/Museum but not "Martena Museum", because the DBpedia entity is labelled "Museum Martena" (http://nl.dbpedia.org/resource/Museum_Martena).
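To make that assumption concrete, here is a minimal sketch of the kind of label lookup I had in mind. It is a toy and says nothing about how THD actually works: the two URIs are the ones from this issue, while the `LABELS` table and both lookup functions are hypothetical. It shows why an exact label match would find "Museum" but miss "Martena Museum", and how an order-insensitive key would recover it.

```python
# Toy label index. The URIs come from this issue; everything else is hypothetical.
LABELS = {
    "Museum": "http://nl.dbpedia.org/resource/Museum",
    "Museum Martena": "http://nl.dbpedia.org/resource/Museum_Martena",
}


def exact_lookup(surface):
    """Return the URI whose label equals the surface form exactly, else None."""
    return LABELS.get(surface)


def token_key(text):
    """Order-insensitive, case-insensitive key: the set of lowercased tokens."""
    return frozenset(text.lower().split())


# Index the same labels by their token sets.
TOKEN_INDEX = {token_key(label): uri for label, uri in LABELS.items()}


def order_insensitive_lookup(surface):
    """Return the URI whose label has the same tokens in any order, else None."""
    return TOKEN_INDEX.get(token_key(surface))
```

With this table, `exact_lookup("Martena Museum")` finds nothing, while `order_insensitive_lookup("Martena Museum")` resolves to the Museum_Martena resource. A real system would more likely rely on redirects or alternative labels than on token reordering.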

lyndonnixon commented 10 years ago

@kliegr Actually, the issue remains: the transcript has "Martena Museum" and this is not detected. Below is the original part of the subtitles, which I just tested again with THD; the string "Martena Museum" is still labelled with http://nl.dbpedia.org/resource/Museum. I acknowledge this is a typical issue in entity extraction, and I am just wondering whether there are ideas for how to fix it.

Deze meneer, Hessel van Martena,
was de eerste bewoner van wat nu het Martena Museum is in Franeker.

rtroncy commented 10 years ago

@lyndonnixon Indeed, the issue was not clear. Your first assumption is a bit naive. The role of a named entity extractor is not just to match strings against a dictionary, gazetteer, or knowledge base; it generally involves POS tagging and heavier NLP processing.

Your criticism is that none of the NER tools correctly detected and disambiguated "Martena Museum". Doing a quick experiment with the text you provided:

@kliegr You told us this morning that the new THD deployed a month ago was better at disambiguating this string, but that is not the case, at least via NERD. Can you please elaborate?

rtroncy commented 10 years ago

@kliegr Any update on this issue? The recent test we performed more or less contradicts what you said during the 05/08/2014 Telecon, and THD still seems to disambiguate this resource wrongly.

It is normal for an extractor to make a mistake. I'm more worried that you get different results than we do when using the same tool. Can you track down the problem?

kliegr commented 10 years ago

I must have made a mistake when testing whether this problem was resolved. Let's leave this issue open for now; we will reinvestigate it.

m1ci commented 10 years ago

Hi, with the latest improvements in THD (release 3.9), this issue is solved: the entity "Martena Museum" is now correctly disambiguated as http://nl.dbpedia.org/resource/Museum_Martena

Test request: curl -X POST -d "Martena Museum" "http://entityclassifier.eu/thd/api/v2/extraction?lang=nl&entity_type=all&provenance=thd&types_filter=dbo&knowledge_base=linkedHypernymsDataset&apikey=xxxxxxxxxx&format=json"
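For anyone who prefers to script the check, the same request can be sketched in Python with only the standard library. The endpoint and query parameters are copied from the curl command above; the function name `build_thd_request` is just an illustration, the API key stays a placeholder, and I am assuming the service accepts the raw text as the POST body exactly as curl sends it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl command in this comment.
THD_ENDPOINT = "http://entityclassifier.eu/thd/api/v2/extraction"


def build_thd_request(text, apikey, lang="nl"):
    """Build (but do not send) a POST request mirroring the curl test request."""
    params = {
        "lang": lang,
        "entity_type": "all",
        "provenance": "thd",
        "types_filter": "dbo",
        "knowledge_base": "linkedHypernymsDataset",
        "apikey": apikey,  # placeholder, replace with a real key
        "format": "json",
    }
    url = THD_ENDPOINT + "?" + urlencode(params)
    # The text to annotate is sent as the request body, as with curl -d.
    return Request(url, data=text.encode("utf-8"), method="POST")


req = build_thd_request("Martena Museum", apikey="xxxxxxxxxx")
# To actually send it: urllib.request.urlopen(req).read()
```

Building the request separately from sending it makes it easy to inspect the URL and body before hitting the service.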

rtroncy commented 10 years ago

@m1ci Thanks. I ran this sentence in NERD with the THD extractor selected; see the results at http://nerd.eurecom.fr/annotation/1516114.

Indeed, the surface form "Martena Museum" is now disambiguated as http://nl.dbpedia.org/resource/Museum_Martena.

However, the first occurrence, "Hessel van Martena", is split into "Hessel" (wrong person) and "Martena" (again disambiguated as the museum), instead of the complete surface form being disambiguated as the person http://nl.dbpedia.org/resource/Hessel_Martena. Can you please comment on this?
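To illustrate the behaviour I would expect, here is a small sketch of greedy longest-match spotting over a surface-form table. This is only an illustration of the idea and not a description of THD's spotting approach: the URIs are the ones mentioned in this issue, while the `SURFACE_FORMS` table and the `spot` function are hypothetical.

```python
# Hypothetical surface-form table, keyed by lowercased token tuples.
SURFACE_FORMS = {
    ("hessel", "van", "martena"): "http://nl.dbpedia.org/resource/Hessel_Martena",
    ("martena", "museum"): "http://nl.dbpedia.org/resource/Museum_Martena",
    ("museum",): "http://nl.dbpedia.org/resource/Museum",
}
MAX_LEN = max(len(key) for key in SURFACE_FORMS)


def spot(text):
    """Return (surface form, URI) pairs, always preferring the longest match."""
    tokens = [t.strip(".,") for t in text.split()]
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in SURFACE_FORMS:
                spans.append((" ".join(tokens[i:i + n]), SURFACE_FORMS[key]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no match starts at this token
    return spans


sentence = ("Deze meneer, Hessel van Martena, was de eerste bewoner "
            "van wat nu het Martena Museum is in Franeker.")
spans = spot(sentence)
```

On the sentence from this issue, the sketch keeps "Hessel van Martena" and "Martena Museum" as single spans instead of splitting them, provided the full surface forms are present in the table.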

m1ci commented 10 years ago

Well, this is a mistake made by our entity spotting approach.

Maybe we can close this "bug" issue and open a new "enhancement" issue, "entity spotting improvement for NL", with a reference to the "Hessel van Martena" case?

rtroncy commented 10 years ago

OK, I understand that tools are not perfect and will always make errors, so that was not my concern. I don't want THD to be optimized for this particular sentence. What I'm worried about is the lack of a proper evaluation of THD on the LinkedTV scenario. Is this something you will provide in 2.6?

I'm happy to close this issue, but I would like to see some human assessment of the THD results for a few LinkedTV seed video programs and chapters, e.g. the 6 chapters defined in the LinkedCulture scenario. It does not need to go as far as computing precision and recall, but it should be enough to show that you have a real feeling for what the performance is. Can you provide this? If the Dutch language is an issue, you can also rely on Lotte for help.