THD: need for getting a relevance score for the detected entities

jluisred commented 10 years ago

According to the THD v.2 documentation, THD does not expose any score regarding the relevance of the detected entities, in the way other extractors such as TextRazor or Alchemy do.

Those scores are useful and would help to better execute other processes such as Named Entity Expansion. Is there any plan to have something like this in THD as well?

rtroncy commented 10 years ago

Planned for the 25/08 new release of IRAPI according to the WP2 telecon minutes

m1ci commented 10 years ago

Hi, in the latest release of THD (v3.9) we include such information, which we call "entity salience". Entities with a high salience score play an important role in the story described in the document. For each entity, we assign:

entity salience class (most salient, less salient, not salient)
confidence (for the salience class)
salience score [0..1], where a higher score indicates a higher focus of attention

You can read more about entity salience in the THD v2.0 documentation.

Additionally, in the API output we also include confidence scores for the linking and classification!
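
As a rough illustration of how a client (for example a Named Entity Expansion step) might use the salience score, the sketch below keeps only entities above a threshold and orders them by score. The field names (`label`, `salience`, `score`, `class`, `confidence`), the threshold, and the sample values are assumptions made for the example, not the documented THD output schema.

```python
from typing import Any


def top_salient(entities: list[dict[str, Any]], min_score: float = 0.5) -> list[dict[str, Any]]:
    """Return entities whose salience score is >= min_score, highest score first."""
    kept = [e for e in entities if e.get("salience", {}).get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda e: e["salience"]["score"], reverse=True)


# Hypothetical entities; field names and values are made up for illustration.
sample = [
    {"label": "Harvard Law School",
     "salience": {"class": "less salient", "confidence": 0.70, "score": 0.31}},
    {"label": "Barack Obama",
     "salience": {"class": "most salient", "confidence": 0.90, "score": 0.92}},
    {"label": "Honolulu",
     "salience": {"class": "less salient", "confidence": 0.60, "score": 0.55}},
]

for entity in top_salient(sample):
    print(entity["label"], entity["salience"]["score"])
```

Entities without a salience block are treated as score 0.0 and dropped, which may or may not be what a given client wants.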

giusepperizzo commented 10 years ago

cool job, thanks!

We (EURECOM folks) want to use these scores for deciding which one of the different types better classifies the entity. For instance, analyzing this text:

"Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review."

THD classifies the entity "Harvard Law School" with 6 different types. If we pick up the types with the highest classificationConfidence score (0.857) we end up having two types: Schools, Agent. These two types have also identical linkingConfidence scores, as well as the salience values.

Now, we are a bit puzzled. How can we decide which type better defines the entity class? Is the order in which the types are packed by THD relevant?

m1ci commented 10 years ago

Hi, let me first clarify what the values mean:

linkingConfidence: estimated probability that the entity mention is correctly linked to the DBpedia resource
classificationConfidence: estimated probability that the type is correct for the given DBpedia resource (entityURI)
salience: estimated salience of the entity in the document, i.e. to what extent the entity is in the focus of attention in the document

Back to your question Q: "How can we decide which type better defines the entity class?" - well, maybe you can go for the most specific DBpedia Ontology type. So, in the scenario with the two types "Schools, Agent" you can just pick the most specific one.

Q: Is the order in which the types are packed by THD relevant? - Definitely not. :) Please do not rely on it.
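
One possible way to pick "the most specific DBpedia Ontology type" is to compare how deep each candidate sits in the class hierarchy. The sketch below is only an illustration: `PARENT` is a small hand-written excerpt of the hierarchy (an assumption for this example, with Agent treated as the root of the excerpt), and a real client would load the hierarchy from the DBpedia Ontology itself.

```python
from typing import Optional

DBO = "http://dbpedia.org/ontology/"

# Hand-written excerpt of the class hierarchy (child -> parent).
# Assumption for this sketch; not loaded from the actual ontology.
PARENT: dict[str, Optional[str]] = {
    DBO + "Agent": None,
    DBO + "Organisation": DBO + "Agent",
    DBO + "EducationalInstitution": DBO + "Organisation",
    DBO + "School": DBO + "EducationalInstitution",
}


def depth(type_uri: str) -> int:
    """Count the ancestors of type_uri in PARENT; unknown types get -1 so they lose ties."""
    if type_uri not in PARENT:
        return -1
    steps = 0
    node = PARENT[type_uri]
    while node is not None:
        steps += 1
        node = PARENT.get(node)
    return steps


def most_specific(type_uris: list[str]) -> str:
    """Return the candidate that sits deepest in the hierarchy excerpt."""
    return max(type_uris, key=depth)


print(most_specific([DBO + "Agent", DBO + "School"]))
```

Types that are not in the hierarchy (for instance URIs from the DBpedia resource namespace) are ranked below every known ontology class here; that is a design choice of the sketch, not THD behaviour.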

giusepperizzo commented 10 years ago

Thanks, crystal clear.

Still annotating the sentence above, one of the types is as follows:

{"typeLabel":"Schools", "typeURI":"http://dbpedia.org/resource/Schools", "entityURI":"http://dbpedia.org/resource/Harvard_Law_School", ...}

Schools does not seem to be a DBpedia type (as far as I've seen from DBpedia 3.7+) and, indeed, the typeURI points to a DBpedia resource. It might be a bug somewhere in the pipeline. Am I mistaken?

m1ci commented 10 years ago

Yes, "Schools" is not a DBpedia Ontology type. However, the types returned by THD can come from either the DBpedia instances namespace or the DBpedia Ontology namespace. If you need only DBpedia Ontology types, you can restrict the output by setting the query parameter "types_filter=dbo"; use "types_filter=dbinstance" for types given as DBpedia instances, or "types_filter=all" for both ;)
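
For clients that want to double-check the namespace on their side, a minimal sketch could look like the following. It assumes each type is a dict with a `typeURI` field as in the JSON snippet quoted earlier in the thread; the function name and the second sample entry are hypothetical, and this is not part of the THD API.

```python
DBO_PREFIX = "http://dbpedia.org/ontology/"
DBR_PREFIX = "http://dbpedia.org/resource/"


def filter_types(types: list[dict], keep: str = "dbo") -> list[dict]:
    """Client-side counterpart of types_filter: keep 'dbo', 'dbinstance' or 'all' types."""
    if keep == "all":
        return list(types)
    prefix = {"dbo": DBO_PREFIX, "dbinstance": DBR_PREFIX}[keep]
    return [t for t in types if t.get("typeURI", "").startswith(prefix)]


types = [
    {"typeLabel": "Schools", "typeURI": "http://dbpedia.org/resource/Schools"},
    # Hypothetical ontology-typed entry, added only to show the contrast.
    {"typeLabel": "Agent", "typeURI": "http://dbpedia.org/ontology/Agent"},
]

print([t["typeLabel"] for t in filter_types(types, keep="dbo")])
```

An unknown `keep` value raises `KeyError`, which seems preferable to silently returning everything.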

giusepperizzo commented 10 years ago

Well this is what I tried before.

Click and play here (before s/KEY/yourkey). Something wrong?

curl -v "http://ner.vse.cz/thd/api/v1/extraction?apikey=KEY&format=json&provenance=thd,dbpedia&priority_entity_linking=true&entity_type=all&types_filter=dbo" -d "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review. "

m1ci commented 10 years ago

I see. It's a bug. I will look into it. Thanks for reporting!

m1ci commented 10 years ago

The bug is fixed. Please check. BTW, you are using the old version (v1) of the API. Please switch to version v2. The endpoint of the new API is: http://ner.vse.cz/thd/api/v2/extraction

rtroncy commented 10 years ago

AFAIK, we did switch to the version 2 of the API months ago! @jluisred can you please confirm since you did the switch? Please, don't close this issue until it is really solved, i.e. solved in the LinkedTV platform, so also in NERD.

jluisred commented 10 years ago

Yes, NERD switched to THD v2 some time ago, the former was just an example from @giusepperizzo . We'll propagate the change asap.

giusepperizzo commented 10 years ago

Just as a clarification: NERD points to https://entityclassifier.eu/thd/api/v2/extraction, which, as far as I can see from the traceroute, resolves to 146.102.167.46, the same IP address as http://ner.vse.cz/thd/api/v2/extraction. I hope it's also the same virtual host. Isn't it? Hence, NERD does use v2. In the curl example above I erroneously used v1; my fault, sorry.

Anyway, now I reckon that there is something wrong in the v2. Pls check (before s/YOURKEY/key):

curl -v "https://entityclassifier.eu/thd/api/v2/extraction?apikey=YOURKEY&format=json&provenance=thd,dbpedia&priority_entity_linking=true&entity_type=all&types_filter=dbo" -d "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review. "

Connection #0 to host ner.vse.cz left intact

m1ci commented 10 years ago

@giusepperizzo please check again. Everything should be OK now.

rtroncy commented 10 years ago

Thanks @m1ci. We have updated the public instance of NERD as well which closes this bug. The implemented logic is: we use the classificationConfidence score to select the most appropriate type for NERD before linking to the NERD ontology.

Have a look at http://nerd.eurecom.fr/annotation/1521982 which has been analyzed with THD. "Pentagon" and "Turkish" are two surface forms you have extracted without providing type or disambiguation URI. The most surprising thing for me is the surface form "British Foreign Secretary Philip Hammond" that you disambiguate against http://dbpedia.org/resource/William_Hague who was the previous Foreign Secretary while the current one is http://dbpedia.org/resource/Philip_Hammond! Do you know why?

m1ci commented 10 years ago

> Thanks @m1ci. We have updated the public instance of NERD as well which closes this bug.

Happy to hear this.

> The implemented logic is: we use the classificationConfidence score to select the most appropriate type for NERD before linking to the NERD ontology.

OK. However, I think a better option is to go with the "most specific" in the DBpedia Ontology. But, maybe this is not what you want to achieve.

> "Pentagon" and "Turkish" are two surface forms you have extracted without providing type or disambiguation URI.

In the near future we will integrate a more efficient entity linking approach, which we are evaluating at TAC'14. I just checked, and with this new approach "Pentagon" should be linked to http://dbpedia.org/resource/The_Pentagon and "Kurdish" to http://dbpedia.org/page/Kurdish_people

> The most surprising thing for me is the surface form "British Foreign Secretary Philip Hammond" that you disambiguate against http://dbpedia.org/resource/William_Hague who was the previous Foreign Secretary while the current one is http://dbpedia.org/resource/Philip_Hammond! Do you know why?

I believe the difference is due to the different linking approaches used in the two cases. Some time ago we added a new feature to THD that lets clients choose the entity linking method. THD currently supports two linking methods (see the linking_method parameter at http://entityclassifier.eu/thd/docs/api/v2/). In the past only "Wikipedia Search" based linking was available; now you can also use "Lucene" based linking. I believe NERD does not explicitly specify the linking method, so Lucene is used by default, which unfortunately disambiguates this surface form incorrectly. If you switch to "Wikipedia Search" based linking, the surface form is disambiguated correctly. According to our observations, "Lucene" based linking performs better, but, you know, there will always be cases where one linking method is worse than the other :) This should explain the situation.
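
To make the linking method explicit rather than relying on the default, a client could add linking_method to the request URL. The sketch below only builds the URL; it reuses the query parameters from the curl examples earlier in the thread. The accepted values of linking_method are listed in the API documentation and are not reproduced here, so the value is passed through as given (the placeholder string is hypothetical). The function name is also an assumption of this example.

```python
from typing import Optional
from urllib.parse import urlencode

ENDPOINT = "https://entityclassifier.eu/thd/api/v2/extraction"


def build_extraction_url(api_key: str,
                         linking_method: Optional[str] = None,
                         types_filter: str = "dbo") -> str:
    """Build a THD v2 extraction URL; linking_method is added only when given."""
    params = {
        "apikey": api_key,
        "format": "json",
        "provenance": "thd,dbpedia",
        "priority_entity_linking": "true",
        "entity_type": "all",
        "types_filter": types_filter,
    }
    if linking_method is not None:
        # Value must be one of those listed in the THD API docs (not checked here).
        params["linking_method"] = linking_method
    return ENDPOINT + "?" + urlencode(params)


# Placeholder value; replace with a method name taken from the API documentation.
print(build_extraction_url("YOURKEY", linking_method="METHOD_FROM_DOCS"))
```

Leaving linking_method out reproduces the current situation, where the server-side default applies.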

giusepperizzo commented 10 years ago

> The implemented logic is: we use the classificationConfidence score to select the most appropriate type for NERD before linking to the NERD ontology.

> OK. However, I think a better option is to go with the "most specific" in the DBpedia Ontology. But, maybe this is not what you want to achieve.

We favor precision in typing, which is why we sort the types by their classificationConfidence score. When several types share the same highest classificationConfidence score (which occurs pretty often), we select the most specific type.
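
The selection rule just described can be sketched roughly as follows. This is not the NERD code: the function name, the `depth_of` callback (any function that returns a larger number for a more specific type), and the sample data are assumptions for illustration, and classificationConfidence is assumed to arrive as a number.

```python
from typing import Callable


def select_type(types: list[dict], depth_of: Callable[[str], int]) -> dict:
    """Pick the type with the highest classificationConfidence; break ties by specificity."""
    best = max(t["classificationConfidence"] for t in types)
    # Exact comparison is intended: tied scores come from the same response.
    tied = [t for t in types if t["classificationConfidence"] == best]
    return max(tied, key=lambda t: depth_of(t["typeURI"]))


# Hypothetical depths in the DBpedia Ontology hierarchy (larger = more specific).
depths = {
    "http://dbpedia.org/ontology/Agent": 0,
    "http://dbpedia.org/ontology/School": 3,
}

# Made-up sample modelled on the "Harvard Law School" case from the thread.
sample_types = [
    {"typeLabel": "Agent", "typeURI": "http://dbpedia.org/ontology/Agent",
     "classificationConfidence": 0.857},
    {"typeLabel": "School", "typeURI": "http://dbpedia.org/ontology/School",
     "classificationConfidence": 0.857},
    {"typeLabel": "Place", "typeURI": "http://dbpedia.org/ontology/Place",
     "classificationConfidence": 0.4},
]

chosen = select_type(sample_types, lambda uri: depths.get(uri, -1))
print(chosen["typeLabel"])
```

Because the order of types in the response is not meaningful, the result should not depend on it; with this rule it only would if two tied types also had the same depth.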

About the linking, yep I confirm that we use LuceneSearch.

Thanks for your answers and thanks for your work.

linkedtv / wp2

THD: need for getting a relevance score for the detected entities #20