kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Add an option to retrieve a text only wikidata definition from entity ? #146

Open Lucaterre opened 2 years ago

Lucaterre commented 2 years ago

Hello @kermitt2,

I am leaving this feature proposal here. Sorry in advance if I use the wrong terminology (preprocess, clean, etc.).

Currently, the kb/concept/ endpoint returns the concept definition with MediaWiki-style markup.

Output example for the concept "Victor Hugo":

'''''' (; 26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the [[Romanticism|Romantic movement]]. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels '''', 1862, and ''[[The Hunchback of Notre-Dame]]'', 1831. In France, Hugo is known primarily for his poetry collections, such as '''' (''The Contemplations'') and '''' (''The Legend of the Ages'').

A definition without this markup would look like this, for example (cf. https://en.wikipedia.org/wiki/Victor_Hugo):

Victor-Marie Hugo (26 February 1802 – 22 May 1885) was a French poet, novelist, and dramatist of the Romantic movement. Hugo is considered to be one of the greatest and best-known French writers. Outside of France, his most famous works are the novels Les Misérables, 1862, and The Hunchback of Notre-Dame, 1831. In France, Hugo is known primarily for his poetry collections, such as The Contemplations and The Legend of the Ages.

I don't know whether this is complicated to implement, but it could be done in two different ways:

1) The user can choose to retrieve a "clean" definition by adding an optional parameter to the kb/concept endpoint, for example "raw":"true" or "clean":"true".

2) The response includes both a "definition_raw" key (with MediaWiki markup) and a "definition_clean" key (without markup).

I think this could be useful for people who build additional features on entity fields, here the definition, without having to add their own text preprocessing function.
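For illustration, the kind of client-side preprocessing function this proposal would make unnecessary might look like the sketch below. It is a hypothetical helper, not part of entity-fishing, and it only handles the markup visible in the example above (wiki links and bold/italic apostrophes), not templates, references, or tables.

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Remove a small subset of MediaWiki markup (hypothetical helper).

    Handles only [[target|label]] and [[target]] links and runs of
    apostrophes used for bold/italic.
    """
    # [[target|label]] -> label
    text = re.sub(r"\[\[[^\[\]|]*\|([^\[\]]*)\]\]", r"\1", text)
    # [[target]] -> target
    text = re.sub(r"\[\[([^\[\]|]*)\]\]", r"\1", text)
    # Two or more consecutive apostrophes are italic/bold markers;
    # single apostrophes (possessives, elisions) are left alone.
    text = re.sub(r"'{2,}", "", text)
    return text


sample = (
    "dramatist of the [[Romanticism|Romantic movement]] and "
    "''[[The Hunchback of Notre-Dame]]'', 1831."
)
print(strip_wiki_markup(sample))
```

A function like this cannot restore text that is already missing from the markup, which is one reason a server-side plain-text option is preferable.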

What do you think?

Regards, Lucas Terriel

kermitt2 commented 2 years ago

Hello @Lucaterre

Thanks for the issue.

Yes, we can do this: the definition field would be returned either as plain text or in MediaWiki format, selected by a query parameter. The plain-text conversion method already exists:

https://github.com/kermitt2/entity-fishing/blob/master/src/main/java/com/scienceminer/nerd/utilities/mediaWiki/MediaWikiParser.java#L117

Lucaterre commented 2 years ago

Thank you for your answer! Oh OK, nice that a method is already there :)

That is the idea indeed. To clarify my issue a little more (though I think that's what you said):

Consider an optional query parameter "plain_text" (maybe not the best parameter name), set to "false" by default, in which case the response returns the definition in MediaWiki format.

Now if we imagine a request, such as:

$ curl 'https://cloud.science-miner.com/nerd/service/kb/concept/Q90?lang=fr&plain_text=true'

the response would return a plain-text definition instead of the definition in MediaWiki format.

I don't know whether there is any interest in keeping both definitions (plain text and MediaWiki) in the same response; it depends on the use case. (That's an open question.)

kermitt2 commented 2 years ago

What about something like this:

$ curl 'https://cloud.science-miner.com/nerd/service/kb/concept/Q90?lang=fr&definition=mediawiki'

The parameter name definition is more precise about the expected behavior, and a non-boolean value leaves room for growth: it could be mediawiki (default), plain_text, or maybe another format in the future. Maybe definition_format rather than definition?

Lucaterre commented 2 years ago

I agree: definition_format seems fine and is more explicit as a parameter name than definition alone (which is confusing: the user may think the parameter makes retrieving the definition optional).

OK with mediawiki as the default value of the parameter (this seems natural, since it is the original format of the definition).

Just curious, what other "cross-mediawiki" formats do you think of in the future? HTML, Markdown for example?

kermitt2 commented 2 years ago

> Just curious, what other "cross-mediawiki" formats do you think of in the future? HTML, Markdown for example?

Yes, I was thinking of these two possible formats.
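To make the design choice concrete, here is a minimal sketch of a non-boolean, extensible format parameter with a default. The names `DefinitionFormat` and `parse_definition_format` are hypothetical, the values are the ones floated in this thread, and none of this is the project's actual code.

```python
from enum import Enum
from typing import Optional


class DefinitionFormat(Enum):
    """Formats discussed in this thread (illustrative values only)."""
    MEDIAWIKI = "mediawiki"
    PLAIN_TEXT = "plain_text"
    # Further formats (for example HTML or Markdown) could be added
    # here later without changing the parameter's name or type.


def parse_definition_format(value: Optional[str]) -> DefinitionFormat:
    """Map a raw query-parameter value to a format, defaulting to MediaWiki."""
    if value is None or value.strip() == "":
        return DefinitionFormat.MEDIAWIKI
    try:
        return DefinitionFormat(value.strip().lower())
    except ValueError:
        allowed = ", ".join(f.value for f in DefinitionFormat)
        raise ValueError(
            f"unknown definition_format {value!r}; expected one of: {allowed}"
        ) from None


print(parse_definition_format(None))
print(parse_definition_format("plain_text"))
```

Compared with a boolean such as plain_text=true, adding a third format here only means adding one enum member.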

kermitt2 commented 1 year ago

This is implemented with 2557847d086181fc900db1aa9182b1f1f19504cf

The REST API parameter is definitionFormat, with value Mediawiki (default) or PlainText (as requested in this issue). I am using Java naming conventions for the parameter, because this is a Java project.

Example:

curl -X GET http://localhost:8090/service/kb/concept/Q190712?definitionFormat=PlainText
{
  "rawName" : "First Battle of the Marne",
  "preferredTerm" : "First Battle of the Marne",
  "confidence_score":0,
  "wikipediaExternalRef":171325,
  "wikidataId" : "Q190712",
  "definitions" : [
    {
      "definition" : "The First Battle of the Marne was a battle of the First World War fought from 5 to 12 September 1914. It was fought in a collection of skirmishes around the Marne River Valley. It resulted in an Entente victory against the German armies in the west. The battle was the culmination of the Retreat from Mons and pursuit of the Franco-British armies which followed the Battle of the Frontiers in August and reached the eastern outskirts of Paris.",
      "source" : "wikipedia-en",
      "lang" : "en"
    }
  ]
  ...
}

https://nerd.readthedocs.io/en/latest/restAPI.html#get-kb-concept-id
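For client code, a minimal Python sketch of building such a request URL and reading the definitions out of the response could look like this. The helper names `concept_url` and `extract_definitions` are hypothetical; the base URL, path, parameter name, and response fields are taken from the example above. Combining lang with definitionFormat is an assumption based on the earlier requests in this thread.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode


def concept_url(base_url: str, concept_id: str,
                definition_format: Optional[str] = None,
                lang: Optional[str] = None) -> str:
    """Build a kb/concept URL; query parameters are joined with '&'."""
    params = {}
    if lang is not None:
        params["lang"] = lang
    if definition_format is not None:
        params["definitionFormat"] = definition_format
    url = f"{base_url.rstrip('/')}/kb/concept/{concept_id}"
    return f"{url}?{urlencode(params)}" if params else url


def extract_definitions(response_body: str) -> List[str]:
    """Return the 'definition' strings from a kb/concept JSON response."""
    data = json.loads(response_body)
    return [d["definition"] for d in data.get("definitions", [])
            if "definition" in d]


# Assumes a local instance as in the curl example above.
print(concept_url("http://localhost:8090/service", "Q190712", "PlainText"))
```

The returned URL can be passed to any HTTP client, and the response text to `extract_definitions`.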

kermitt2 commented 1 year ago

I also added html as a format:

curl -X GET http://localhost:8090/service/kb/concept/Q190712?definitionFormat=html
{
  "rawName" : "First Battle of the Marne",
  "preferredTerm" : "First Battle of the Marne",
  "confidence_score":0,
  "wikipediaExternalRef":171325,
  "wikidataId" : "Q190712",
  "definitions" : [
    {
      "definition" : "<p>The <b>First Battle of the Marne</b> was a battle of the <a href=\"https://en.wikipedia.org/wiki/First_World_War\" title=\"First World War\">First World War</a> fought from 5 to 12 September 1914. It was fought in a collection of skirmishes around the Marne River Valley. It resulted in an <a href=\"https://en.wikipedia.org/wiki/Allies_of_World_War_I\" title=\"Allies of World War I\">Entente</a> victory against the <a href=\"https://en.wikipedia.org/wiki/German_Army_(German_Empire)\" title=\"German Army (German Empire)\">German</a> armies in the west. The battle was the culmination of the <a href=\"https://en.wikipedia.org/wiki/Retreat_from_Mons\" title=\"Retreat from Mons\">Retreat from Mons</a> and pursuit of the Franco-British armies which followed the <a href=\"https://en.wikipedia.org/wiki/Battle_of_the_Frontiers\" title=\"Battle of the Frontiers\">Battle of the Frontiers</a> in August and reached the eastern outskirts of Paris.<p>",
      "source" : "wikipedia-en",
      "lang" : "en"
    }
  ]
  ...
}