full-text matching

I think that in many cases it will be useful to match rows by their textual content, using it as a general context for the entities in the KB.

NLP relies on this kind of context all the time, e.g. to tell "bank" as a financial institution in an article about finance from "bank" as a river feature in an article about nature or geography. Approaches include TF-IDF, word embeddings, etc.
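To make the "bank" example concrete, here is a minimal sketch of context-based ranking with TF-IDF and cosine similarity, using only the Python standard library. The candidate descriptions, the row text, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`) are all made up for illustration; a real recon service would more likely use a search index or embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(row_text, candidates):
    """Rank candidate entities by similarity of their text to the row text."""
    names = list(candidates)
    vectors = tfidf_vectors([candidates[n] for n in names] + [row_text])
    query = vectors[-1]
    scored = [(cosine(query, v), name) for name, v in zip(names, vectors)]
    return sorted(scored, reverse=True)


# Hypothetical candidate texts, standing in for what a KB could provide.
candidates = {
    "bank (financial institution)": "bank institution that accepts deposits and makes loans money finance",
    "bank (river feature)": "bank land alongside a river or stream water geography",
}
row_text = "The river flooded and the water covered the bank near the village"

for score, name in rank(row_text, candidates):
    print(f"{score:.3f}  {name}")
```

Both candidates share the name "bank", so the name alone cannot separate them; the surrounding words ("river", "water") are what push the river sense to the top in this toy setup.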

Use cases:

VIAF only has the fields "name, birth, death", but alternative names often include profession and occupation
Wikidata (WD) recon could leverage Wikipedia abstracts (the text before the first heading), which are available in DBpedia (DBP)

Implementation:

The recon server should collect a specified "text molecule" for each entity by navigating specified properties and property paths, and expose it as a property "full text".
OntoRefine should allow the user to select several (or all) columns and submit their text together.
Maybe we should allow for separation of text by language, e.g. "full text (en)" vs "full text (bg)".
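
The three points above could be sketched roughly as follows. This is only an illustration under several assumptions: the toy `KB` layout, the `PATHS` configuration, and the helpers `follow`, `text_molecule` and `row_to_query` are hypothetical, and using "full text (en)" as a property id in the query is the proposal of this issue, not something existing services support. The query shape (`query` plus `properties` with `pid` and `v`) follows the usual reconciliation query layout.

```python
from collections import defaultdict

# Hypothetical toy KB: entity id -> property -> list of values.
# A value is either ("literal", text, lang) or ("ref", entity_id).
KB = {
    "e1": {
        "name": [("literal", "Ivan Vazov", "en"), ("literal", "Иван Вазов", "bg")],
        "occupation": [("ref", "e2")],
        "abstract": [("literal", "Bulgarian poet and novelist.", "en")],
    },
    "e2": {"label": [("literal", "poet", "en"), ("literal", "поет", "bg")]},
}

# Hypothetical configuration: which property paths make up the text molecule.
PATHS = [["name"], ["abstract"], ["occupation", "label"]]


def follow(kb, entity_id, path):
    """Yield (text, lang) for literals reached from entity_id along path."""
    if not path:
        return
    head, rest = path[0], path[1:]
    for value in kb.get(entity_id, {}).get(head, []):
        if value[0] == "literal":
            if not rest:  # only collect literals at the end of the path
                yield value[1], value[2]
        elif rest:  # a reference: keep navigating from the linked entity
            yield from follow(kb, value[1], rest)


def text_molecule(kb, entity_id, paths):
    """Collect the entity's text per language, keyed as 'full text (lang)'."""
    by_lang = defaultdict(list)
    for path in paths:
        for text, lang in follow(kb, entity_id, path):
            by_lang[lang].append(text)
    return {f"full text ({lang})": " ".join(parts) for lang, parts in by_lang.items()}


def row_to_query(row, name_column, text_columns, lang="en"):
    """Client side: submit the text of the selected columns together."""
    text = " ".join(row[c] for c in text_columns if row.get(c))
    return {
        "query": row[name_column],
        "properties": [{"pid": f"full text ({lang})", "v": text}],
    }


print(text_molecule(KB, "e1", PATHS))
row = {"author": "Vazov", "role": "poet", "note": "wrote novels"}
print(row_to_query(row, "author", ["role", "note"]))
```

The server would then compare the submitted `v` text against each candidate's stored molecule for the same language, with whatever similarity measure it prefers.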

reconciliation-api/specs issue #57