String2vocabulary: fuzzy matching could improve some of the linking

silknow / converter

SILKNOW converter that harmonizes all museum metadata records into the common SILKNOW ontology model (based on CIDOC-CRM)

Apache License 2.0


String2vocabulary: fuzzy matching could improve some of the linking #30

Closed tschleider closed 3 years ago

tschleider commented 4 years ago

In some cases, string2vocabulary should be able to return multiple candidate links for a string, not just a single replacement.

rtroncy commented 3 years ago

I see the need, but I'm not sure this should be included in string2vocabulary. What do you think, @pasqLisena?

pasqLisena commented 3 years ago

Indeed, this is not provided by String2Vocabulary. You want something like this, right? https://overture.doremus.org/api/vocabulary/mop?q=pianofrte That way you receive some results even if the query contains typos (as in "pianofrte").

rtroncy commented 3 years ago

Yes, something like this! This is typically what a reconciliation API does; see the W3C Community Group, which is relatively active, and the current specification.

In the case of the API you set up for DOREMUS, how is the confidence score established? I can imagine a Lucene/Solr-like indexing of the thesaurus.

pasqLisena commented 3 years ago

> In the case of the API you set up for DOREMUS, how is the confidence score established? I can imagine a Lucene/Solr-like indexing of the thesaurus.

Much simpler! There are not that many labels, so they can be cached in memory (we are talking about vocabularies only). A Levenshtein distance is then used to rank the proposed results, with some custom adjustments (prefLabels score higher than altLabels, a label in the requested language scores higher, etc.).

https://github.com/DOREMUS-ANR/overture/blob/e77a0561d566cd489aebfb374b046183be477cf1/server/api/vocabulary/lemma.js#L20-L36
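The scoring idea described above can be sketched roughly as follows. This is not the code from `lemma.js`, only a minimal illustration in Python: the `Label` class, the `search` function, the example URI scheme and the two weighting factors are all hypothetical, and the weights in particular are picked for readability rather than taken from the linked implementation.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


@dataclass(frozen=True)
class Label:
    uri: str
    text: str
    lang: str
    preferred: bool  # True for a prefLabel, False for an altLabel


# Hypothetical weights, for illustration only.
ALT_LABEL_FACTOR = 0.9
OTHER_LANG_FACTOR = 0.9


def score(query: str, label: Label, lang: str) -> float:
    """String similarity, penalised for altLabels and for other languages."""
    value = similarity(query, label.text)
    if not label.preferred:
        value *= ALT_LABEL_FACTOR
    if label.lang != lang:
        value *= OTHER_LANG_FACTOR
    return value


def search(query: str, labels: list, lang: str = "en") -> list:
    """Return (uri, score) pairs, keeping the best label per concept."""
    best = {}
    for label in labels:
        value = score(query, label, lang)
        if value > best.get(label.uri, 0.0):
            best[label.uri] = value
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


# In-memory "cache" of labels; the URIs are made up for this example.
LABELS = [
    Label("http://example.org/mop/piano", "pianoforte", "en", True),
    Label("http://example.org/mop/piano", "piano", "en", False),
    Label("http://example.org/mop/flute", "flute", "en", True),
]

results = search("pianofrte", LABELS)
```

With this data, the misspelled query "pianofrte" is one edit away from "pianoforte", so the piano concept ranks first even though no label matches exactly.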

I may try to implement similar logic in string2vocabulary. Bear in mind that this can be error-prone, for example if two material labels are similar, or if a material and a technique have similar labels.

rtroncy commented 3 years ago

So the threshold used with the Levenshtein distance is hardcoded (0.6 / 0.7?). Shouldn't it be a parameter?

This is related anyway to https://github.com/silknow/api/issues/1

pasqLisena commented 3 years ago

Indeed, this can be a parameter. In any case, those values are the best ones according to our experiments, which prioritise precision over recall.
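To make the trade-off concrete, here is a minimal Python sketch of a matcher whose threshold is a parameter rather than a constant. It is an assumption-laden illustration, not string2vocabulary's behaviour: the `candidates` function, its `threshold` and `limit` parameters, the default of 0.7 and the example vocabulary with its URIs are all hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Edit distance, written recursively for brevity (fine for short labels)."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    if a[0] == b[0]:
        return levenshtein(a[1:], b[1:])
    return 1 + min(levenshtein(a[1:], b),       # delete from a
                   levenshtein(a, b[1:]),       # insert into a
                   levenshtein(a[1:], b[1:]))   # substitute


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def candidates(query, vocabulary, threshold=0.7, limit=None):
    """Return (uri, score) pairs scoring at least `threshold`, best first."""
    scored = [(uri, similarity(query, label)) for label, uri in vocabulary.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if limit is None else kept[:limit]


# Hypothetical vocabulary: label -> concept URI.
VOCABULARY = {
    "damask": "http://example.org/vocab/damask",
    "damassé": "http://example.org/vocab/damasse",
    "velvet": "http://example.org/vocab/velvet",
}

loose = candidates("damasc", VOCABULARY, threshold=0.7)
strict = candidates("damasc", VOCABULARY, threshold=0.8)
```

Here the looser threshold returns two candidates for "damasc" (both "damask" and "damassé"), while the stricter one keeps only "damask". A higher threshold favours precision, and a lower one favours recall at the cost of ambiguous matches like the material/technique collisions mentioned earlier in the thread.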