Add an additional property match context to the features

thadguidry commented 1 year ago

It would be useful for services to provide a match context, and for clients to be able to filter on it, in order to have more context on a candidate's score or condition. This should be a simple string, to ease development pain for services that need something simple to start with.
Some services might even want to use different feature_views depending on a serviceVersion or even a schemaSpace? But I'm not sure about that case myself.
A standard client could also query 2 alternative services from one provider (different service URLs, with a service manifest at each URL), which might produce different contexts and scores for the same data, types, and properties supplied in the query.

Regardless, it is sometimes useful for clients to know the match context of candidates (or of features of candidates) from a recon process against a service, if the service decides to provide a bit more context or information about a match score, an entity itself, or the types or properties that were used or not used, etc.

A match context provides a simple means of returning extra metadata or subdata about a match overall, and not necessarily about an individual feature, although it could also say much about that as well since the value type of context would simply be a String.

Example 1:

{
  "id": "1117582299",
  "name": "Urbaniak, Hans-Eberhard",
  "score": 85.71888,
  "context": "used v1.6 with weighting based on KL divergence",
  "features": [
    {
      "id": "name_tfidf",
      "name": "TF-IDF score for the entity name",
      "value": 378.239
    },

Example 2:

{
  "id": "12345",
  "name": "generic drug label",
  "score": 55,
  "features": [
    {
      "id": "name_generic",
      "name": "baseline score for the label",
      "value": 133,
      "context": "non-LSI, 1 matched broader type: generic"
    },

I envision that context might be most useful at the candidate level in the first example, but perhaps also in features as in Example 2?
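As a rough sketch of how a client might use this, the Python below collects context strings from both levels and filters candidates on them. The context field is only the proposal from this issue, not part of the current spec, and the helper names (contexts, filter_by_context) and the substring matching are assumptions made for illustration.

```python
from typing import Any

# Candidates modelled on Example 1 and Example 2 above.
# "context" is the field proposed in this issue, not part of the current spec.
candidates: list[dict[str, Any]] = [
    {
        "id": "1117582299",
        "name": "Urbaniak, Hans-Eberhard",
        "score": 85.71888,
        "context": "used v1.6 with weighting based on KL divergence",
        "features": [
            {"id": "name_tfidf", "name": "TF-IDF score for the entity name", "value": 378.239},
        ],
    },
    {
        "id": "12345",
        "name": "generic drug label",
        "score": 55,
        "features": [
            {
                "id": "name_generic",
                "name": "baseline score for the label",
                "value": 133,
                "context": "non-LSI, 1 matched broader type: generic",
            },
        ],
    },
]


def contexts(candidate: dict[str, Any]) -> list[str]:
    """Collect every context string on a candidate and on its features."""
    found = []
    if "context" in candidate:
        found.append(candidate["context"])
    for feature in candidate.get("features", []):
        if "context" in feature:
            found.append(feature["context"])
    return found


def filter_by_context(cands: list[dict[str, Any]], needle: str) -> list[dict[str, Any]]:
    """Keep candidates where any context string contains `needle` (hypothetical helper)."""
    return [c for c in cands if any(needle in ctx for ctx in contexts(c))]
```

Because the context is a plain string, the client side stays this simple whether the service attaches it to the candidate, to a feature, or to both.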

@wetneb let me know which parts above are unclear and I can update this. This is simple for a reason, and does not directly address larger client-service feedback loops, but a step in that direction for broader applicability and uptake by service providers. (hopefully)

fsteeg commented 1 year ago

This came up in our discussion of SSSOM, a spec for ontology mappings, which in addition to the plain mapping uses a predicate_id e.g. skos:exactMatch and a mapping_justification e.g. semapv:LexicalMatching, see this example.

Such a mapping_justification can conceptually replace a score, so this is kind of related to #127. And something like the predicate_id (kind of a match_relation), e.g. skos:exactMatch, could be provided in the client, when adding a property, maybe based on a suggest service, where a service could provide the predicates that make sense for the specific property, like locatedIn for geo-fields etc., something fuzzy for dates (see #114), etc.?

Both e.g. skos:exactMatch and semapv:LexicalMatching also provide "more context on a candidate's score or condition" (quoting from the issue description here), so I'm wondering if there could be some unified approach to address this? Like, could we allow the match field to be a string, which can contain details like semapv:LexicalMatching, skos:exactMatch; semapv:LexicalMatching or non-LSI, 1 matched broader type: generic? We'd probably need some specific use cases to sort this out.

thadguidry commented 1 year ago

So, I have given this some more thought. I think we can add 1 or 2 more properties to features. The reasoning is that features: value is either Boolean or Numerical in the current draft proposal. That is indeed useful, but as I mentioned in my original comment above, there is no String or text property with which a service can respond with a reason/description. So some options:

We could expand feature: value to also allow String.
- But this would then remove a pretty important utility on clients for doing a sort only on positive Numerical values and filtering out candidates that are not above some minimum client criteria.
- Clients would have to be adapted to determine if feature: value is a String (does it have quotes or not), but likely not a big deal.
- String would be overloading the Type too much and mixing semantics with that of a feature: score/value, because that's actually how it's currently used: as a score ("value": 10.329) or no score ("value": false), but we simplified with a label of value and not score. That decision might need to be looked over again (maybe it should be relabeled as score instead?), or our current draft needs some love to explain much more about the semantics/meaning of the features fields, such as: what does "value": false really mean for one feature in a response? Does it mean "feature not matched, or not matched high enough to provide some Numerical value"? Does it mean "feature not even found, or an underlying service lookup error for the feature"?
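To make the client-side cost of that first option concrete, here is a minimal Python sketch of what a client would need if feature: value could be Boolean, Numerical, or String. The function names and the sample features are hypothetical; the only fixed assumption is that the response has already been parsed from JSON, so the "quotes or not" question becomes a type check.

```python
import json
from typing import Any


def value_kind(value: Any) -> str:
    """Classify a parsed feature value as 'boolean', 'number', or 'string'."""
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported feature value: {value!r}")


def sort_by_numeric_value(features: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the numerically valued features, highest value first."""
    numeric = [f for f in features if value_kind(f["value"]) == "number"]
    return sorted(numeric, key=lambda f: f["value"], reverse=True)


# Hypothetical features mixing all three value types.
features = json.loads(
    '[{"id": "a", "value": 10.329},'
    ' {"id": "b", "value": false},'
    ' {"id": "c", "value": "matched on alias"},'
    ' {"id": "d", "value": 133}]'
)
ranked = sort_by_numeric_value(features)
```

The type check itself is cheap, but every sort or threshold filter in the client has to run it first, which is the utility loss described in the first bullet.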

In light of that last bullet point, I still stand by the need for a context (or we could label it reason): a text string that service providers can likely build easily, giving some context on why a "value": false is produced, or why "value": 0 or a negative value, etc.

@fsteeg's idea of additionally giving a justification for a feature match is along the same lines as my idea of providing a context or reason. So I think the discussion should now be: what Types should be allowed for context/reason? If it is lightly substructured, what would be the minimum fields necessary for clients to filter upon to surface only the best candidates? That is, those candidates that matched all features, i.e. that have no features with a false, 0, or negative value for feature: value?
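A minimal Python sketch of that filter, assuming the reading that a feature counts as unmatched when its value is false, 0, or negative, and assuming a plain string reason field on features. Both the field name and the sample candidates are hypothetical; neither is in the current draft.

```python
from typing import Any


def feature_matched(feature: dict[str, Any]) -> bool:
    """Treat false, 0, and negative values as 'not matched' (assumed reading)."""
    value = feature["value"]
    if isinstance(value, bool):
        return value
    return value > 0


def fully_matched(candidates: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the candidates whose features all matched."""
    return [
        c for c in candidates
        if all(feature_matched(f) for f in c.get("features", []))
    ]


def unmatched_reasons(candidate: dict[str, Any]) -> dict[str, str]:
    """Map each unmatched feature id to its reason string (empty if the service gave none)."""
    return {
        f["id"]: f.get("reason", "")
        for f in candidate.get("features", [])
        if not feature_matched(f)
    }


# Hypothetical candidates; "reason" is the field proposed in this thread.
candidates = [
    {"id": "A", "features": [{"id": "name", "value": 12.5},
                             {"id": "type", "value": True}]},
    {"id": "B", "features": [{"id": "name", "value": 3.0},
                             {"id": "birth_date", "value": False,
                              "reason": "no birth date on record"}]},
    {"id": "C", "features": [{"id": "name", "value": 0,
                              "reason": "no token overlap"}]},
]
best = fully_matched(candidates)
```

Note that a candidate with no features at all passes this filter, since all() over an empty list is true; a real client would need to decide whether that is the desired behaviour.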

His other idea is that of surfacing candidates that might match a predicate or semantic triple (SPO - subject, predicate, object). I like this idea, but it's already exposed directly through features where:

subject = reconciliation candidate entity
predicate = features: id and features: name
object = features: value

s:Bob p:is/age o:35

What is missing is a context/reason.

s:Bob p:is/age o:35 reason: because he was born 35 years ago on Dec. 8, 1987
s:Bob p:is/age o:35 reason: because he is 1 year older than his brother who is claimed to be 34
s:France p:partOf o:EU reason: they signed legislative agreement XYZ in 1958, whatever
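The mapping can be sketched in Python as follows. The Statement type and the statements helper are hypothetical names, the predicate uses only features: id for brevity, and reason is again the proposed field, treated here as an optional plain string.

```python
from typing import Any, NamedTuple, Optional


class Statement(NamedTuple):
    """One subject/predicate/object statement read off a candidate feature."""
    subject: str            # reconciliation candidate entity
    predicate: str          # features: id
    obj: Any                # features: value
    reason: Optional[str]   # proposed field; None when the service gives none


def statements(candidate: dict[str, Any]) -> list[Statement]:
    """Turn each feature of a candidate into an SPO statement plus its reason."""
    return [
        Statement(candidate["id"], f["id"], f["value"], f.get("reason"))
        for f in candidate.get("features", [])
    ]


# Hypothetical candidate mirroring the Bob example.
bob = {
    "id": "Bob",
    "features": [
        {"id": "age", "value": 35,
         "reason": "because he was born 35 years ago on Dec. 8, 1987"},
        {"id": "name", "value": 12.5},
    ],
}
bob_statements = statements(bob)
```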

feature: reason should also allow for structure, so that it could provide SPO statements and not just a String Type. For example, you can imagine that the reason for France being part of the EU could be a set of semantic triples (or even quads if necessary). What that all might look like, with more examples in a real structured response, needs more thought and research from me. reason is 1 new field proposed, but let me drive a use case to see whether a 2nd field is really needed in addition or not. Stay tuned.

thadguidry commented 1 year ago

Quick thought on how Freebase did some of that... it had &output=

Match "blade runner" and output disambiguating data (set of known properties) from matches in the /film/film domain.

filter=(all name:"Blade Runner")
&output=(disambiguator:/film/film)

Find restaurants within 1000ft of the SF Ferry Building and output their geocode and their type of cuisine.

filter=(all type:restaurant (within radius:1000ft lon:-122.39 lat:37.7955))
&output=(geocode practitioner_of)

Match "san francisco" and return all data in the location domain about it that is accessible via the output parameter.

filter=(all name{full}:"San Francisco" type:/location/citytown)
&output=(all:/location)
&limit=1

https://developers.google.com/freebase/v1/search-output

So maybe that's another thing... allowing clients to output all or specific properties from candidates, and not only type, score, features, match? Here features is a set of matching criteria, but it has nothing to do with what might be output or requested additionally about candidates. Perhaps a new output field might be a good thing to add to reconciliation candidate responses? That would give clients more ways to determine or set their own match scoring algorithms or criteria however they want, instead of asking the service to use its own rules (and have to build them!).

tfmorris commented 1 year ago

While it's true that Freebase Search had this feature (and it was exposed in Freebase Suggest), Google Refine / OpenRefine never used it, as far as I'm aware.

@thadguidry Are you aware of other clients which made good use of this portion of the API(s)?

thadguidry commented 12 months ago

@tfmorris I am not aware

tfmorris commented 12 months ago

The introduction of SSSOM into the discussion (@fsteeg) makes me wonder if the reconciliation API is intended to support non-exact matches (e.g. broader/narrower/close). Historically, the goal has been exact matches only. Imperfect candidates may be returned, but by the time the user has reviewed and accepted a candidate it is, by definition, an exact match. I think solving for this use case should be the top priority since it's what the vast majority of users want to do.

fsteeg commented 12 months ago

Historically, the goal has been exact matches only.

I was thinking about these different match relations not on the entity level, but on a property level. Basically, for the entity to be an exact match, as you describe, we want properties to match in some specific way. Like: this 'Paris' here is (exactly) 'Paris, Texas', because its location 'isContainedIn' Texas.

reconciliation-api / specs

Add an additional property match context to the features #128