Open thadguidry opened 1 year ago
This came up in our discussion of SSSOM, a spec for ontology mappings, which in addition to the plain mapping uses a predicate_id (e.g. skos:exactMatch) and a mapping_justification (e.g. semapv:LexicalMatching), see this example.
Such a mapping_justification can conceptually replace a score, so this is kind of related to #127. And something like the predicate_id (kind of a match_relation), e.g. skos:exactMatch, could be provided in the client when adding a property, maybe based on a suggest service, where a service could provide the predicates that make sense for the specific property, like locatedIn for geo-fields, something fuzzy for dates (see #114), etc.?
Both skos:exactMatch and semapv:LexicalMatching also provide "more context on a candidate's score or condition" (quoting from the issue description here), so I'm wondering if there could be some unified approach to address this? Like, could we allow the match field to be a string, which can contain details like "semapv:LexicalMatching", "skos:exactMatch; semapv:LexicalMatching", or "non-LSI, 1 matched broader type: generic"? We'd probably need some specific use cases to sort this out.
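To make the string-valued match idea a bit more concrete, here is a minimal Python sketch of how a client might split such a string into its parts. The semicolon separator and the function name parse_match_details are assumptions for illustration only; nothing in the draft defines them, and I am assuming match stays a plain boolean otherwise.

```python
def parse_match_details(match):
    """Split a hypothetical string-valued `match` field into its parts.

    Assumes details are separated by semicolons, as in
    "skos:exactMatch; semapv:LexicalMatching". A boolean `match`
    carries no details, so it yields an empty list.
    """
    if isinstance(match, bool):
        return []
    return [part.strip() for part in match.split(";") if part.strip()]
```

A free-text value such as "non-LSI, 1 matched broader type: generic" comes back as a single detail, which shows why some agreed structure would be needed before clients could filter on it.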
So, I have given this some more thought. I think we can add 1 or 2 more properties to features. The reasoning is that features: value is either Boolean or Numerical in the current draft proposal. That is indeed useful, but as I mentioned in the original comment, it lacks a String or text property for a service to respond with a reason/description. So some options:

- Allow feature: value to also be a String.
- feature: value is a String (does it have quotes or not), but likely not a big deal.
- String would be overloading the Type too much and mixing semantics with that of a feature: score/value, because that's actually how it's currently used: as a score ("value": 10.329) or no score ("value": false). We just simplified with a label of value and not score. That decision might need to be looked over again (maybe it should be relabeled as score instead?), or our current draft needs some love to explain much more about the semantics/meaning of features fields, such as: what does "value": false really mean for one feature in a response? Does it mean "feature not matched, or not matched high enough to provide some Numerical value"? Does it mean "feature not even found, or an underlying service lookup error for the feature"?

In light of that last bullet point, I still stand by the need to provide a context (or we could label it as reason): a text string that service providers can likely build easily, to provide some context on why a "value": false is produced, or why "value": 0 or a negative value, etc.
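As a rough sketch of how a client could use such a text field, the Python below renders one feature as a readable line. The reason key is the hypothetical new field discussed here, and describe_feature is just an illustrative name; id and value are the existing features fields.

```python
def describe_feature(feature):
    """Return a human-readable line for one feature of a candidate.

    `reason` is the hypothetical optional text field discussed above;
    `id` and `value` are the existing `features` fields.
    """
    value = feature["value"]
    reason = feature.get("reason")
    # Check bool first: in Python, True and False are also ints.
    if isinstance(value, bool):
        status = "matched" if value else "no score"
    else:
        status = f"score {value}"
    line = f"{feature['id']}: {status}"
    return f"{line} ({reason})" if reason else line
```

Without a reason, "value": false can only be shown as "no score"; with one, the client can tell a lookup error from a genuine non-match.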
@fsteeg's idea of additionally giving justification for a feature match is along the same lines as my idea of providing context or reason.
So I think the discussion should now be... what Types should be allowed for context/reason? If substructured lightly, then what would be some minimum fields necessary that clients might filter upon to surface only the best candidates? Those candidates that matched all features, i.e. no features that have a false, 0, or negative value for feature: value?
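The filter described above can be sketched in a few lines of Python. This assumes each candidate is a dict with an optional features list whose entries carry a Boolean or Numerical value; the function names are made up for illustration.

```python
def feature_matched(value):
    """A feature counts as matched unless it is False, 0, or negative."""
    if isinstance(value, bool):
        return value
    return value > 0


def fully_matched(candidates):
    """Keep only candidates whose features all matched.

    A candidate without any features is kept, since nothing failed.
    """
    return [
        c for c in candidates
        if all(feature_matched(f["value"]) for f in c.get("features", []))
    ]
```

Note that this treats false, 0, and negative values the same way, which is exactly the ambiguity a context/reason field would help resolve.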
His other idea is that of surfacing candidates that might match a predicate or semantic triple (SPO - subject, predicate, object). I like this idea, but it's already exposed directly through features, where:
features: id and features: name
features: value
s:Bob p:is/age o:35
What is missing is a context/reason.
s:Bob p:is/age o:35 reason: because he was born 35 years ago on Dec. 8, 1987
s:Bob p:is/age o:35 reason: because he is 1 year older than his brother who is claimed to be 34
s:France p:partOf o:EU reason: they signed legislative agreement XYZ in 1958, whatever
feature: reason should also allow for structure and could also provide SPO statements, not just a String Type.
For example, you can imagine that reason for France being part of the EU could be a set of semantic triples (or even quads if necessary).
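One way a client might cope with reason being either a String or a set of triples is sketched below in Python. The triple shape ({"s": ..., "p": ..., "o": ...}) and the name reason_as_text are purely assumptions for illustration; no structure has been agreed on.

```python
def reason_as_text(reason):
    """Render a hypothetical `reason` field as plain text.

    Assumes `reason` is either a string or a list of triples shaped like
    {"s": ..., "p": ..., "o": ...}; this shape is illustrative only.
    """
    if isinstance(reason, str):
        return reason
    return "; ".join(f"{t['s']} {t['p']} {t['o']}" for t in reason)
```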
What that all might look like, with more examples in a real structured response, needs more thought and research on my part.
reason is 1 new field proposed, but let me drive a use case to see whether a 2nd field is really needed in addition.
Stay tuned.
Quick thought on how Freebase did some of that... it had &output=
Match "blade runner" and output disambiguating data (set of known properties) from matches in the /film/film domain.
filter=(all name:"Blade Runner")
&output=(disambiguator:/film/film)
Find restaurants within 1000ft of the SF Ferry Building and output their geocode and their type of cuisine.
filter=(all type:restaurant (within radius:1000ft lon:-122.39 lat:37.7955))
&output=(geocode practitioner_of)
Match "san francisco" and return all data in the location domain about it that is accessible via the output parameter.
filter=(all name{full}:"San Francisco" type:/location/citytown)
&output=(all:/location)
&limit=1
https://developers.google.com/freebase/v1/search-output
So maybe that's another thing... allowing clients to output all or specific properties from candidates, and not only type, score, features, match? Here features is a matching criteria set but has nothing to do with what might be output or requested additionally about candidates. Perhaps a new output field might be a good thing to add to reconciliation candidate responses? That would give clients more ways to determine or set their own match scoring algorithms or criteria however they want, instead of asking the service to use its own rules (and have to build them!).
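To illustrate what client-side scoring on top of such a field could look like, here is a small Python sketch. It assumes output is a mapping from property id to the candidate's value for that property; that shape, and the name client_score, are hypothetical and not part of any draft.

```python
def client_score(candidate, expected):
    """Fraction of expected property values found in a candidate's `output`.

    `output` is the hypothetical field proposed above, assumed here to be
    a mapping from property id to the candidate's value for that property.
    """
    if not expected:
        return 0.0
    output = candidate.get("output", {})
    hits = sum(1 for prop, value in expected.items() if output.get(prop) == value)
    return hits / len(expected)
```

A client could swap in any comparison it likes (fuzzy dates, geo distance, etc.) without the service having to implement it.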
While it's true that Freebase Search had this feature (and it was exposed in Freebase Suggest), Google Refine / OpenRefine never used it, as far as I'm aware.
@thadguidry Are you aware of other clients which made good use of this portion of the API(s)?
@tfmorris I am not aware
The introduction of SSSOM into the discussion (@fsteeg) makes me wonder if the reconciliation API is intended to support non-exact matches (e.g. broader/narrower/close). Historically, the goal has been exact matches only. Imperfect candidates may be returned, but by the time the user has reviewed and accepted a candidate it is, by definition, an exact match. I think solving for this use case should be the top priority since it's what the vast majority of users want to do.
> Historically, the goal has been exact matches only.
I was thinking about these different match relations not on the entity level, but on a property level. Basically, for the entity to be an exact match, as you describe, we want properties to match in some specific way. Like: This 'Paris' here is (exactly) 'Paris, Texas', because its location 'isContainedIn' Texas.
feature_view's depending on a serviceVersion or even schemaSpace? But not sure about that case myself directly.

Regardless, it is sometimes useful for clients to know the match context of candidates (or features of candidates) from a recon process against a service, if the service decides to provide a bit more context or information about a match score, an entity itself, or types or properties that were used or not used, etc.
A match context provides a simple means of returning extra metadata or subdata about a match overall, and not necessarily about an individual feature, although it could also say much about that as well, since the value type of context would simply be a String.

Example 1:
Example 2:
I envision that context might be most useful at the candidate level, as in the first example, but perhaps also in features, as in Example 2?

@wetneb let me know which parts above are unclear and I can update this. This is simple for a reason: it does not directly address larger client-service feedback loops, but it is a step in that direction for broader applicability and uptake by service providers (hopefully).
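Since context would just be a String, a client could gather it from both levels with very little code. A minimal Python sketch follows, assuming an optional context key on the candidate and on each entry in features; the function name collect_context is made up for illustration.

```python
def collect_context(candidate):
    """Gather `context` strings from a candidate and from its features.

    `context` is the proposed optional String field; allowing it at both
    levels here is an assumption for illustration.
    """
    notes = []
    if candidate.get("context"):
        notes.append(candidate["context"])
    for feature in candidate.get("features", []):
        if feature.get("context"):
            notes.append(f"{feature['id']}: {feature['context']}")
    return notes
```

Services that provide no context simply produce an empty list, so existing responses would keep working unchanged.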