wetneb closed this issue 7 months ago
Yes, and I would say it's partly related to my use cases defined in #88. One interpretation is this:
{
"query": "John Doe",
"properties": [
{"pid": "P123", "v": "first value"},
// What would a qualifier or reference for this statement look like?
// How could a client more easily form a query that says
// "This statement MUST INCLUDE a reference from the government of France"?
{"pid": "P123", "v": "second value"}
]
}
So there are some semantic differences here that apply to filter constraints, like the ones I mentioned for conditional evaluation with OR, AND, MUST INCLUDE, SHOULD INCLUDE, and so on. It seems that asking "P123 SHOULD BE" versus "P123 SHOULD INCLUDE" might apply differently?
Can you write a fuller query example that asks more questions about one or more property statements, so that we can more easily discuss the pros, cons, and differences, and inspect it?
In my use case I don't need to have more than one property ID in the properties list. This issue is raised by the spec design choice of a list of objects rather than an object of properties keyed by property_id.
In my opinion, as long as there are no boolean/conditional modifiers in the query, allowing duplicated property IDs in the properties array is more a bug than a feature.
Now let's suppose the spec introduces conditional modifiers. The question is then: would allowing duplicated property IDs enable more advanced queries?
I can see many different ways to introduce conditional modifiers:
But I don't know how much conditional modifier complexity is actually needed by reconciliation use cases (i.e. the reconciliation API should probably not try to reproduce the capabilities of the SPARQL language). In my case, OR everywhere backed by a powerful scoring mechanism is enough (default behaviour).
Off the top of my head, I would rather think that conditional modifiers could be added to each property in two flavors:
But again, please take this as nothing more than broad ideas I just made up.
@wetneb do you store and have access to the Wikidata reconciliation service query log? It would help to know a little more about what kinds of queries are used.
[...] each properties [...] 1. a quantitative [0,1] weight [and] 2. array status:"all"|"any"|"none"
This sounds very good! In a recent reconciliation setup we did a lot of service-internal tweaking of field-level boosting settings and I noticed that this was the only parameter that was not accessible from the client. A weight would make that possible, and at the same time could be used to express not (0), must (1), and should (e.g. 0.5). With additional "all"|"any"|"none" for multiple values this should support very flexible queries.
I think it would also address #88 and #101, and with #106, would be accessible for the query too. I think it would even solve our use case for a complex type that I mentioned in https://github.com/OpenRefine/OpenRefine/issues/5615#issuecomment-1424496013, basically:
(Series OR Journal OR Periodical) AND NOT (Article OR PublicationIssue)
Which we could replace with properties like this:
"properties": [
{"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "weight": 1},
{"pid": "type", "v": ["Article", "PublicationIssue"], "match": "none", "weight": 1}
]
This example also shows how we'd use both multiple property statements for the same pid and multiple values for each statement.
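As a rough illustration of how a service could read the two statements above, here is a minimal Python sketch. It assumes exact string comparison of values, that statements are AND-ed together, and that a weight of 1 means "must"; none of this is fixed by the spec, and the names statement_holds and entity_passes are made up for the example.

```python
from typing import Iterable


def statement_holds(match: str, wanted: Iterable[str], actual: Iterable[str]) -> bool:
    """Evaluate one property statement against the values an entity actually has."""
    wanted_set, actual_set = set(wanted), set(actual)
    if match == "any":
        return bool(wanted_set & actual_set)
    if match == "all":
        return wanted_set <= actual_set
    if match == "none":
        return not (wanted_set & actual_set)
    raise ValueError(f"unknown match value: {match!r}")


def entity_passes(entity: dict, properties: list) -> bool:
    """True when every statement holds (assumption: statements are AND-ed)."""
    return all(
        statement_holds(p["match"], p["v"], entity.get(p["pid"], []))
        for p in properties
    )


# The example query from the comment above.
properties = [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "weight": 1},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "none", "weight": 1},
]

# Hypothetical entities, only for illustration.
journal = {"type": ["Journal"]}
article_in_series = {"type": ["Series", "Article"]}
```

Under these assumptions, journal passes both statements while article_in_series is rejected by the "none" statement, which is the (Series OR Journal OR Periodical) AND NOT (Article OR PublicationIssue) behaviour.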
Oh whoa, the last example indeed looks pretty sound and definitely closes the discussion about pid duplication.
Yet the spec documentation might indicate that pid duplications without weights and match, or with all the same weights and match, are equivalent:
"properties": [
{"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "weight": 1},
{"pid": "type", "v": ["Article", "PublicationIssue"], "match": "any", "weight": 1}
]
is the same as
"properties": [
{"pid": "type", "v": ["Series", "Journal", "Periodical", "Article", "PublicationIssue"], "match": "any", "weight": 1}
]
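If the spec did adopt this equivalence, a service or client could normalize a query before evaluating it. The following Python sketch only illustrates that idea: merge_duplicate_pids is a hypothetical helper, and it merges statements only when pid, match, and weight are all identical, leaving everything else untouched.

```python
def merge_duplicate_pids(properties):
    """Merge statements sharing pid, match and weight into one multi-valued statement."""
    merged = {}  # insertion-ordered: (pid, match, weight) -> merged statement
    for prop in properties:
        # Treat a single value as a one-element list.
        values = prop["v"] if isinstance(prop["v"], list) else [prop["v"]]
        key = (prop["pid"], prop.get("match"), prop.get("weight"))
        if key not in merged:
            merged[key] = {**prop, "v": []}
        for value in values:
            if value not in merged[key]["v"]:
                merged[key]["v"].append(value)
    return list(merged.values())
```

Statements that differ in match (for example "any" versus "none") keep separate entries, so the earlier two-statement example would pass through unchanged.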
We discussed this in yesterday's meeting, where Antonin pointed out that my use case for the weights would be better solved by re-scoring reconciliation results locally in the client, based on the matching features returned by the service. Passing weights instead requires a full rerun of the reconciliation for every tweak of the weights. He also questioned the semantics of 0 meaning not, since a zero weight seems to imply that the property can be ignored. Both points argue for using the should/must/not semantics instead of weights.
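To make the client-side alternative concrete, here is a minimal Python sketch of local re-scoring. It assumes each candidate carries a features list of objects with an id and a numeric value, and that a weighted sum is an acceptable score; the feature names "name" and "type", the weights, and the rescore function are all hypothetical.

```python
def rescore(candidates, weights):
    """Return candidates sorted by a weighted sum of their matching features."""
    def local_score(candidate):
        # Features without a weight contribute nothing to the local score.
        return sum(
            weights.get(feature["id"], 0.0) * feature["value"]
            for feature in candidate.get("features", [])
        )
    return sorted(candidates, key=local_score, reverse=True)


# Hypothetical service response, fetched once.
candidates = [
    {"id": "A", "features": [{"id": "name", "value": 0.9}, {"id": "type", "value": 0.0}]},
    {"id": "B", "features": [{"id": "name", "value": 0.6}, {"id": "type", "value": 1.0}]},
]

# Each tweak of the weights re-ranks locally, with no new reconciliation request.
by_name_only = rescore(candidates, {"name": 1.0})
type_heavy = rescore(candidates, {"name": 0.5, "type": 1.0})
```

The point of the sketch is only that changing the weights touches the already-returned candidates, not the service.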
We further discussed whether having both not (from should/must/not) and none (from any/all/none) makes sense, and were under the impression that none could be used whenever we want to negate (treating single values as 1-element arrays).
So the proposal here would be to add two dimensions, any/all/none and should/must, as optional fields of properties, e.g. for my example above:
"properties": [
{"pid":"type", "v":["Series", "Journal", "Periodical"], "match":"any", "mode":"must"},
{"pid":"type", "v":["Article", "PublicationIssue"], "match":"none", "mode":"must"}
]
Further questions would be defaults (any and should?) and naming (match and mode?).
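For discussion purposes, this is how the optional fields could be normalized if the tentative defaults were adopted. The sketch below is Python and purely illustrative: the defaults any and should, the field names match and mode, and the normalize helper are all still open questions or assumptions, not part of the spec.

```python
DEFAULT_MATCH = "any"    # assumed default, still an open question in this thread
DEFAULT_MODE = "should"  # assumed default, still an open question in this thread


def normalize(prop):
    """Fill in the assumed defaults and wrap a single value in a one-element list."""
    value = prop["v"]
    return {
        "pid": prop["pid"],
        "v": value if isinstance(value, list) else [value],
        "match": prop.get("match", DEFAULT_MATCH),
        "mode": prop.get("mode", DEFAULT_MODE),
    }
```

With this, a plain {"pid": "P123", "v": "first value"} statement would behave as any/should, and negating a single value only requires adding "match": "none".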
the spec documentation might indicate that pid duplications without or with all same weights[/mode] and match are equivalent
Yes, good point.
Yes, the re-run issue is a big one. But it should not close the discussion here, because tweaking on the client side brings up another issue: how are candidates pre-selected? If properties are used to propose a "default" score, then the issue remains even if clients can tweak ex post. See my comment in the date comparison issue: https://github.com/reconciliation-api/specs/issues/114#issuecomment-1471516654
This discussion is important to have in order to decide whether the recon API should let users tweak default scores.
To come back to the proposal of turning mode into a qualitative parameter: how would the recon API use this mode parameter? More specifically, how would the should be used to score candidates?
More specifically how the should would be used to score candidates?
Unless I'm missing something I think the idea is that we leave scoring/weighting out of this completely. Just boolean logic (any/all/none) and optional/non-optional (should/must). With the difference that (1) changes inclusion in results vs (2) keeps all results, just changes scoring (somehow). I also think of it as a filter (must) vs. non-filter (should) behavior. How exactly scoring is affected in a should query would be completely up to the service (but could be reported back using matching features).
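The filter versus non-filter reading can be sketched as follows in Python. This is only one possible interpretation: must statements remove entities from the results, while should statements merely count towards a placeholder score, since actual scoring would be up to the service. The holds and reconcile functions and the "lang" property are hypothetical.

```python
def holds(prop, entity):
    """Boolean part: does the entity satisfy this statement's any/all/none test?"""
    wanted, actual = set(prop["v"]), set(entity.get(prop["pid"], []))
    match = prop.get("match", "any")
    if match == "any":
        return bool(wanted & actual)
    if match == "all":
        return wanted <= actual
    if match == "none":
        return not (wanted & actual)
    raise ValueError(f"unknown match value: {match!r}")


def reconcile(entities, properties):
    """Drop entities failing a "must" statement; rank the rest by "should" hits."""
    musts = [p for p in properties if p.get("mode") == "must"]
    shoulds = [p for p in properties if p.get("mode", "should") == "should"]
    results = []
    for entity in entities:
        if not all(holds(p, entity) for p in musts):
            continue  # filter behaviour: excluded from the results
        score = sum(holds(p, entity) for p in shoulds)  # placeholder scoring only
        results.append((entity["id"], score))
    return sorted(results, key=lambda item: item[1], reverse=True)


properties = [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "mode": "must"},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "none", "mode": "must"},
    {"pid": "lang", "v": ["fr"], "match": "any", "mode": "should"},  # hypothetical property
]
entities = [
    {"id": "e1", "type": ["Journal"], "lang": ["fr"]},
    {"id": "e2", "type": ["Journal", "Article"], "lang": ["fr"]},
    {"id": "e3", "type": ["Series"], "lang": ["de"]},
]
```

Here e2 is filtered out by the second must statement, while e3 stays in the results and simply ranks below e1 because it misses the should statement.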
The specs currently do not forbid the following query:
In other words, we do not say whether the value of the pid field is expected to be unique in the properties array.
It is unclear to me if such a query should instead be reformulated as:
I guess services could interpret both queries differently for matching purposes, perhaps giving a use to both queries? In any case, it would be worth talking about that in the specs.
Brought up by @paulgirard in today's OpenRefine contributors meetup.