reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Supplying multiple properties with the same pid in a reconciliation query #105

Closed wetneb closed 7 months ago

wetneb commented 1 year ago

The specs currently do not forbid the following query:

{
    "query": "John Doe",
    "properties": [
         {"pid": "P123", "v": "first value"},
         {"pid": "P123", "v": "second value"}
     ]
}

In other words, we do not say whether the value of the pid field is expected to be unique in the properties array.

It is unclear to me if such a query should instead be reformulated as:

{
    "query": "John Doe",
    "properties": [
         {"pid": "P123", "v": ["first value", "second value"]}
     ]
}

I guess services could interpret both queries differently for matching purposes, perhaps giving a use to both queries? In any case, it would be worth talking about that in the specs.
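Whether a query repeats a pid can be checked mechanically. A minimal sketch, assuming the query has already been parsed into a Python dict; `duplicated_pids` is a hypothetical helper, not something the specs define:

```python
from collections import Counter

def duplicated_pids(query):
    """Return, sorted, the pids that occur more than once in query["properties"]."""
    counts = Counter(prop["pid"] for prop in query.get("properties", []))
    return sorted(pid for pid, n in counts.items() if n > 1)

first_form = {
    "query": "John Doe",
    "properties": [
        {"pid": "P123", "v": "first value"},
        {"pid": "P123", "v": "second value"},
    ],
}
second_form = {
    "query": "John Doe",
    "properties": [{"pid": "P123", "v": ["first value", "second value"]}],
}

print(duplicated_pids(first_form))   # ['P123']
print(duplicated_pids(second_form))  # []
```

A service that decides to treat both forms differently could use such a check to pick the interpretation; one that forbids duplicates could use it to reject the query.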

Brought up by @paulgirard in today's OpenRefine contributors meetup.

thadguidry commented 1 year ago

Yes, and I would say it's partly related to my use cases defined in #88. One interpretation is this:

{
    "query": "John Doe",
    "properties": [
         {"pid": "P123", "v": "first value"},
                // What would a qualifier or reference for this statement look like?
                // How could a client more easily form a query that says
                // "This statement MUST INCLUDE a reference from the government of France"?
         {"pid": "P123", "v": "second value"}
     ]
}

So there are some semantic differences here that apply to filter constraints, like the ones I mentioned for conditional evaluation with OR, AND, MUST INCLUDE, SHOULD INCLUDE, and so on. It seems that asking "P123 SHOULD BE" versus "P123 SHOULD INCLUDE" might apply differently?

Can you write a fuller query example that asks more questions about one or more property statements, so that we can more easily discuss the pros, cons, and differences and inspect it?

paulgirard commented 1 year ago

In my use case I don't need to have more than one property ID in the properties list. This issue arises from the spec's design choice of a list of objects rather than an object of properties keyed by property_id.

In my opinion, as long as there are no boolean/conditional modifiers in the query, allowing duplicated property ids in the properties array is more a bug than a feature.

Now suppose the spec introduces conditional modifiers. The question then is: would allowing duplicated property ids enable some advanced queries?

I can see many different ways to introduce conditional modifiers:

  1. at the property level to specify how the multiple values provided should be used
  2. at the query level to inform how the many properties should be applied in the scoring
  3. in a compound language, as in MongoDB Atlas Search compound queries: https://www.mongodb.com/docs/atlas/atlas-search/compound/

But I don't know how much conditional-modifier complexity reconciliation use cases actually need (i.e. the reconciliation API should probably not try to reproduce the capabilities of the SPARQL language). In my case, OR everywhere backed by a powerful scoring mechanism is enough (the default behaviour).

Off the top of my head, I would think that conditional modifiers could be added to each property in two flavors:

  1. adding either a qualitative "should" (default)|"must"|"not" condition or a quantitative [0,1] weight for the property
  2. adding another one to indicate how the property's values array should be treated: "all"|"any"|"none"

But again, please take these as nothing more than broad ideas I just made up.

@wetneb do you store and have access to the Wikidata reconciliation service query log? It would help to know a little more about what kinds of queries are used.

fsteeg commented 1 year ago

> [...] each properties [...] 1. a quantitative [0,1] weight [and] 2. array status: "all"|"any"|"none"

This sounds very good! In a recent reconciliation setup we did a lot of service-internal tweaking of field-level boosting settings and I noticed that this was the only parameter that was not accessible from the client. A weight would make that possible, and at the same time could be used to express not (0), must (1), and should (e.g. 0.5). With additional "all"|"any"|"none" for multiple values this should support very flexible queries.

I think it would also address #88 and #101, and with #106, would be accessible for the query too. I think it would even solve our use case for a complex type that I mentioned in https://github.com/OpenRefine/OpenRefine/issues/5615#issuecomment-1424496013, basically:

(Series OR Journal OR Periodical) AND NOT (Article OR PublicationIssue)

Which we could replace with properties like this:

"properties": [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "weight": 1},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "none", "weight": 1}
]

This example also shows how we'd use both multiple property statements for the same pid and multiple values for each statement.

paulgirard commented 1 year ago

Oh whoa, the last example indeed looks pretty sound and definitely closes the discussion about pid duplication. Still, the spec documentation might state that entries duplicating a pid, either without weight and match or all with the same weight and match, are equivalent to a single merged entry:

"properties": [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "weight": 1},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "any", "weight": 1}
]

is the same as

"properties": [
    {"pid": "type", "v": ["Series", "Journal", "Periodical", "Article", "PublicationIssue"], "match": "any", "weight": 1}
]

fsteeg commented 1 year ago

We discussed this in yesterday's meeting, where Antonin brought up that my use case for the weights would be better solved by re-scoring reconciliation results locally in the client, based on the matching features returned by the service. Passing weights instead requires a full rerun of the reconciliation for every tweak of the weights. He also questioned the semantics of 0 meaning not, since a zero weight seems to imply that the property can be ignored. Both points suggest using the should/must/not semantics instead of weights.

We further discussed whether having both not (from should/must/not) and none (from any/all/none) makes sense, and were under the impression that none could be used whenever we want to negate (treating single values as 1-element arrays).
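A rough sketch of the client-side re-scoring idea, assuming each candidate carries a `features` list of `{"id", "value"}` objects for its matching features. The feature ids, the values, and the `rescore` helper are made up for illustration:

```python
def rescore(candidates, weights):
    """Re-rank candidates locally as a weighted sum of their matching features.

    Features without an entry in `weights` contribute nothing.
    """
    def local_score(candidate):
        return sum(
            weights.get(feature["id"], 0.0) * feature["value"]
            for feature in candidate.get("features", [])
        )
    return sorted(candidates, key=local_score, reverse=True)

candidates = [
    {"id": "A", "features": [{"id": "name_match", "value": 0.9},
                             {"id": "type_match", "value": 0.0}]},
    {"id": "B", "features": [{"id": "name_match", "value": 0.6},
                             {"id": "type_match", "value": 1.0}]},
]

# Tweaking the weights only re-sorts local data; no new request is sent.
print([c["id"] for c in rescore(candidates, {"name_match": 1.0})])
# ['A', 'B']
print([c["id"] for c in rescore(candidates, {"name_match": 1.0, "type_match": 0.5})])
# ['B', 'A']
```

This only re-orders candidates the service has already returned, which is why it avoids a rerun but cannot bring back candidates the service never selected.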

So the proposal here would be to add two dimensions: any/all/none and should/must, as optional fields of properties, e.g. for my example above:

"properties": [
    {"pid":"type", "v":["Series", "Journal", "Periodical"], "match":"any", "mode":"must"},
    {"pid":"type", "v":["Article", "PublicationIssue"], "match":"none", "mode":"must"}
]

Further questions would be defaults (any and should?) and naming (match and mode?).
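To make the any/all/none dimension concrete, here is a minimal sketch of how one property entry could be evaluated against a candidate's values. It assumes exact value comparison (a real service would likely match more loosely) and takes `any` as the default, which is only the default floated above; `values_match` is a hypothetical name:

```python
def values_match(candidate_values, wanted, match="any"):
    """Evaluate one property entry against a candidate's values for that pid."""
    # A single value is treated as a 1-element array.
    wanted = wanted if isinstance(wanted, list) else [wanted]
    hits = [value in candidate_values for value in wanted]
    if match == "any":
        return any(hits)
    if match == "all":
        return all(hits)
    if match == "none":
        return not any(hits)
    raise ValueError(f"unknown match: {match!r}")

journal_types = {"Journal", "Periodical"}
print(values_match(journal_types, ["Series", "Journal", "Periodical"], "any"))  # True
print(values_match(journal_types, ["Article", "PublicationIssue"], "none"))     # True
print(values_match(journal_types, ["Series", "Journal"], "all"))                # False
```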

> the spec documentation might indicate that pid duplications without or with all same weights[/mode] and match are equivalent

Yes, good point.
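The equivalence could be expressed as a normalization step. A sketch, assuming the field names `match` and `mode` and the defaults `any` and `should`, all of which are still open questions in this thread; `merge_same_pid` is a hypothetical helper:

```python
def merge_same_pid(properties, default_match="any", default_mode="should"):
    """Merge entries that share pid, match and mode into one multi-valued entry."""
    merged = {}  # (pid, match, mode) -> values, in first-seen order
    for prop in properties:
        key = (prop["pid"],
               prop.get("match", default_match),
               prop.get("mode", default_mode))
        values = prop["v"] if isinstance(prop["v"], list) else [prop["v"]]
        merged.setdefault(key, []).extend(values)
    return [
        {"pid": pid, "v": values, "match": match, "mode": mode}
        for (pid, match, mode), values in merged.items()
    ]

same = [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "mode": "must"},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "any", "mode": "must"},
]
different = [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "mode": "must"},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "none", "mode": "must"},
]

print(len(merge_same_pid(same)))       # 1
print(len(merge_same_pid(different)))  # 2
```

Entries that differ in `match` or `mode` stay separate, so the any/none example above is left untouched. Whether a merged entry scores the same as the duplicated ones under `should` would still depend on the service.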

paulgirard commented 1 year ago

Yes, the re-run issue is a big one. But it should not close the discussion here, because tweaking on the client side brings another issue: how are candidates pre-selected? If properties are used to propose a "default" score, then the issue remains even if clients can tweak ex post. See my comment in the date comparison issue: https://github.com/reconciliation-api/specs/issues/114#issuecomment-1471516654

This discussion is important to have in order to decide whether and how the recon API should let users tweak default scores.

To come back to the proposal of turning mode into a qualitative parameter: how would the recon API use this mode parameter? More specifically, how would should be used to score candidates?

fsteeg commented 1 year ago

> More specifically how the should would be used to score candidates?

Unless I'm missing something I think the idea is that we leave scoring/weighting out of this completely. Just boolean logic (any/all/none) and optional/non-optional (should/must). With the difference that (1) changes inclusion in results vs (2) keeps all results, just changes scoring (somehow). I also think of it as a filter (must) vs. non-filter (should) behavior. How exactly scoring is affected in a should query would be completely up to the service (but could be reported back using matching features).
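The filter versus non-filter reading can be sketched as follows. This is only one possible interpretation: candidates are plain dicts mapping a pid to a list of values, comparison is exact, the `lang` pid is invented, and counting satisfied `should` entries stands in for whatever scoring a real service would choose:

```python
def entry_matches(candidate, prop):
    """Boolean any/all/none test of one property entry (exact comparison)."""
    wanted = prop["v"] if isinstance(prop["v"], list) else [prop["v"]]
    have = candidate.get(prop["pid"], [])
    hits = [value in have for value in wanted]
    match = prop.get("match", "any")
    if match == "all":
        return all(hits)
    if match == "none":
        return not any(hits)
    return any(hits)

def reconcile(candidates, properties):
    """Drop candidates failing a "must" entry; rank the rest by "should" hits."""
    musts = [p for p in properties if p.get("mode", "should") == "must"]
    shoulds = [p for p in properties if p.get("mode", "should") == "should"]
    results = []
    for candidate in candidates:
        if not all(entry_matches(candidate, p) for p in musts):
            continue  # "must" acts as a filter
        score = sum(entry_matches(candidate, p) for p in shoulds)  # placeholder scoring
        results.append((candidate["id"], score))
    return sorted(results, key=lambda r: r[1], reverse=True)

candidates = [
    {"id": "c1", "type": ["Journal"], "lang": ["fr"]},
    {"id": "c2", "type": ["Journal"], "lang": ["de"]},
    {"id": "c3", "type": ["Article"], "lang": ["fr"]},
]
properties = [
    {"pid": "type", "v": ["Series", "Journal", "Periodical"], "match": "any", "mode": "must"},
    {"pid": "type", "v": ["Article", "PublicationIssue"], "match": "none", "mode": "must"},
    {"pid": "lang", "v": "fr", "mode": "should"},
]

print(reconcile(candidates, properties))  # [('c1', 1), ('c2', 0)]
```

Here c3 is excluded by the `must` entries, while c2 stays in the results and merely ranks lower for missing the `should` entry.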