reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Add ability to mark reconciliation property as required (used for candidate retrieval) #101

Closed wetneb closed 7 months ago

wetneb commented 1 year ago

Currently, reconciliation queries consist of:

Each of those properties comes with a property identifier (pid) and a value (v).

This lets clients refine the search for reconciliation candidates, by providing additional data to be taken into account.

Most reconciliation pipelines feature two separate steps:

At the moment it is up to the services to determine which parts of a reconciliation query they use for candidate retrieval or matching. This is often dictated by the structure of the database they are exposing, as indices are expensive to construct and maintain.

However, in certain cases it would be helpful to give the user some control over which parts of the query should be used at the retrieval stage.

Consider for instance a batch of reconciliation queries for cities, consisting of a name (in the query field) and of geographical coordinates (in some property supplied in the properties field). Potentially, the reconciliation service could have two indices at its disposal:

With their knowledge of the quality of the data to reconcile (and of the structure of the authority database), the user could either decide:

  1. that names are more reliable, hence they should be used for candidate retrieval, and the coordinates should only be used as tie-breakers between namesakes at the scoring phase
  2. or that coordinates are more reliable, so they should be used for retrieval, and names are then used for scoring in the second step

At the moment our API does not let the user specify this sort of information. I think it would be worth introducing a syntax for that, for instance (random JSON syntax made up without putting a lot of thought into it):

Scenario 1 would be

{
    "query": "Cambridge",
    "query_hint": "retrieval",
    "properties": [
         {
             "pid": "coordinates",
             "v": "52.205278,0.119167",
             "hint": "scoring"
         }
    ]
}

and scenario 2 would be:

{
    "query": "Cambridge",
    "query_hint": "scoring",
    "properties": [
         {
             "pid": "coordinates",
             "v": "52.205278,0.119167",
             "hint": "retrieval"
         }
    ]
}
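
A client could produce either of the two payloads above from a single choice of which field to trust for retrieval. The sketch below is only illustrative: `build_query` and its `retrieve_on` parameter are hypothetical names, and `query_hint` / `hint` are the made-up fields from this proposal, not part of the current specs.

```python
import json


def build_query(name, coordinates, retrieve_on="name"):
    """Build a reconciliation query carrying the proposed (hypothetical) hints.

    retrieve_on="name" gives scenario 1, retrieve_on="coordinates" gives scenario 2.
    """
    if retrieve_on not in ("name", "coordinates"):
        raise ValueError("retrieve_on must be 'name' or 'coordinates'")
    name_is_retrieval = retrieve_on == "name"
    return {
        "query": name,
        # Hint for the main query string.
        "query_hint": "retrieval" if name_is_retrieval else "scoring",
        "properties": [
            {
                "pid": "coordinates",
                "v": coordinates,
                # The property gets the opposite role from the name.
                "hint": "scoring" if name_is_retrieval else "retrieval",
            }
        ],
    }


scenario_1 = build_query("Cambridge", "52.205278,0.119167", retrieve_on="name")
scenario_2 = build_query("Cambridge", "52.205278,0.119167", retrieve_on="coordinates")
print(json.dumps(scenario_2, indent=4))
```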

Because we cannot reasonably expect services to have indices on every property they expose, I think this feature should be a SHOULD or MAY and not a MUST in the specs (hence the proposed syntax of hint, that the services can decide to follow or not).
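
To illustrate what "can decide to follow or not" might look like on the service side, here is a minimal sketch, assuming a service that knows which properties it has indices on. Everything in it is hypothetical: the `plan_query` function, the `indexed_pids` set, and the default roles for fields without a hint (name for retrieval, properties for scoring) are assumptions made for illustration, not something the specs define.

```python
def plan_query(query, indexed_pids, name_indexed=True):
    """Split a query into retrieval and scoring parts, treating hints as optional.

    indexed_pids: property ids the (hypothetical) service has an index on.
    Returns (retrieval, scoring), each a list of (field, value) pairs; the main
    query string is reported under the field name "query".
    """
    retrieval, scoring = [], []

    def place(field, value, hint, has_index):
        # Honour a "retrieval" hint only when an index exists for that field;
        # otherwise the field falls back to the scoring stage.
        if hint == "retrieval" and has_index:
            retrieval.append((field, value))
        else:
            scoring.append((field, value))

    # Assumed defaults when no hint is given: name -> retrieval, property -> scoring.
    place("query", query["query"], query.get("query_hint", "retrieval"), name_indexed)
    for prop in query.get("properties", []):
        place(prop["pid"], prop["v"], prop.get("hint", "scoring"),
              prop["pid"] in indexed_pids)

    # If no hinted field could be used for retrieval, fall back to the name index.
    name_part = ("query", query["query"])
    if not retrieval and name_indexed and name_part in scoring:
        scoring.remove(name_part)
        retrieval.append(name_part)
    return retrieval, scoring


scenario_2 = {
    "query": "Cambridge",
    "query_hint": "scoring",
    "properties": [
        {"pid": "coordinates", "v": "52.205278,0.119167", "hint": "retrieval"}
    ],
}

# Service with a geographical index: the hint is followed.
with_geo_index = plan_query(scenario_2, indexed_pids={"coordinates"})
# Service without one: the hint is ignored and names drive retrieval.
without_geo_index = plan_query(scenario_2, indexed_pids=set())
```

Because a hint that cannot be honoured degrades to the service's usual behaviour instead of producing an error, clients can send hints to any service without first checking what it supports.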


wetneb commented 1 year ago

In today's call, a preference was expressed for a syntax that is less oriented towards the internals of search engines (retrieval / scoring) and closer to something the user can relate to (property should or must match).

thadguidry commented 1 year ago

Yes, this issue is still needed in some regard, but perhaps not 100% urgent? The use case for me came up when authority ids are re-minted in the case of development with Wikibase. In certain domains/namespaces, the authority ids are valid and I want to use them. In the other domains/namespaces, I want to hint to reconcile only using the names, as the ids are suspect in the borked Wikibase db merge that happened.