Closed wetneb closed 7 months ago
In today's call, a preference was expressed for a syntax less oriented towards the internals of search engines (retrieval / scoring) and closer to something the user can relate to (a property `should` or `must` match).
Yes, this issue is still needed in some regard, but perhaps not 100% urgent? The use case for me came up when authority ids are re-minted in the case of development with Wikibase. In certain domains/namespaces, the authority ids are valid and I want to use them. In the other domains/namespaces, I want to hint to reconcile only using the names, as the ids are suspect in the borked Wikibase db merge that happened.
Currently, reconciliation queries consist of:
- a query string (`query`)
- a list of properties (`properties`)

Each of those properties comes with a property identifier (`pid`) and a value (`v`). This lets clients refine the search for reconciliation candidates by providing additional data to be taken into account.
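As a rough illustration of that structure, here is a minimal Python sketch that assembles such a query. The field names `query`, `properties`, `pid` and `v` come from the description above; the helper `build_query`, the property identifier `coordinates` and the sample values are made up for the example.

```python
import json


def build_query(query, properties):
    """Assemble a reconciliation query from a main string and (pid, value) pairs."""
    return {
        "query": query,
        "properties": [{"pid": pid, "v": value} for pid, value in properties],
    }


# "coordinates" is a made-up property identifier used only for illustration.
city_query = build_query("Paris", [("coordinates", "48.8566,2.3522")])
print(json.dumps(city_query, indent=2))
```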
Most reconciliation pipelines feature two separate steps: candidate retrieval, followed by scoring (matching) of the retrieved candidates.
At the moment it is up to the services to determine which parts of a reconciliation query they use for candidate retrieval or matching. This is often dictated by the structure of the database they are exposing, as indices are expensive to construct and maintain.
However, in certain cases it would be helpful to give the user some control over which parts of the query should be used at the retrieval stage.
Consider for instance a batch of reconciliation queries for cities, consisting of a name (in the `query` field) and of geographical coordinates (in some property supplied in the `properties` field). Potentially, the reconciliation service could have two indices at its disposal: one on names and one on coordinates.

With their knowledge of the quality of the data to reconcile (and of the structure of the authority database), the user could decide either to retrieve candidates by name and use the coordinates only for scoring, or to retrieve candidates by coordinates and use the name only for scoring.
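To make the difference between the two choices concrete, here is a small Python sketch of a two-step pipeline over a toy in-memory database. Everything in it is an assumption for illustration: the `CITIES` records, the `reconcile` function, its `retrieve_by` parameter and the crude distance measure are not part of any existing service.

```python
from math import hypot

# Toy authority database; names and coordinates are illustrative only.
CITIES = [
    {"id": "c1", "name": "Paris", "coords": (48.86, 2.35)},
    {"id": "c2", "name": "Paris", "coords": (33.66, -95.56)},
    {"id": "c3", "name": "Lutetia", "coords": (48.85, 2.35)},
]


def name_matches(record, name):
    return record["name"].lower() == name.lower()


def distance(record, coords):
    # Plain Euclidean distance in degrees: crude, but enough for a sketch.
    return hypot(record["coords"][0] - coords[0], record["coords"][1] - coords[1])


def reconcile(name, coords, retrieve_by, max_distance=0.5):
    """Retrieve candidates with one field, then rank them with the other."""
    if retrieve_by == "name":
        candidates = [r for r in CITIES if name_matches(r, name)]
        candidates.sort(key=lambda r: distance(r, coords))
    elif retrieve_by == "coords":
        candidates = [r for r in CITIES if distance(r, coords) <= max_distance]
        # Exact name matches first (False sorts before True).
        candidates.sort(key=lambda r: not name_matches(r, name))
    else:
        raise ValueError(f"unknown retrieval field: {retrieve_by}")
    return [r["id"] for r in candidates]


by_name = reconcile("Paris", (48.86, 2.35), retrieve_by="name")
by_coords = reconcile("Paris", (48.86, 2.35), retrieve_by="coords")
```

In this toy data, retrieval by name never sees the record stored under a different name (`c3`), while retrieval by coordinates never sees the distant homonym (`c2`). Which behaviour is preferable depends on which part of the input data the user trusts more.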
At the moment our API does not let the user specify this sort of information. I think it would be worth introducing a syntax for that, for instance (random JSON syntax made up without putting a lot of thought into it):
Scenario 1 would be
and scenario 2 would be:
Because we cannot reasonably expect services to have indices on every property they expose, I think this feature should be a `SHOULD` or `MAY` and not a `MUST` in the specs (hence the proposed syntax of `hint`, that the services can decide to follow or not).
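To illustrate what "can decide to follow or not" might look like on the service side, here is a hedged Python sketch. Only the `hint` key is taken from the proposal; the shape of its value (a `retrieve_by` entry), the set of indexed fields and the function name are hypothetical and are not the syntax proposed in this issue.

```python
# Fields for which this hypothetical service maintains an index.
INDEXED_FIELDS = {"query", "coordinates"}
DEFAULT_RETRIEVAL_FIELD = "query"


def choose_retrieval_field(reconciliation_query):
    """Return the field to use for candidate retrieval.

    The hint is advisory: it is followed only when the service has an
    index for the hinted field, and ignored otherwise.
    """
    hint = reconciliation_query.get("hint", {})
    hinted = hint.get("retrieve_by")  # hypothetical key, not from the spec
    if hinted in INDEXED_FIELDS:
        return hinted
    return DEFAULT_RETRIEVAL_FIELD


followed = choose_retrieval_field(
    {"query": "Paris", "hint": {"retrieve_by": "coordinates"}}
)
ignored = choose_retrieval_field(
    {"query": "Paris", "hint": {"retrieve_by": "population"}}
)
absent = choose_retrieval_field({"query": "Paris"})
```

A service without an index on the hinted field simply falls back to its default retrieval strategy, which is the behaviour a `SHOULD` or `MAY` would permit.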