reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Date comparison needs an absolute setting to compute a relative diff score #114

Closed · paulgirard closed this issue 9 months ago

paulgirard commented 1 year ago

When testing matches on date properties, the recon service needs to output a 0-100 score from a date comparison. While a score of 100 is clear (a perfect match), the continuous score range between 0 and 99.99 is not. It requires defining a time scale that transforms a time difference into a 0-99.99 float. This needs a decision from the user, as it depends on the use case: users aligning items from a long-run historical study spanning millennia would not require the same scale as users working with contemporary event dates, for instance.

To illustrate, here is a commit I made to adapt the Wikidata recon API date scoring to a personal use case: https://gitlab.com/ouestware/openrefine-wikibase/-/commit/ee2bf57842b2d0dd00cf7e3f916461c03aabb164 (I think the scoring system I drafted should be simplified to express the score as a function of the time difference).

It might be worth letting the user define the date precision scale in the query.

One way could be to let the user define the time duration (expressed as an ISO 8601 duration) beyond which the difference score should be 0. This is equivalent to defining the time approximation span within which candidate dates must fall to be scored as potential candidates. But there are surely many other ways to solve this.
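For illustration only, here is a minimal sketch (not the code from the commit above) of how a user-supplied scale could turn a time difference into a 0-100 score; the `max_offset` parameter and the linear decay are assumptions made for the example:

```python
from datetime import datetime, timedelta


def date_score(candidate: datetime, query: datetime, max_offset: timedelta) -> float:
    """Map the time difference between two dates onto a 0-100 score.

    `max_offset` is the user-chosen scale: any difference greater than or
    equal to it scores 0, a perfect match scores 100, and the score decays
    linearly in between (a linear decay is just one possible choice).
    """
    diff = abs(candidate - query)
    if diff >= max_offset:
        return 0.0
    return 100.0 * (1.0 - diff / max_offset)


# A user reconciling contemporary events might pick a scale of ~30 days,
# while a study spanning millennia might use a scale of centuries instead.
print(date_score(datetime(1998, 5, 1), datetime(1998, 5, 15), timedelta(days=30)))        # ≈ 53.3
print(date_score(datetime(1801, 1, 1), datetime(1802, 1, 1), timedelta(days=365 * 200)))  # ≈ 99.5
```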

Laziness disclosure: I haven't checked whether such a setting already exists...

I hope this reflection can help future improvements of the spec.

wetneb commented 1 year ago

I agree that as a service implementer it is quite tricky to implement this in a satisfactory manner. The same problem applies to geographical coordinates or numerical data.

Given that those datatypes are not even mentioned by the specs I think it will be rather difficult to add support for parametrizing their matching in a satisfactory way, i.e. understandable for the users, generalizable to other cases where matching needs to be configurable, and so on. We would likely need something similar to the property settings of the data extension endpoint, so that services could expose pretty arbitrary forms to the users, but I think it is quite wonky and does not give a very satisfying UX.

I would prefer to solve this problem by letting clients access the raw data and re-score the candidates on their side with whatever strategy they want. This could be done either by using the existing data extension API to fetch properties of reconciliation candidates and re-score them based on those, or by adding support for returning property values (and not just matching features) in reconciliation responses.
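As a rough sketch of that client-side approach, assuming a service exposing the data extension endpoint and a hypothetical date property id `P571` (the exact request format and the shape of returned values depend on the service and spec version):

```python
import json
from datetime import datetime, timedelta

import requests

# Hypothetical endpoint and property id, for illustration only.
SERVICE_URL = "https://example.org/reconcile"
DATE_PROPERTY = "P571"


def rescore_candidates(candidate_ids, query_date, max_offset):
    """Fetch raw date values via the data extension API and re-score locally."""
    extend_query = {"ids": candidate_ids, "properties": [{"id": DATE_PROPERTY}]}
    response = requests.post(SERVICE_URL, data={"extend": json.dumps(extend_query)})
    rows = response.json()["rows"]

    scores = {}
    for candidate_id, properties in rows.items():
        values = properties.get(DATE_PROPERTY, [])
        # Keep only values carrying a date; parsing details vary per service.
        dates = [datetime.fromisoformat(v["date"]) for v in values if "date" in v]
        if not dates:
            scores[candidate_id] = 0.0
            continue
        best_diff = min(abs(d - query_date) for d in dates)
        scores[candidate_id] = max(0.0, 100.0 * (1.0 - best_diff / max_offset))
    return scores


# Example: re-rank three candidates against 1998-05-01 with a 30-day scale.
# rescore_candidates(["Q1", "Q2", "Q3"], datetime(1998, 5, 1), timedelta(days=30))
```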

paulgirard commented 1 year ago

I totally see the advantages of tuning scoring client-side, without having to re-run the complete recon process. But that approach implies the recon API pre-selects candidates based on what? Solely on the label? In that case the recon API would not use properties to score candidates at all?

My point here is that from the moment the recon API proposes a default score based on properties, it needs to let the user tune this default scoring, as there is no way to propose a default scoring that actually makes sense for every use case.

I also tend to think that those comparison issues are pretty generic, or at least that there are ways to propose default scorings that fit a large part of usage needs. Those would benefit from being developed and discussed by the recon community as much as the recon specs are, even if they are ultimately implemented in the recon clients.

To sum up, a big decision to take, I guess, is: should the recon API propose a default score based on properties, or only use the label and such?

wetneb commented 1 year ago

I think it is still possible for recon services to propose some default scoring based on labels and properties: this is what recon services currently do, right? Yes, it's not very precise or flexible, but that's what makes it a default scoring.

> I also tend to think that those comparison issues are pretty generic, or at least that there are ways to propose default scorings that fit a large part of usage needs.

You are totally welcome to propose something along those lines! I am skeptical but I am not the only one to have a say about this ^^

paulgirard commented 1 year ago

Well, I totally understand your skepticism, but the moment the recon API proposes default scoring based on properties, it needs to solve the scoring issues, even though those are only defaults. Those defaults would actually be the only scores for a lot of users who do not want to tweak anything.

My point in this issue is that the default date scoring in the Wikidata recon service would (in my opinion) be better if it were made fuzzy. The bad news is that fuzzy scoring needs a setting to work properly.

So actually the debate could be reshaped as: should the recon API default score use fuzzy matching? Indeed, the problems only arise the moment scoring uses fuzzy matching. The recon API would probably not need scoring settings at all (at least for dates) if default scoring were restricted to a perfect-match binary score system (the current system for dates in the Wikidata recon API).
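To make the trade-off concrete: a binary perfect-match score like the one sketched below needs no setting at all, while any fuzzy variant (such as the `date_score` sketch earlier in the thread) cannot avoid one.

```python
from datetime import datetime


def binary_date_score(candidate: datetime, query: datetime) -> float:
    """The setting-free option: 100 for an exact match, 0 otherwise."""
    return 100.0 if candidate == query else 0.0
```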

thadguidry commented 1 year ago

Let's take a use case that I see often enough in the database world: an ambiguous or uncertain date... but certain enough for a user.

EVENT_DATE: `1998-00-00`, `1998-05?`

A service might send this raw date string along with a score of 50% when queried with 1998~, if it had knowledge of EDTF and the ISO 8601-2 date extensions. A user might then tune the score client-side to 100% for any raw date strings returned by the service that end with -00-00, since they only care about the year of a major flood when reconciling their list of natural events, not the month or day.
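A sketch of that client-side tuning, assuming the service returns the raw date string and its own score in fields named `event_date` and `score` (both names made up for the example):

```python
def tuned_score(candidate: dict) -> float:
    """Client-side override: the user only cares about the year, so any
    year-precision raw value (ending in "-00-00") is promoted to a perfect
    match, regardless of the score the service proposed."""
    raw = candidate.get("event_date", "")
    if raw.endswith("-00-00"):
        return 100.0
    return candidate.get("score", 0.0)


print(tuned_score({"event_date": "1998-00-00", "score": 50.0}))  # 100.0
print(tuned_score({"event_date": "1998-05?", "score": 50.0}))    # 50.0
```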

I don't think the Reconciliation API spec should dictate defaults for the properties a service returns for any datatype. The only default I'd like to see in the Reconciliation API spec would be "" (empty) or null when asked for a property without any stored or known value to return, to coincide with a none query parameter if a client asks that way.

I actually think scoring should be considered an optional thing provided by a service, and the spec should highlight this more. A smart client could do very good scoring, and likely better, based on user input, feedback loops, etc. against a service, even if the service provided its own scoring.

@paulgirard I'd rather just use the Wikidata recon service as an example. When it comes to making any recon service better, we, as a community, should engage with the recon service provider to ask if they can improve things. On our spec side here, we seem to already have what we need in place to make it easier for a service provider like Wikidata to do fuzzy date matching. EDTF support is one way, and that is just smarter property value parsing, which can be accomplished with a smart client or a smarter recon service.

@paulgirard I'd rather have no defaults mentioned in our specs regarding scoring. Instead, the spec could refer to a guidance document or forum on "Property Scoring Approaches", where we mention things like EDTF for dates, null versus empty string, etc., that would directly affect making clients themselves smarter through more consistent handling of the raw data returned to them.

paulgirard commented 1 year ago

Thanks for the EDTF spec ref. I didn't know it, it's gorgeous!

For the rest, I have to say I am a bit lost, as I am not a spec person at all. I guess the easiest way to continue this discussion would be to attend the April meeting for a live chat about it.

tfmorris commented 1 year ago

I'm very skeptical of so-called "client-side scoring." As @paulgirard points out, the reconciliation service still needs to rank the candidates that it's going to return to the client.

Returning to dates, temporal distance isn't the only measure that could be used. In some cases, lexical distance may be relevant as well. Are 1801-01-01 and 8101-01-01 separated by 6,300 years, or by an edit distance of 2?
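Both readings are easy to compute, which is exactly why the choice of measure has to be deliberate; a quick sketch of the two:

```python
def char_distance(a: str, b: str) -> int:
    """Count positions that differ between two equal-length strings
    (an upper bound on the usual edit distance; here both equal 2)."""
    return sum(x != y for x, y in zip(a, b))


def year_gap(a: str, b: str) -> int:
    """Difference between the year parts of two ISO-formatted dates."""
    return abs(int(a[:4]) - int(b[:4]))


print(char_distance("1801-01-01", "8101-01-01"))  # 2
print(year_gap("1801-01-01", "8101-01-01"))       # 6300
```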