reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

feedback for ML tuning weights #30

Open wetneb opened 4 years ago

wetneb commented 4 years ago

Sending reconciliation decisions to a service

Reconciliation services are currently unaware of which of their proposed candidates was picked by the user (if any). In the January CG call we discussed that there could potentially be a method in the API to do something along these lines. A client would send back chosen matches using a dedicated API method. They would likely need to refer to the original query in some way (or provide it again). This would probably be an opt-in feature for most clients for privacy reasons.

If the service provider wants to rely on this data to tune the weights of their scoring mechanism (for instance), they probably want to rely on user authentication (#26) to be able to attribute decisions to particular users.
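
To make this concrete, a feedback call could look something like the sketch below; the endpoint name and payload shape are purely hypothetical, since nothing along these lines is in the specs yet.

```python
import requests

# Hypothetical feedback payload: echo the original query so the service
# can associate the decision with the candidates it proposed.
feedback = {
    "query": {"query": "Douglas Adams", "type": "Q5"},
    "match": "Q42",  # the candidate the user picked; None if none matched
}

# Hypothetical opt-in endpoint; user authentication (#26) would let the
# service attribute the decision to a particular user.
requests.post("https://example.org/reconcile/feedback", json=feedback)
```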

VladimirAlexiev commented 4 years ago

I talked just yesterday to Ontotext's CTO @vassilmomtchev about such a feature. Modern entity matching frameworks (e.g. Magellan) use feedback to tune ML-based algorithms, and the Reconciliation API should have the same.

tfmorris commented 4 years ago

This would probably be an opt-in feature for most clients for privacy reasons.

This definitely needs to be opt-in and not something mandated by the API/protocol.

workergnome commented 4 years ago

Completely agree. Opt-in, but it would be good to have a pattern to reuse.



wetneb commented 4 years ago

For generic reconciliation services like Wikidata, I would probably not rely on this feature if it existed in the specs, because of the diversity of the queries / use cases / matching criteria. Even if I could collect final matching decisions from users, I am not sure they would make a good dataset to influence the scoring mechanism in the service: because each user has different data shapes and matching criteria, taking the decisions of user A into account in our scoring might not help user B at all. Also, updating the scoring mechanism server-side makes reconciliation workflows less reproducible.

Since this must be opt-in, the question is also how to incentivize users to send this feedback. What benefit do they get out of it? The vague promise that their decisions might be used down the line to change the scoring mechanism? How will they know if/when they can count on that happening? But that's just a comment about my own use cases - no particular opposition to adding it to the specs if it can be useful for other services.

thadguidry commented 4 years ago

After today's call and @workergnome's feedback, I actually feel like the part of what @wetneb was describing about feedback with classifier routines and ML deserves ANOTHER API for ML workflows. Call it a classifier-reconcile-api or whatever you want, but to me, as @tfmorris says, this should be opt-in, and I would take it further and separate it into another API spec entirely.

The risk of separating the APIs (1 for humans | 1 for machines) is what, however? (Playing devil's advocate here.)

wetneb commented 4 years ago

@thadguidry this issue is different from what we have been discussing today. At least I was talking about #31.

MatthiasWinkelmann commented 4 years ago

I had been thinking about implementing something like this. Glad to see others agree, less glad to see it isn't quite there yet.

While I agree that it should be optional, I don't necessarily get the significance it is given in this discussion. I can't come up with a realistic scenario where any private data is compromised by such a feedback mechanism but not the initial query.

Assuming services see actual value in this data, clients should make opting in explicit but easy: not hiding it in a configuration file but, for example, specifically asking the user to opt in or out, and optionally remembering that decision.

@wetneb's concerns come, I believe, from a slightly too simple idea of how something like this would work: the system would not save individual matching decisions in their entirety and then reproduce them when exactly the same query is made again. It's probably better to think of this as changing the matching algorithm instead of changing any entity data. It would allow, for example, learning about the relative importance of different property matches. It could also yield far more realistic algorithms for scoring differences: the month and day of a birth date being switched might happen more often than other errors, and should therefore incur less of a penalty. Names are already scored using Soundex or similar, but that doesn't capture the similarity of "William" and "Bob". Middle names are highly relevant in some cultures (George H. W. vs. George W.), while in others everyone has them but they are commonly dropped in all but formal communications.
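
To make the birth-date example concrete, here is a toy difference scorer; every constant in it is a placeholder that feedback data could eventually tune.

```python
from datetime import date

def birthdate_score(a: date, b: date) -> float:
    """Score two birth dates, penalizing a day/month swap less than
    an arbitrary mismatch (placeholder penalty values)."""
    if a == b:
        return 1.0
    # Day and month switched but year identical: a common data-entry
    # error, so it should cost less than a plain mismatch.
    if a.year == b.year and (a.month, a.day) == (b.day, b.month):
        return 0.8
    if a.year == b.year:
        return 0.4
    return 0.0

print(birthdate_score(date(1952, 3, 11), date(1952, 11, 3)))  # 0.8
```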

The universe of such nuances is pretty much endless, and certainly greater than anyone's ability to manually catalog. And while individual data shapes may change between users, any halfway decent system is bound to improve matching drastically, if only because what we currently have is really just voodoo.

Having that data also allows measuring the scoring performance. That's a huge benefit even without any attempt at machine learning. The scoring and matching could evolve as it did until now, by reasoning our way to what we believe to be better and implementing it in code. But after any change, we could then re-run past data, measure any improvements, and also detect pathological cases we just introduced.

wetneb commented 4 years ago

@wetneb's concerns come, I believe, from a slightly too simple idea of how something like this would work: the system would not save individual matching decisions in their entirety and then reproduce them when exactly the same query is made again. It's probably better to think of this as changing the matching algorithm instead of changing any entity data.

Then I was not clear enough: I would absolutely not reproduce learned replies when the exact same query is made again, that would not make sense. The goal is indeed to improve the scoring mechanism in general.

But as a user, I would not like to use a service whose scoring mechanism changes unpredictably depending on the matching decisions submitted by other users. This means that the same reconciliation workflow on the same data could give different results if done a few days later.

It would allow, for example, learning about the relative importance of different property matches.

It might make sense for some services to learn the relative importance of properties over all the queries they get, but not for services like Wikidata, where this relative importance really depends on the dataset you have at hand.

Say I am matching a dataset of sportspeople with names, sport practiced and nationality. In my dataset, names only contain initials for the given names, so perhaps the matching will be better with more weight on the sport and nationality. But perhaps the day after, you want to match another dataset which also happens to have names, sport practiced and nationality. In your dataset the names are completely spelled out, even with middle names, and the nationality column is less reliable because it is sometimes mixed up with the country the person plays for as a sportsperson. So in your case you probably want a higher weight on the name and less reliance on the nationality. If the service learns your weights, and I come back to match more of my dataset, I will get a less precise matching and will have to push the service's settings back to fit my needs.

This is why I would like to make it possible to learn weights client side, by letting services expose more granular features (#38). That being said I am not opposed to making it possible to submit decisions as proposed here - if you have an idea of what this should look like, why not submit a PR?
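
As a sketch of what that client-side re-scoring could look like, assuming the service exposes per-candidate matching features as in #38 (the feature names and weights below are made up):

```python
def rescore(candidate: dict, weights: dict) -> float:
    # Assumed feature shape: [{"id": ..., "value": ...}, ...] per candidate.
    features = {f["id"]: f["value"] for f in candidate.get("features", [])}
    return sum(w * features.get(fid, 0.0) for fid, w in weights.items())

# My dataset (initials-only names): lean on sport and nationality.
my_weights = {"name_similarity": 0.2, "sport": 0.4, "nationality": 0.4}
# Your dataset (full names, unreliable nationality): opposite trade-off.
your_weights = {"name_similarity": 0.7, "sport": 0.2, "nationality": 0.1}

candidates = [
    {"id": "Q1", "features": [{"id": "name_similarity", "value": 0.5},
                              {"id": "sport", "value": 1.0},
                              {"id": "nationality", "value": 1.0}]},
    {"id": "Q2", "features": [{"id": "name_similarity", "value": 0.9},
                              {"id": "sport", "value": 0.0},
                              {"id": "nationality", "value": 1.0}]},
]
# Each client re-orders the same candidates with its own weights,
# without the service having to change anything server-side.
candidates.sort(key=lambda c: rescore(c, my_weights), reverse=True)
```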

VladimirAlexiev commented 4 years ago

@wetneb You give an excellent example of two matching scenarios that put different importance on the same features. Let's call this a "feature scenario".

I think the client needs to expose these preferences to the server in a declarative way. Maybe the features described in #38 and https://reconciliation-api.github.io/specs/latest/#dfn-matching-feature are the way to go.

A naive proposal:

make it possible to learn weights client side

But if the client can't affect server-side processing, what good is that?

The client can only afford to get a limited number of candidates per row. He could use the exposed feature weights to order them in a different way, but if the best candidate is not in that limited selection, he's screwed. We need to find a way to expose to the server the feature preferences embodied in the reordering.

I can't believe we'd be the first to face this fundamental problem: how can ML clients and servers interact? @vasoto, in your study of ML and Data Science ontologies, have you seen anything about describing features and preferences in a structured way?

wetneb commented 4 years ago

make it possible to learn weights client side

But if the client can't affect server-side processing, what good is that?

I would personally find it useful to be able to locally re-score the reconciliation results returned by the service. This re-scoring could either be done using a scoring function I came up with manually, or be learned from data (train a classifier on a few samples of annotated data).

Let's make this super concrete to make sure we are on the same page on this.

I reconcile a database of films to Wikidata using the following information: title, director, producer and filming date. After reviewing the reconciliation candidates returned by the service, I realize that it is common that the correct reconciliation candidate is not the highest-ranking one, but a bit further down in the list. I notice that this happens because the reconciliation service does not give enough weight to the director and too much to the filming date. I therefore want to re-order the reconciliation candidates for each row according to a new scoring function. Perhaps I want to write my own scoring function based on the features given by the service (for instance as a GREL expression, if the features are exposed in GREL), or, if this is supported, I could simply annotate a few examples (tell which reconciliation candidates are correct) and OpenRefine would fit a classifier to these data points, which would hopefully generalize well to other cases.
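
As a sketch of that last step, fitting a classifier to a few annotated candidates might look like this, with scikit-learn standing in for whatever OpenRefine would actually use, and with made-up feature names:

```python
from sklearn.linear_model import LogisticRegression

# One row per (query, candidate) pair: [title, director, producer,
# filming_date] similarity scores, assumed to be exposed by the service.
X = [
    [0.9, 1.0, 0.8, 0.2],  # annotated: correct match
    [0.9, 0.0, 0.1, 0.9],  # annotated: wrong match
    [0.7, 1.0, 0.9, 0.1],  # correct
    [0.8, 0.2, 0.0, 0.8],  # wrong
]
y = [1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)

# Re-rank fresh candidates by learned match probability; on data like
# the above, the director feature ends up weighted over the filming date.
new_candidates = [[0.8, 1.0, 0.7, 0.3], [0.95, 0.1, 0.2, 0.95]]
match_probability = clf.predict_proba(new_candidates)[:, 1]
```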

Let me know if you still struggle to see the interest in this, happy to expand on any step!

VladimirAlexiev commented 4 years ago

but a bit further down in the list.

How do you know most matches will be "a bit further" and not "quite further"? People typically ask for top-3 or top-5 matches. If the correct match is not within this short list, there's nothing you can do client-side.

Another (smaller) concern: the option "auto-match top candidate" is very important on large sheets. In your scenario you'd need multiple recon steps on the client side:

AFAIK the last 2 items can't be done with OpenRefine at present

wetneb commented 4 years ago

How do you know most matches will be "a bit further" and not "quite further"? People typically ask for top-3 or top-5 matches. If the correct match is not within this short list, there's nothing you can do client-side.

Of course. This only works when the default scoring mechanism of the service is good enough to surface the correct candidates somewhere in the list.

Another (smaller) concern: the option "auto-match top candidate" is very important on large sheets.

This can already be done in OpenRefine, there is a button for this in the "Reconcile" -> "Actions" menu. You are correct that there is no way to reorder candidates at the moment.

thadguidry commented 4 years ago

@VladimirAlexiev Hi Vladimir,

Were you thinking of exposing something like a PIT (point in time) or a search_after, to allow reconcile providers to show their clients that there are further pages of hits, regardless of client preferences for limiting the size of results returned? That way they would still get some kind of indicator that "there are more pages of reconcile results available for this recon entity". The scoring of each page or listing could reflect that "we are not done yet here" and show reduced weighting for scores, because both sides know that a full picture hasn't surfaced yet. Something like how Elasticsearch does it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html
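
Roughly, I have something like this hypothetical response shape in mind (none of these fields exist in the specs):

```python
# Hypothetical paginated candidate list, loosely modeled on
# Elasticsearch's search_after; every field name here is made up.
response = {
    "q0": {
        "result": [
            {"id": "L123-S1", "name": "set (sense 1)", "score": 71.0},
            {"id": "L123-S2", "name": "set (sense 2)", "score": 68.0},
        ],
        # Both sides know the picture isn't complete yet.
        "more_results": True,
        # Opaque cursor the client would send back to get the next page.
        "search_after": "68.0|L123-S2",
    }
}
```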

VladimirAlexiev commented 4 years ago

@wetneb and @thadguidry If I have 10k rows, I'd like to get 3-5 matches per row, not a number of matches that would require pagination

thadguidry commented 4 years ago

@VladimirAlexiev Sure, understood, but it's always about context. For example, if you were reconciling Lexemes, there are entities (words) in the English language that have over 400 senses. One of those words is "set". Imagine I have that word "set" in my client and want to reconcile it against one of those 400 senses. For manual reconciling, I need to see all 400 senses in order to choose the right one. For ML tuning, of course it would need additional info besides just the Lexeme, but even with ML tuning hints to reduce the search scope, it might still result in 50 plausible matches for a sense of "set". I'm not an expert on ML tuning parameters, I'm simply laying out the context in one example. Many clients will have differing needs. Large numbers of possible matches come up a lot in biology and linguistics.

The word with the most meanings in English is the verb 'set', with 430 senses listed in the Second Edition of the Oxford English Dictionary, published in 1989. The word commands the longest entry in the dictionary at 60,000 words, or 326,000 characters.

epaulson commented 3 years ago

This issue's been quiet for a while but I hope there's still interest in it. I've been thinking that one thing the recon API might benefit from is having some notion of a session. You'd establish a session ID with the recon API, probably after getting the manifest but before doing your first reconcile query. Then, in your reconciliation queries, you could include the session ID, and if the server wants to track any state for you, the server can.

The use case I was thinking of is a variation on the feedback-for-weights idea: perhaps I've got some pre-labeled examples that I'd like to send to the recon API that might help it fine-tune its weights for my upcoming queries. I would rather not include that batch of examples in my actual queries, partly to keep the amount of data that I have to send lower, and partly because the server-side API might need a bit of time to fine-tune its model for my session based on the examples that I just sent it, so I don't want to send my queries right away. (Ideally, I could poll the server to find out when my session is fully trained and ready to go.)
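
Roughly, the flow I have in mind looks like this (every endpoint and field name here is hypothetical):

```python
import json
import time
import requests

BASE = "https://example.org/reconcile"

# 1. Establish a session (after fetching the manifest).
session_id = requests.post(f"{BASE}/session").json()["session_id"]

# 2. Upload pre-labeled examples for the service to fine-tune on.
examples = [{"query": "Washington", "city": "Seattle", "match": "Q12345"}]  # placeholder QID
requests.post(f"{BASE}/session/{session_id}/examples", json=examples)

# 3. Poll until the session's model has finished fine-tuning.
while requests.get(f"{BASE}/session/{session_id}/status").json()["state"] != "ready":
    time.sleep(5)

# 4. Include the session ID in subsequent reconciliation queries.
queries = {"q0": {"query": "Washington", "properties": [{"pid": "city", "v": "Seattle"}]}}
requests.post(BASE, data={"queries": json.dumps(queries), "session": session_id})
```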

This is basically the same thing as sending feedback on which choices out of the candidates the human actually chose. The recon API is really close to being a great API for some human-in-the-loop workflows, but I think it needs something more like a session to really be able to be used as a loop.

As an added bonus, I think it would give the server some implementation flexibility to manage privacy concerns, to make the API more asynchronous and better able to recover from crashes on the client side, and maybe to be smarter about any caching policies the server might want to manage, especially if the service is doing some load-balancing across different backend servers.

It certainly could be an optional extension for the recon API and servers could decide not to support it, and clients just treat the service as the best-effort API that exists today.

But overall, I think an explicit notion of a session in the API would help with the implementation of the ideas in this issue.

(I think the API might want to still also go a bit farther and make the actual reconciliation query POSTs include a request ID/handle so individual batches from each POST can be async/recoverable, but even then there's still value in having an overarching session and adding request IDs to the batches could be phase 2)

epaulson commented 3 years ago

I put together an example server that implements some of my session idea. The writeup was a bit long for a comment, so I posted it as a message on the mailing list:

https://lists.w3.org/Archives/Public/public-reconciliation/2021Jun/0000.html

(Also, about 50% of it is not about ML feedback, but is instead about using dedupe, which is a state-of-the-art open source entity matching library that trains a matcher from a handful of examples, as the matching engine)

The workflow isn't quite what Antonin described, where you'd give feedback on the results of individual batches. The approach I took is necessary for other use cases, and can sort of subsume that workflow - but it would also be easy to add in support for it directly.

One thing to keep in mind is that this adds a bunch of new state at the server. My approach adds a session that you add positive match examples into, but if you want to give feedback on the results of individual queries, you'd also need to keep track of the individual query IDs from the reconciliation query batch (e.g. the 'q0', 'q1', 'q2', etc. from the examples) - in fact, you'd probably need to say that the batch itself needs an ID, or at least that the 'q0', 'q1', etc. in each batch somehow need to be made unique. Then you have to figure out how long to keep that around for: can a user give feedback on a match the reconciliation API suggested a year ago?

The reconciliation API/protocol is really close to what you need for a good human-in-the-loop entity matching system, and a few more batch IDs would help make the protocol more async-friendly, which would help OpenRefine's UX when reconciling or running the data extension API.

thadguidry commented 3 years ago

@epaulson Do your server and session take into account "disambiguating data", such as extra columns of data or properties that a human used in their own decision loop to make their match? And is this "disambiguating choice data" uploaded back to the server via the API, or is only the chosen reconciliation candidate uploaded? Classification systems will need it all if there is to be any reasonable expectation of getting something useful, I think. See the use case I linked above: https://github.com/wetneb/openrefine-wikibase/issues/117

epaulson commented 3 years ago

@thadguidry for additional 'disambiguating data' while searching for a match I didn't do anything special protocol-wise, the existing 'properties' option in a reconciliation query seemed great - if you're matching on a column for 'university name' and the row is 'Washington', adding a property for 'city' lets you figure out the one with 'Seattle' and the one with 'Lexington'. But that was of course already included in the existing API specs and OpenRefine has a nice UI for it.

For training purposes, you would upload those additional columns as part of the set of labeled examples. For example, if your unlabeled dataset is two columns, <university_name, city>, and you want to reconcile/predict the QID of the actual university, the protocol extension I envisioned meant that you uploaded a small dataset with three columns, <university_name, city, QID>, and the system trained on that for every future unlabeled example you queried for in your session.
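
Concretely, such an upload might look like this (the endpoint and field names are illustrative, not standardized anywhere):

```python
import requests

# Three-column training set: the two input columns plus the known QID.
training_examples = [
    {"university_name": "Washington", "city": "Seattle",   "match": "Q12345"},  # placeholder QIDs
    {"university_name": "Washington", "city": "Lexington", "match": "Q67890"},
]

requests.post(
    "https://example.org/reconcile/session/SESSION_ID/train",  # hypothetical endpoint
    json={"examples": training_examples},
)
```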

The DeDupe Python library that my test server uses has support for including multiple columns while training and making match decisions, so it was easy to include in my test server. DeDupe might be right for some future API implementors, but I think most will want to write their own matching algorithm.

For reporting back which choices were made for a set of specific rows - i.e. a human turned a bunch of unlabeled/unmatched rows into labeled examples - I decided not to special-case that. It didn't seem compelling to have a different endpoint for it: after a human has reconciled a set of unmatched data into a set of matched data, they can just treat that newly human-matched data as more training examples and upload it to the existing endpoint as additional training data.

One thing that this means is that I did NOT require the client to save the candidates that were returned for each query. If the server wants to use those non-matched candidates in its training, I assumed the server would have to reconstruct what it would have sent to the client for that choice - e.g. if the training example is <uname1, city1, QID1234>, I assumed that the server would know that for <uname1, city1> it would have suggested <QID1234, QID1235, QID1236> as its original match candidates and can train itself as appropriate. You could certainly make the argument that the protocol for training responses should send back something like <input: {uname1, city1}, Match: {QID1234}, NonMatch: {QID1235, QID1236}>.
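
Rendered as a concrete (still hypothetical) payload, that argument would amount to something like:

```python
# One training example carrying both the accepted match and the
# candidates the service suggested but the user rejected, which can
# serve as negative examples; all field names are made up.
training_example = {
    "input": {"university_name": "uname1", "city": "city1"},
    "match": "QID1234",
    "non_matches": ["QID1235", "QID1236"],
}
```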

Including example non-matches is really important for most entity matching systems, so the set of suggested candidates is absolutely something to think about including - but again, that could just be part of the regular training endpoint and might not have to be called out as a separate API. (If you haven't given DeDupe.io a try or watched a training video for it, you really should - they've got a nice UX for this.)

In my server I just made up a new format for what to send to the server for training examples, but the idea in https://github.com/wetneb/openrefine-wikibase/issues/117 is to use the operation history file of OpenRefine, which is an appealing idea. There's a lot of non-relevant info included in that file but the OpenRefine operation history might be a nice reuse of existing code.

The other difference between what I'm suggesting and what #117 is suggesting is that I was scoping it all to a "session", which for privacy I figured would belong to a single user, and not a larger collaborative effort, but I think you could make it work for an opt-in collaboration. It would simplify the server, too, because it would eliminate some questions of privacy on the server. (Which I think will also help with scaling when that eventually matters)

thadguidry commented 3 years ago

@epaulson Thanks for the feedback, Erik. Yes, I've played with the dedupe library before, and I thoroughly enjoyed reading Mikhail's dissertation well after it was published - back in the day, a bit after Gridworks (later OpenRefine) came into being, Stefano's clustering notes somehow pointed me to it, as I recall.

thadguidry commented 3 years ago

@epaulson You might also be interested in https://arrow.apache.org/docs/format/Flight.html