google / wikiloop-doublecheck

WikiLoop DoubleCheck: a web tool to help review Wikipedia edits easily and collaboratively.
http://doublecheck.wikiloop.org
Apache License 2.0

Help migrating away from ORES #444

Open isaranto opened 11 months ago

isaranto commented 11 months ago

Hi! I am part of the Wikimedia ML team. We are starting to migrate ORES clients to another infrastructure, since we are planning to deprecate ORES. More info at https://wikitech.wikimedia.org/wiki/ORES

TL;DR:

The ORES infrastructure is going to be replaced by Lift Wing, a more modern, Kubernetes-based service.

All the ORES models (damaging, goodfaith, etc.) are running on Lift Wing; more on how to use them at https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage

We have new models called Revert Risk that can replace, for example, goodfaith and damaging. They are available on Lift Wing, and we'd like to offer them as a valid, more precise and more performant alternative to the ORES models.

If you'd like to try them, we'd be glad to help with the migration process!

Thanks in advance,

ML team

xinbenlv commented 11 months ago

Hi @isaranto, that would be awesome!

AikoChou commented 11 months ago

Hello! We have noticed that Wikiloop might be using the mediawiki.revision-score stream. However, the mediawiki.revision-score stream will also be deprecated with ORES. For users who use the stream, the Wikimedia ML team plans to offer several streams, each associated with a single model score, such as:

mediawiki.revision-score-goodfaith
mediawiki.revision-score-damaging

Alternatively, we have new models called Revert Risk to replace goodfaith and damaging, and we could provide a stream for the revert-risk score.

If Wikiloop is currently ingesting events from the mediawiki.revision-score stream, please let us know your preference.

You can find more information in our thread: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/X5KUTNHW646KYGE7V6SDSHVGVOL5DFDX/

elukey commented 10 months ago

@xinbenlv Hi! Is what @AikoChou wrote good in your opinion? We are trying to figure out remaining users of the revision-score stream :)

xinbenlv commented 10 months ago

I will take a look. Thank you!

xinbenlv commented 10 months ago

It would be great if we could get a score of "borderline-ness", because we want to let humans prioritize reviewing the edits that are borderline between damaging and goodfaith.

elukey commented 10 months ago

> It would be great if we could get a score of "borderline-ness", because we want to let humans prioritize reviewing the edits that are borderline between damaging and goodfaith.

@xinbenlv could you clarify the above point? More specifically, we'd need to understand whether you need streams or whether you'd be happy to query the new API (https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage).

We also offer a new model called Revert Risk Language Agnostic (specs, API) that should be a replacement for both damaging and goodfaith (those are still available via Lift Wing, though, if needed).
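For anyone evaluating the on-demand route, here is a minimal TypeScript sketch of what querying Revert Risk Language Agnostic might look like. The endpoint path, the request fields (`rev_id`, `lang`) and the response shape are assumptions to be checked against the Lift Wing usage page linked above; the helper names are hypothetical and exist only for illustration.

```typescript
// ASSUMPTION: base URL and model path as described on the Lift Wing usage page;
// verify both there before relying on them.
const LIFT_WING_BASE = "https://api.wikimedia.org/service/lw/inference/v1/models";

// Plain description of the HTTP request, so it can be handed to any HTTP client.
interface RevertRiskRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the request for one revision of one language edition.
function buildRevertRiskRequest(revId: number, lang: string): RevertRiskRequest {
  return {
    url: `${LIFT_WING_BASE}/revertrisk-language-agnostic:predict`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ rev_id: revId, lang: lang }),
  };
}

// ASSUMPTION: simplified response shape; only the fields used below are modeled.
interface RevertRiskResponse {
  revision_id: number;
  output: {
    prediction: boolean;
    probabilities: { "true": number; "false": number };
  };
}

// Hypothetical helper: the model's probability that the revision should be reverted.
function revertProbability(response: RevertRiskResponse): number {
  return response.output.probabilities["true"];
}
```

The request object can be sent with any client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`, and the parsed JSON passed to `revertProbability`.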

xinbenlv commented 10 months ago

Let me give a bit of context about why we use ORES in WikiLoop DoubleCheck in the first place: WikiLoop DoubleCheck intends to "put humans in the loop" for fact checking with "AI support", so we use ORES to find "borderline suspicious edits".

"Borderline means:

With such context, what's your suggested API?

elukey commented 10 months ago

@xinbenlv thanks for the explanation! I'd go for Revert Risk for two reasons:

1) It is a brand new model, trained with recent data, and fully supported by the WMF Research team. The goodfaith/damaging models are still supported, but they will not be improved any further, since they are old and difficult to manage (so we'd prefer to simply deprecate them in the future).

2) It gives a single score for a specific rev-id, a value that tells how confident the model is that a revert needs to happen. Based on this score you can decide whether the edit falls into the obviously good/bad cases or not. The score is basically a probability, so something like 1-10% or 95-99% could be ranges where you don't want a human involved, while for the rest you do (I am writing numbers without much thinking, just to give an idea :)).
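To make the idea concrete, here is a rough TypeScript sketch of triaging edits by revert probability. The cutoffs (0.10 and 0.95) only echo the off-the-cuff ranges above and are not tuned values, and the "borderline-ness" formula is a hypothetical choice (1 when the probability is 0.5, 0 at either extreme), not something the model provides.

```typescript
type Triage = "likely-good" | "needs-human-review" | "likely-bad";

interface TriageThresholds {
  lowCutoff: number;  // at or below this probability, treat the edit as obviously fine
  highCutoff: number; // at or above this probability, treat the edit as obviously bad
}

// ASSUMPTION: placeholder cutoffs taken from the rough ranges in the comment above.
const EXAMPLE_THRESHOLDS: TriageThresholds = { lowCutoff: 0.1, highCutoff: 0.95 };

// Classifies a revert probability in [0, 1] into one of three buckets.
function triage(p: number, t: TriageThresholds = EXAMPLE_THRESHOLDS): Triage {
  if (isNaN(p) || p < 0 || p > 1) {
    throw new RangeError(`probability out of range: ${p}`);
  }
  if (p <= t.lowCutoff) return "likely-good";
  if (p >= t.highCutoff) return "likely-bad";
  return "needs-human-review";
}

// Hypothetical "borderline-ness": highest when the model is least certain.
function borderlineness(p: number): number {
  return 1 - Math.abs(2 * p - 1);
}

interface ScoredEdit {
  revId: number;
  revertProbability: number;
}

// Keeps only edits that need a human and puts the most borderline ones first.
function reviewQueue(edits: ScoredEdit[]): ScoredEdit[] {
  return edits
    .filter((e) => triage(e.revertProbability) === "needs-human-review")
    .sort((a, b) => borderlineness(b.revertProbability) - borderlineness(a.revertProbability));
}
```

With a single probability per revision, a separate "borderline-ness" score is not strictly needed: the distance from 0.5 already gives an ordering for the human review queue.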

On the implementation side, we (as ML WMF) are trying to deprecate the revision-score stream from https://stream.wikimedia.org, since we'd like to break it down into multiple ones. Basically, instead of having a lot of scores from different models for every revision-id (like in revision-score), we will have a stream for every model (rev-id -> model score). We still don't have a stream for Revert Risk, but we are planning to add one soon-ish.
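As an illustration of the difference in shape, here is a small TypeScript sketch that splits one combined event into per-model records. The interfaces are deliberately simplified and hypothetical; they are not the actual event schemas of revision-score or of the planned per-model streams.

```typescript
// HYPOTHETICAL, simplified score record; the real stream schemas may differ.
interface ModelScore {
  prediction: boolean;
  probability: number; // probability of the "true" class
}

// Combined style: one event per revision, carrying scores from several models.
interface CombinedScoreEvent {
  revId: number;
  scores: Record<string, ModelScore>;
}

// Per-model style: one event per (revision, model) pair.
interface SingleModelEvent {
  revId: number;
  model: string;
  score: ModelScore;
}

// Splits a combined event into one record per model, in alphabetical model order.
function splitByModel(event: CombinedScoreEvent): SingleModelEvent[] {
  return Object.keys(event.scores)
    .sort()
    .map((model) => ({ revId: event.revId, model: model, score: event.scores[model] }));
}
```

A consumer that only cares about one model would then subscribe to that model's stream instead of filtering every combined event.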

We checked your code and found references to revision-score, so what we are wondering is:

1) Are you still actively consuming data from it? Or do you get your scores directly from the ORES API on demand?

2) If you use the stream, would it be ok to move to another stream (like Revert Risk, if you decide to migrate to that model) during the next couple of months (waiting for us to make it available)? In this case you would be without any data from revision-score in the meantime, since we'd deprecate it for good.

We don't want to break users, so we are trying to follow up as best as we can to support all of you :) Lemme know!

elukey commented 10 months ago

To be more precise: https://github.com/google/wikiloop-doublecheck/blob/master/server/ingest/ores-stream.ts#L26

The above is the snippet of code that we are referring to, but since I don't see any trace of traffic from you related to it, I am wondering if it is running or not :)

elukey commented 10 months ago

@xinbenlv thoughts? :)

xinbenlv commented 10 months ago

Sorry for the late response. Let me take a look.

elukey commented 10 months ago

Thanks! We have already stopped the stream (https://phabricator.wikimedia.org/T342116), lemme know if it impacts your project.