Closed: @bkowshik closed this issue 7 years ago
Awesome research into vandalism detection on Wikipedia @bkowshik. The wiki community has a mature bot policy and encourages focused, effective mechanical editing, which has built a community-curated ecosystem of AI workers that is highly effective at quickly fixing the most common problems. A large academic community is interested in the mechanics of this, and the associated research has further helped strengthen the defenses.
To compare, the OSM Automated Edits Policy has not evolved much. Validation is a good angle: a few bots could catch simple issues like invalid capitalization in a tag, e.g. Highway=residential instead of highway=residential.
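As a rough illustration, a validation check like the capitalization one above could be a few lines of code. This is a hypothetical sketch, not part of any existing bot; the function name and tag data are made up:

```python
# Hypothetical validation check: flag OSM tag keys that contain
# uppercase letters (e.g. "Highway") and suggest the lowercased form.

def flag_miscapitalized_keys(tags):
    """Return {bad_key: suggested_key} for keys containing uppercase letters."""
    return {key: key.lower() for key in tags if key != key.lower()}

tags = {"Highway": "residential", "name": "Main Street"}
print(flag_miscapitalized_keys(tags))  # {'Highway': 'highway'}
```

A real bot would also need to consult the list of documented tag keys, since a handful of legitimate keys are not all-lowercase.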
You can check a model's statistics by dropping the revision ID from the path.
{
"params": {
"balanced_sample": false,
"balanced_sample_weight": true,
"center": true,
"init": null,
"learning_rate": 0.01,
"loss": "deviance",
"max_depth": 7,
"max_features": "log2",
"max_leaf_nodes": null,
"min_samples_leaf": 1,
"min_samples_split": 2,
"min_weight_fraction_leaf": 0.0,
"n_estimators": 700,
"presort": "auto",
"random_state": null,
"scale": true,
"subsample": 1.0,
"verbose": 0,
"warm_start": false
},
"table": {
"false": {
"false": 15563,
"true": 2551
},
"true": {
"false": 457,
"true": 962
}
},
"precision": {
"false": 0.971,
"true": 0.274
},
"trained": 1491356274.077835,
"type": "GradientBoosting",
"version": "0.3.0"
}
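The precision values reported above can be reproduced from the confusion table, assuming the outer key is the actual label and the inner key the predicted label (a reading consistent with the reported numbers):

```python
# Recompute the precision values from the confusion table in the
# model statistics above. Assumes rows are actual labels and
# columns are predicted labels.

table = {
    "false": {"false": 15563, "true": 2551},
    "true": {"false": 457, "true": 962},
}

def precision(table, label):
    """Precision for `label`: correct predictions of `label`
    divided by all predictions of `label`."""
    predicted = sum(row[label] for row in table.values())
    return table[label][label] / predicted

print(round(precision(table, "false"), 3))  # 0.971
print(round(precision(table, "true"), 3))   # 0.274
```

The low precision of 0.274 for the "true" class reflects the heavy class imbalance: problematic revisions are rare, so even a good model produces many false positives relative to true positives.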
Vandalism detection on OpenStreetMap is similar to vandalism detection on Wikidata; both are structured datasets. With Wikipedia, things are different due to the more free-flowing nature of the text. I was curious to see how ORES, a machine-learning-as-a-service platform that Wikimedia projects use for vandalism detection and removal, works for Wikidata. The following is what I found.
There are 3 main models for Wikidata:
It looks like there are 5,000 samples that are manually labelled and 20,000 samples that are auto-labelled.
It looks like all 3 kinds of models (reverted, damaging, and goodfaith) make use of the same set of features. The list of attributes can be found at the link below:
A bigger list of attributes can be found at the link below:
Model tuning reports:
Models for both Wikipedia and Wikidata get prepared together with a Makefile, where datasets are downloaded, features are extracted, models are trained, and reports are generated.
Properties of the deployed model can be viewed at the link below:
This has been super-helpful. No next actions here. Closing.
NOTE: This is a work in progress. Posting here to start a discussion around the topic.
Wikimedia uses Artificial Intelligence for the following broad categories:
On Wikipedia there are 160k edits, 50k new articles, and 1,400 new editors every day. The goal is to split the 160k edits into:
Themes for validation
Welcoming newcomers
Attracting more newcomers is a major Wikimedia goal, and new spaces have been developed to support them. Quality control on Wikipedia is being designed with newcomer socialization in mind, so that newcomers (especially those who don't conform) are not marginalized and good-faith newcomers are retained. Although anonymous edits on Wikipedia are twice as likely to be vandalism, 90% of anonymous edits are good.
From this Slate article:
Popular validation tools
There are around 20 volunteer-developed tools and 3 major Wikimedia product initiatives. Some popular ones are:
There is a basic web interface for ORES at https://ores.wikimedia.org/ui. Some of the features used to classify a revision as problematic or not are: whether the user is anonymous; the number of characters/words added, modified, and removed; and the number of repeated characters and bad words added. Prediction scores for a problematic revision look like the one below:
https://ores.wmflabs.org/scores/enwiki/damaging/642215410
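A minimal sketch of consuming such a score programmatically. The payload shape below (a mapping from revision ID to a prediction and per-class probabilities) is an assumption based on what the scores endpoint returned at the time, and the 0.5 threshold is illustrative:

```python
# Hypothetical sketch: decide whether a revision needs human review
# from an ORES "damaging" score. The payload shape (revision ID ->
# prediction and class probabilities) and the threshold are assumptions.

def needs_review(scores, rev_id, threshold=0.5):
    """Return True if the 'true' (damaging) probability exceeds the threshold."""
    return scores[rev_id]["probability"]["true"] > threshold

# Example payload, mirroring a score like the one at the URL above
# (the probability values here are made up).
sample = {
    "642215410": {
        "prediction": True,
        "probability": {"false": 0.08, "true": 0.92},
    }
}
print(needs_review(sample, "642215410"))  # True
```

A patrolling tool would typically sort incoming revisions by this probability so that reviewers see the most likely vandalism first.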
There has been quite a lot of research in this field, as is evident from the number of results on Google Scholar for Wikipedia vandalism detection.
Hyperlinks
Reading
Videos
cc: OpenStreetMap Community