rfaulkner / wikipedia_user_metrics

Wikimedia Foundation E3 Team Analysis Code

Generate user registration counts from logging table (not user table) #4

Closed rfaulkner closed 11 years ago

rfaulkner commented 11 years ago

https://github.com/rfaulkner/E3_analysis/blob/6c17247a9885379aea4f034e6e41c8cea88a4b65/src/metrics/threshold.py

The user table also records "auto creation" events for non-attached users on enwiki, so counts grouped by user_registration are inflated. The logging table provides more reliable counts of new registrations on enwiki.

From D. Taraborelli:

After a little bit of research, the logging table has some extra log_action types that may explain the inflated user counts when grouping by user_registration:

1) create2: this is for proxy-registered users; we had 49 such users on 2012-09-05
2) autocreate: this is for locally reserved user_ids generated on another wiki; we had a whopping 1947 events logged with this log_action on 2012-09-05

The sum of create, create2 and autocreate produces a total (6357) that is much closer to the figure from the user table (6437) but there are still 80 users missing.
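The arithmetic behind those figures can be checked with a short sketch. Only the four quoted numbers come from the comment; the count of plain create events is not stated there and is derived here by subtraction, so treat it as an implied value rather than a reported one.

```python
# Figures quoted in the comment for 2012-09-05.
create2_events = 49
autocreate_events = 1947
logging_total = 6357      # create + create2 + autocreate
user_table_total = 6437   # users grouped by user_registration

# Derived, not quoted: plain 'create' events implied by the total.
implied_create_events = logging_total - create2_events - autocreate_events

# Users present in the user table but not explained by the three log actions.
unexplained_gap = user_table_total - logging_total

print(implied_create_events)  # 4361
print(unexplained_gap)        # 80
```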

I think it's safe to use WHERE log_action = 'create' AND log_type='newusers' as a condition to identify genuine on-wiki registrations.
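A minimal sketch of how that condition might be applied to count registrations per day. It is not the code from the linked commit: it uses an in-memory SQLite table as a stand-in for the MediaWiki logging table, the rows are made up for illustration, and the helper name count_registrations is hypothetical. It assumes log_timestamp holds a 14-character YYYYMMDDHHMMSS string.

```python
import sqlite3


def count_registrations(conn, day):
    """Count genuine on-wiki registrations logged on `day` (YYYYMMDD).

    Hypothetical helper: filters on log_type = 'newusers' and
    log_action = 'create', which excludes create2 and autocreate events.
    """
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM logging
        WHERE log_type = 'newusers'
          AND log_action = 'create'
          AND log_timestamp LIKE ?
        """,
        (day + "%",),
    ).fetchone()
    return row[0]


# Toy stand-in for the logging table; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logging (log_type TEXT, log_action TEXT, log_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO logging VALUES (?, ?, ?)",
    [
        ("newusers", "create", "20120905101500"),
        ("newusers", "create", "20120905113000"),
        ("newusers", "create2", "20120905120000"),     # proxy-registered
        ("newusers", "autocreate", "20120905121000"),  # reserved from another wiki
        ("newusers", "create", "20120906090000"),      # different day
        ("block", "block", "20120905130000"),          # unrelated log type
    ],
)

sample_count = count_registrations(conn, "20120905")
print(sample_count)  # 2
```

Against a production replica the same WHERE clause would apply, though a timestamp range comparison is likely to use an index better than a LIKE prefix match.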

rfaulkner commented 11 years ago

fixed: https://github.com/rfaulkner/E3_analysis/commit/34c1d75abccf5dfe8115adbc8a661554bd0e86ea