rehamaltamimi / wikipedia-map-reduce

Automatically exported from code.google.com/p/wikipedia-map-reduce

User identification dependent on source of user data #2

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
User info can be gathered from <revision> data or from <article> data.  From
<revision> data, we get either a uID/Name pair or just an IP.  From <article>
data, we get a Name or an IP, plus an aID, but no uID.
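
A minimal sketch of the two cases in Python, assuming a single record type with optional fields. The names `UserRef`, `from_revision`, and `from_article` are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserRef:
    """What is known about one user; every field may be missing."""
    name: Optional[str] = None
    user_id: Optional[int] = None     # uID, only available from <revision> data
    ip: Optional[str] = None          # anonymous users are known only by IP
    article_id: Optional[int] = None  # aID, only available from <article> data


def from_revision(name=None, user_id=None, ip=None):
    """<revision> data: either a uID/Name pair or just an IP."""
    if ip is not None:
        return UserRef(ip=ip)
    if name is None or user_id is None:
        raise ValueError("expected a uID/Name pair or an IP")
    return UserRef(name=name, user_id=user_id)


def from_article(article_id, name=None, ip=None):
    """<article> data: a Name or an IP, plus an aID, never a uID."""
    if (name is None) == (ip is None):
        raise ValueError("expected exactly one of name or ip")
    return UserRef(name=name, ip=ip, article_id=article_id)
```

With this shape, a record built by `from_article` always has `user_id` set to `None`, which is the gap this issue is about.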

Original issue reported on code.google.com by colin.t....@gmail.com on 4 Jun 2010 at 11:08

GoogleCodeExporter commented 8 years ago
User information by namespace:

  * Default - none from <article> tags; editors' name/id from each <revision>'s <contributor> tag.
  * Talk - none from <article> tags; editors' name/id from each <revision>'s <contributor> tag.
  * User - owner's name from <article> tags; owner's name/id from a <revision>'s <contributor> tag ONLY IF the owner has edited their own User: page.
  * User talk - owner's name from <article> tags; owner's name/id from a <revision>'s <contributor> tag ONLY IF the owner has edited their own User talk: page; other editors' name/id from each <revision>'s <contributor> tag.
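
The rules for page owners in the list above can be sketched as a small lookup. This is an illustration only: `OWNER_NAMESPACES` and `owner_fields` are hypothetical names, and the namespace strings are assumed to match the labels used in the list.

```python
# Namespaces whose pages have an owner, per the list above.
OWNER_NAMESPACES = {"User", "User talk"}


def owner_fields(namespace, owner_edited_page):
    """Return which fields are known about a page's owner.

    Default and Talk pages have no owner, so nothing is known.
    For User and User talk pages the name comes from <article> data;
    the id is known only if the owner edited the page themselves.
    """
    if namespace not in OWNER_NAMESPACES:
        return set()
    fields = {"name"}
    if owner_edited_page:
        fields.add("id")
    return fields
```

For example, `owner_fields("User talk", False)` yields only the name, which is the case where the owner cannot be reliably identified.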

To summarize, we're guaranteed complete information only about editors, not
about owners of User: and User talk: pages.  We get complete information on a
page owner only if that owner also edited their own page.  Furthermore, the
name alone, which is all we have for owners who never edited their own pages,
is not sufficient for identification, since users can change their names.

Original comment by colin.t....@gmail.com on 24 Jun 2010 at 7:16

GoogleCodeExporter commented 8 years ago
One possible solution is to ignore all users without complete information, that 
is, without BOTH a name and an ID.  Another option is to just ignore those with 
only names.

Either way, this causes user-based analysis to be less reliable, since not all 
users who have activity in the dataset will be in the resulting graph.

I've decided to ignore users without IDs for now; we can change it back later
if needed.
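
A minimal sketch of that filter, assuming each user is a plain dict with optional `name`, `id`, and `ip` keys. The function name `keep_identified` and the record layout are hypothetical. Note that under this assumption IP-only users are dropped as well, since they carry no ID.

```python
def keep_identified(users):
    """Keep only the user records that carry an ID."""
    return [u for u in users if u.get("id") is not None]


users = [
    {"name": "Alice", "id": 42},  # complete: name and ID
    {"name": "Bob", "id": None},  # name only, e.g. a page owner who never edited
    {"ip": "192.0.2.1"},          # anonymous editor, no ID
]

identified = keep_identified(users)
# identified == [{"name": "Alice", "id": 42}]
```

Switching to the other option (dropping only name-only users) would mean changing the predicate to keep records that have either an ID or an IP.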

Original comment by colin.t....@gmail.com on 24 Jun 2010 at 7:20