speedydeletion / wikiproc

processing tools for the wiki

https://meta.wikimedia.org/wiki/Research:Data#IRC_Feeds #3

Open h4ck3rm1k3 opened 6 years ago

h4ck3rm1k3 commented 6 years ago

Extract deleted articles from the recent changes irc feed

leucosticte commented 6 years ago

Some of the bots use the IRC feed, but I think it's usually preferable to poll the API, since otherwise some items might get missed (e.g. if the bot or the IRC feed goes down). The API lets you resume polling where you left off.
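A minimal sketch of the polling approach, assuming the MediaWiki Action API's `list=logevents` module on English Wikipedia. The function names, the `API_URL` value, and the User-Agent string are hypothetical and chosen only for illustration; nothing here is taken from the wikiproc code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; any MediaWiki api.php endpoint should work the same way.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_deletion_query(start_timestamp, limit=50):
    """Build Action API parameters that list deletion log events from start_timestamp onward."""
    return {
        "action": "query",
        "list": "logevents",
        "letype": "delete",          # only entries from the deletion log
        "lestart": start_timestamp,  # resume point, e.g. "2018-04-18T00:00:00Z"
        "ledir": "newer",            # oldest first, so lestart acts as "where we left off"
        "lelimit": str(limit),
        "leprop": "title|timestamp|type",
        "format": "json",
    }


def fetch_deletions(start_timestamp, api_url=API_URL):
    """Fetch one batch of deletion log events (performs a network request)."""
    params = build_deletion_query(start_timestamp)
    url = api_url + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; a real bot should identify itself properly.
    request = urllib.request.Request(url, headers={"User-Agent": "wikiproc-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return data.get("query", {}).get("logevents", [])
```

Calling `fetch_deletions("2018-04-18T00:00:00Z")` would return the first batch of deletions at or after that time; passing the timestamp of the last event seen as the next start is what lets the bot resume after an outage.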

h4ck3rm1k3 commented 6 years ago

https://github.com/hatnote/wikimon provides a data feed via Python.

leucosticte commented 6 years ago

Which of us is going to do this task? I don't mind giving it a try.

What kind of output are you looking for -- a MediaWiki database full of these articles, or something else?

h4ck3rm1k3 commented 6 years ago

Just a list of articles we want to download, to start with. I am very busy right now. Let's try the rt feed for now; if it goes down, we can still poll.

On Wed, Apr 18, 2018 at 4:14 PM, Nathan notifications@github.com wrote:

Which of us is going to do this task? What kind of output are you looking for -- a MediaWiki database full of these articles, or something else?

-- James Michael DuPont

leucosticte commented 6 years ago

So is the idea to see when there's a deletion log event in the recent changes feed, and then grab that article from the dump?
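As a rough sketch of the "spot the deletion log event" step: the code below assumes each feed item has already been parsed into a dict with `type`, `log_type`, `log_action`, and `title` fields, as in MediaWiki's JSON recent-changes events. That shape is an assumption; the raw IRC feed would need its own parsing first, which is not shown. The function names are hypothetical.

```python
def is_deletion(event):
    """Return True if a parsed recent-changes event is a page deletion log entry."""
    return (
        event.get("type") == "log"
        and event.get("log_type") == "delete"
        and event.get("log_action") == "delete"
    )


def deleted_titles(events):
    """Collect the titles of deleted pages, in order of first appearance, without duplicates."""
    seen = set()
    titles = []
    for event in events:
        title = event.get("title")
        if is_deletion(event) and title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


# Hypothetical sample events, for illustration only.
sample_events = [
    {"type": "edit", "title": "Some article"},
    {"type": "log", "log_type": "delete", "log_action": "delete", "title": "Short-lived page"},
    {"type": "log", "log_type": "delete", "log_action": "restore", "title": "Restored page"},
    {"type": "log", "log_type": "delete", "log_action": "delete", "title": "Short-lived page"},
]
```

The resulting list is the "list of articles we want to download" mentioned earlier in the thread; restores and other log actions are skipped on purpose.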

I coded something like this before, but the approach I used was to just grab all revisions of all articles. The reason is that if an article gets created one minute and deleted the next, grabbing it as soon as it shows up means you'll still have it even after it's deleted from Wikipedia. Otherwise, you lose the ability to obtain those revisions once the article is deleted. Your only option then (assuming you don't have a sysop account on Wikipedia) would be to look at the dump.
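For illustration, a sketch of how "grab all revisions" might look as Action API requests, assuming `prop=revisions` with the standard `continue` mechanism. This is not the MirrorTools code; the function names are hypothetical, and only the request parameters are built here (no network call).

```python
def build_revisions_query(title, limit=50):
    """Build Action API parameters that request every revision of one page, oldest first."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|content",
        "rvslots": "main",   # fetch the main content slot of each revision
        "rvdir": "newer",    # start from the first revision
        "rvlimit": str(limit),
        "format": "json",
    }


def next_request(params, response):
    """Return parameters for the next batch, or None when the API reports no continuation."""
    continuation = response.get("continue")
    if not continuation:
        return None
    merged = dict(params)
    merged.update(continuation)  # the API expects its "continue" values echoed back
    return merged
```

A caller would send `build_revisions_query(title)`, store the revisions from the response, then keep sending whatever `next_request` returns until it yields `None`.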

I have some documentation of the approach I was using, over at MediaWiki.org. https://www.mediawiki.org/w/index.php?title=Extension:MirrorTools&oldid=2497149

Part of my reason for polling is that you can tell the API to start at a given timestamp, so you can easily pick up where you left off if the bot gets interrupted.
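One way the "pick up where you left off" part could be handled is a small checkpoint file holding the last timestamp seen. This is a sketch under assumptions: the JSON file layout, the `last_timestamp` key, and the function names are all hypothetical, and timestamps are assumed to be ISO 8601 UTC strings of identical format (so string comparison matches time order).

```python
import json
import os
import tempfile


def load_checkpoint(path, default):
    """Return the saved timestamp, or default if no usable checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)["last_timestamp"]
    except (FileNotFoundError, KeyError, json.JSONDecodeError):
        return default


def save_checkpoint(path, timestamp):
    """Write the checkpoint to a temp file, then rename it over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"last_timestamp": timestamp}, handle)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint behind


def advance(checkpoint, events):
    """Return the newest timestamp among the events, never moving backwards."""
    timestamps = [event["timestamp"] for event in events if "timestamp" in event]
    return max(timestamps + [checkpoint])
```

After each polled batch, the bot would call `save_checkpoint(path, advance(checkpoint, events))`; on restart, `load_checkpoint` supplies the start timestamp for the next API query. Because the resume point is inclusive, the event at the checkpoint may be returned again, so the consumer should tolerate duplicates.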

h4ck3rm1k3 commented 6 years ago

Ok, we are going to need to actually talk about this more.