speedydeletion / wikiproc

processing tools for the wiki

Extract a list of deleted articles from the wikipedia #1

Closed h4ck3rm1k3 closed 6 years ago

h4ck3rm1k3 commented 6 years ago

We need a tool that will extract a list of currently deleted articles from Wikipedia. It should emit the titles one per line, sorted alphabetically. Comparing the current file with the previous one will tell us which new items to fetch.
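
For the comparison step, something like this sketch could work, assuming each run is saved as a sorted one-title-per-line text file (the file names are placeholders):

<?php
// Hypothetical sketch: compare two sorted one-title-per-line files.
// Both file names are placeholders, not part of any existing tooling.
$previous = file('deleted-previous.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$current  = file('deleted-current.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// Titles in the current list but not in the previous one are the new items to fetch.
foreach (array_diff($current, $previous) as $title) {
    echo $title, "\n";
}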

leucosticte commented 6 years ago

"Currently deleted" -- so it'll be necessary to start with the list of article deletion log events, and check that against the list of currently existing articles?

Over at https://dumps.wikimedia.org/enwiki/20180401/ there's a "List of page titles in main namespace" (enwiki-20180401-all-titles-in-ns0.gz) and a "Recombine Log events to all pages and users" (enwiki-20180401-pages-logging.xml.gz), so that could be a starting point, if that's what you have in mind.
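
If that's the idea, a rough sketch of the set difference might look like this (just an illustration: "deleted-log-titles.txt" is a placeholder for whatever gets extracted from the logging dump, and the titles file uses underscores instead of spaces, so some normalization would be needed):

<?php
// Rough sketch, assuming both dump files are already decompressed and the
// deletion-log titles have been pulled out to a one-per-line placeholder file.
// Loading every enwiki title into memory is heavy; this only shows the idea.
$existing = array_flip(file('enwiki-20180401-all-titles-in-ns0', FILE_IGNORE_NEW_LINES));
$everDeleted = file('deleted-log-titles.txt', FILE_IGNORE_NEW_LINES); // placeholder

$currentlyDeleted = [];
foreach ($everDeleted as $title) {
    if (!isset($existing[$title])) { // deleted at some point and not recreated
        $currentlyDeleted[] = $title;
    }
}
sort($currentlyDeleted, SORT_STRING);
echo implode("\n", $currentlyDeleted), "\n";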

h4ck3rm1k3 commented 6 years ago

Article deletion log events are fine, so "deleted at any point" should be a good starting point.

leucosticte commented 6 years ago

I was wondering, what's the best strategy for parsing in chunks? I was looking at this code (modified slightly from http://php.net/manual/en/function.xml-parse.php):

<?php
$stream = fopen('enwiki-20180401-pages-logging.xml', 'r');
$parser = xml_parser_create();
// set up the handlers here (element and character-data callbacks)
while (($data = fread($stream, 16384)) !== false && $data !== '') {
    if (!xml_parse($parser, $data)) { // parse the current chunk
        die(xml_error_string(xml_get_error_code($parser)) . "\n");
    }
}
xml_parse($parser, '', true); // finalize parsing
xml_parser_free($parser);
fclose($stream);

I have it running now, but I'm wondering, "Okay, where is this data going as the XML file is being parsed, and how do I access it?" I would think that sometimes, the 16384 bytes will cut off in the middle of a log event, so wouldn't that cause issues if you're trying to deal with that data one 16,384-byte chunk at a time? By the way, the XML looks like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.31.0-wmf.27</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="118" case="first-letter">Draft</namespace>
      <namespace key="119" case="first-letter">Draft talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Module talk</namespace>
      <namespace key="2300" case="first-letter">Gadget</namespace>
      <namespace key="2301" case="first-letter">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    </namespaces>
  </siteinfo>
  <logitem>
    <id>1</id>
    <timestamp>2004-12-23T03:20:32Z</timestamp>
    <contributor>
      <username>Slowking Man</username>
      <id>56299</id>
    </contributor>
    <comment>content was: '[[Media:Example.og[http://www.example.com link title][http://www.example.com link title]''Italic text'''''Bold text'''jjhkjhkjhkjhkjhjggghg]]'</comment>
    <type>delete</type>
    <action>delete</action>
    <logtitle>Vivian Blaine</logtitle>
    <params xml:space="preserve" />
  </logitem>
  <logitem>
    <id>2</id>
    <timestamp>2004-12-23T03:24:26Z</timestamp>
    <contributor>
      <username>Fredrik</username>
      <id>26675</id>
    </contributor>
    <comment>{{GFDL}} {{cc-by-sa-2.0}}</comment>
    <type>upload</type>
    <action>upload</action>
    <logtitle>File:Mini Christmas tree.png</logtitle>
    <params xml:space="preserve" />
  </logitem>
  <logitem>
    <id>3</id>
    <timestamp>2004-12-23T03:27:51Z</timestamp>
    <contributor>
      <username>Slowking Man</username>
      <id>56299</id>
    </contributor>
    <comment>content was: 'Daniel Li is an amazing human being.'</comment>
    <type>delete</type>
    <action>delete</action>
    <logtitle>Daniel Li</logtitle>
    <params xml:space="preserve" />
  </logitem>

I can figure out a way to parse it by reading the file one line at a time and using functions like strpos() (in fact, I probably have such code lying around somewhere), but I figured you probably wanted me to do it the "right" way.

h4ck3rm1k3 commented 6 years ago

The parser keeps state, so when you cross a chunk boundary it knows what to look for next and keeps the partial data in memory until the rest of it arrives.
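
For example (only a sketch; the closure-based handlers and variable names here are my own, not anything already in the repo), handlers like these would receive the character data as it arrives across chunks and could collect the deleted titles:

<?php
// Sketch of what the "set up the handlers here" comment could expand into.
$currentElement = '';
$logType = '';
$logTitle = '';

$parser = xml_parser_create();
// keep element names as-is ("logitem", "logtitle") instead of upper-casing them
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler($parser,
    function ($parser, $name, $attrs) use (&$currentElement, &$logType, &$logTitle) {
        $currentElement = $name;
        if ($name === 'logitem') { // reset per log event
            $logType = '';
            $logTitle = '';
        }
    },
    function ($parser, $name) use (&$currentElement, &$logType, &$logTitle) {
        if ($name === 'logitem' && $logType === 'delete') {
            echo $logTitle, "\n"; // one deleted title per line
        }
        $currentElement = '';
    }
);

xml_set_character_data_handler($parser,
    function ($parser, $data) use (&$currentElement, &$logType, &$logTitle) {
        // character data can arrive in several pieces, so append rather than assign
        if ($currentElement === 'type')     { $logType  .= trim($data); }
        if ($currentElement === 'logtitle') { $logTitle .= $data; }
    }
);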

leucosticte commented 6 years ago

Did you want the non-mainspace pages filtered out of the results? (e.g. Wikipedia:Foo, User:Bar, Template:Baz)

In the past, when I wanted to filter it like that, I think I had to do a strpos() to search for every namespace prefix like "File:", "Template:", etc. There could even be some defunct namespaces in there; who knows.

h4ck3rm1k3 commented 6 years ago

If you can split the namespace out into a separate column, that would be fine. I do suppose the main namespaces are the important ones.
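
Something along these lines could split the title into two tab-separated columns (just a sketch; the prefix list here is abbreviated and would really come from the <namespaces> block in the dump):

<?php
// Hypothetical helper: split "Template:Infobox person" into namespace and title.
$namespaces = ['Talk', 'User', 'User talk', 'Wikipedia', 'Wikipedia talk',
               'File', 'File talk', 'Template', 'Template talk',
               'Category', 'Category talk', 'Portal', 'Draft', 'Module'];

function splitTitle($title, $namespaces) {
    $pos = strpos($title, ':');
    if ($pos !== false) {
        $prefix = substr($title, 0, $pos);
        if (in_array($prefix, $namespaces, true)) {
            return [$prefix, substr($title, $pos + 1)];
        }
    }
    return ['', $title]; // main namespace: empty first column
}

list($ns, $page) = splitTitle('Template:Infobox person', $namespaces);
echo $ns, "\t", $page, "\n"; // prints "Template<TAB>Infobox person"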

h4ck3rm1k3 commented 6 years ago

This task was based on the idea of extracting articles from dumps. If the dumps are not happening fast enough, then we should consider going from the realtime feed instead. I am going to close this for now because it might not be productive to go after the dumps.