nmadhire / jwpl

Automatically exported from code.google.com/p/jwpl

[RevisionMachine] Missing revision histories for articles with colon in title #33

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Currently, the RevisionMachine filters out all articles with a prefix in the 
title (like User:) unless the prefix appears in a whitelist.
This way, only "normal" articles appear in the db, plus everything you 
specifically define in the whitelist.
At the moment, a page is identified as having a prefix by looking for a colon 
in the title. There are, however, a few pages that have a colon in the title 
without using it for prefix demarcation. These pages are currently lost 
(<0.20%).
We should therefore adjust the filter and maybe go back to a (language-
dependent) blacklist filter.
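
To make the failure mode concrete, here is a minimal sketch of a colon-based check. It is not the actual RevisionMachine code: the class name, the method name, the whitelist contents, and the example titles are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not taken from the JWPL sources.
final class ColonHeuristicDemo {

    // Prefixes that are explicitly let through (example value, assumed).
    static final Set<String> WHITELIST = new HashSet<>(Arrays.asList("Category"));

    // Treats everything before the first colon as a prefix, which is the
    // heuristic described above.
    static boolean isKeptByColonHeuristic(String title) {
        int colon = title.indexOf(':');
        if (colon < 0) {
            return true; // no colon: treated as a normal article
        }
        String assumedPrefix = title.substring(0, colon);
        return WHITELIST.contains(assumedPrefix);
    }

    public static void main(String[] args) {
        // Intended behaviour: a real prefix that is not whitelisted is dropped.
        System.out.println(isKeptByColonHeuristic("User:Example"));
        // Intended behaviour: a whitelisted prefix is kept.
        System.out.println(isKeptByColonHeuristic("Category:Physics"));
        // Problem case: "Terminator 2" is not a prefix, but the page is dropped anyway.
        System.out.println(isKeptByColonHeuristic("Terminator 2: Judgment Day"));
    }
}
```

The last call is the case this issue is about: the colon belongs to the title itself, yet the check cannot tell it apart from a prefix separator.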

Original issue reported on code.google.com by oliver.ferschke on 7 Jul 2011 at 10:28

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 12 Jul 2011 at 9:00

GoogleCodeExporter commented 9 years ago
The WikipediaXMLReader should read in the namespace mappings from the dump, 
which should then be used by the article name checker for filtering.
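
A rough sketch of how the namespace mappings could be read from the `<siteinfo>` section, using the JDK's StAX API. This is not the WikipediaXMLReader implementation; the class and method names are hypothetical, and the sketch assumes the dump lists namespaces as `<namespace key="...">Name</namespace>` elements, with an empty element for the main namespace.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; the name and API are assumptions, not JWPL code.
final class SiteInfoNamespaces {

    // Returns namespace key -> namespace name as declared in <siteinfo>.
    static Map<Integer, String> read(String xml) {
        Map<Integer, String> namespaces = new LinkedHashMap<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "namespace".equals(reader.getLocalName())) {
                        int key = Integer.parseInt(reader.getAttributeValue(null, "key"));
                        // An empty element (the main namespace) yields "".
                        namespaces.put(key, reader.getElementText());
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "siteinfo".equals(reader.getLocalName())) {
                        break; // stop before the (large) page section starts
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse siteinfo", e);
        }
        return namespaces;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><siteinfo><namespaces>"
                + "<namespace key=\"0\" />"
                + "<namespace key=\"2\">User</namespace>"
                + "</namespaces></siteinfo></mediawiki>";
        System.out.println(read(xml));
    }
}
```

Since the names come from the dump itself, the mapping is language dependent without maintaining a separate list per language.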

Original comment by oliver.ferschke on 13 Jul 2011 at 1:17

GoogleCodeExporter commented 9 years ago
Reimplemented ArticleFilter. It now uses the namespace information from the 
siteinfo section of the XML dump.
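
For readers who want the idea in code: a minimal sketch of a namespace-aware title check. It is not the reimplemented ArticleFilter; the class name, constructor, and `accept` method are hypothetical, and the case-insensitive, underscore-tolerant matching is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the ArticleFilter shipped with JWPL.
final class NamespaceTitleFilter {

    private final Set<String> namespaceNames = new HashSet<>();
    private final Set<String> whitelist = new HashSet<>();

    NamespaceTitleFilter(Collection<String> namespaceNames, Collection<String> whitelist) {
        for (String name : namespaceNames) {
            String normalized = normalize(name);
            if (!normalized.isEmpty()) { // skip the unnamed main namespace
                this.namespaceNames.add(normalized);
            }
        }
        for (String name : whitelist) {
            this.whitelist.add(normalize(name));
        }
    }

    // A title is rejected only if the text before the first colon is a known
    // namespace name that is not whitelisted.
    boolean accept(String title) {
        int colon = title.indexOf(':');
        if (colon <= 0) {
            return true; // no prefix candidate at all
        }
        String candidate = normalize(title.substring(0, colon));
        if (!namespaceNames.contains(candidate)) {
            return true; // the colon is part of the article title
        }
        return whitelist.contains(candidate);
    }

    private static String normalize(String s) {
        return s.trim().replace('_', ' ').toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        NamespaceTitleFilter filter = new NamespaceTitleFilter(
                Arrays.asList("", "User", "User talk", "Category"),
                Arrays.asList("Category"));
        System.out.println(filter.accept("Terminator 2: Judgment Day")); // kept
        System.out.println(filter.accept("User:Example"));               // filtered
    }
}
```

With this kind of check, a title with a colon is only filtered when the part before the colon matches a namespace declared in the dump.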

Original comment by oliver.ferschke on 22 Jul 2011 at 3:35

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 22 Jul 2011 at 3:35

GoogleCodeExporter commented 9 years ago

Original comment by oliver.ferschke on 16 Feb 2012 at 1:24