pediapress / mwlib

mediawiki parser library

Allow mw-buildcdb to ignore redirects #14

Closed UltraNurd closed 12 years ago

UltraNurd commented 12 years ago

I added an --ignore-redirects flag to mw-buildcdb that exposes the ignore_redirects init parameter of DumpParser. With it, I can build a CDB from a dump file so that iterating over all articles yields each full page only once, rather than once more under every redirect title. Those duplicates were causing my code to do a lot of unnecessary processing and were throwing off the stats I was trying to accumulate.

Another option would be to flag each page as a redirect or not when building the CDB, and then check that flag when iterating over it. Ignoring redirects at build time seemed the best way to address the issue I ran into, which I posted about on the Google Group: http://groups.google.com/group/mwlib/browse_thread/thread/b22682b8da9563a5
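
For comparison, here is a rough sketch of what the flag-based alternative might look like. It is not mwlib code: a plain dict stands in for the CDB, and build_index, iter_full_pages, and the JSON record layout with an is_redirect field are all hypothetical names made up for illustration.

```python
import json


def build_index(pages, redirects):
    """Build a title -> serialized record mapping (a stand-in for the CDB).

    pages:     dict mapping a page title to its wikitext
    redirects: dict mapping a redirect title to its target title
    Both arguments and the record layout are assumptions for this sketch.
    """
    index = {}
    for title, text in pages.items():
        index[title] = json.dumps({"text": text, "is_redirect": False})
    for alias, target in redirects.items():
        if target in pages:
            # Store the target's text under the redirect title, but flag it.
            index[alias] = json.dumps({"text": pages[target], "is_redirect": True})
    return index


def iter_full_pages(index):
    """Yield (title, text) only for records not flagged as redirects."""
    for title, raw in index.items():
        record = json.loads(raw)
        if not record["is_redirect"]:
            yield title, record["text"]


pages = {"Python": "'''Python''' is a programming language."}
redirects = {"Python (language)": "Python", "Py": "Python"}

index = build_index(pages, redirects)
full_titles = [title for title, _ in iter_full_pages(index)]
```

This keeps lookups by redirect title working, at the cost of storing every duplicate and checking the flag on each iteration, whereas the --ignore-redirects flag avoids writing the duplicates at all.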