itkach / mwscrape

Download rendered articles from MediaWiki API to CouchDB
Mozilla Public License 2.0

enable usage of https://dumps.wikimedia.org/ #19

Closed drogga3 closed 4 years ago

drogga3 commented 4 years ago

Why scrape an online instance of MediaWiki when you can do it offline?

itkach commented 4 years ago

The previous generation of Aard Dictionary/tools did just that. However, as Wikipedia evolved, that approach became increasingly difficult to maintain, and it was generally impossible to achieve parity in rendering quality with MediaWiki (the software that powers Wikipedia sites). The key reasons are the incredible complexity of wiki markup and the templating language, the introduction of Lua-based templates, and the migration of factual data (for infoboxes and perhaps other page elements) to a database. In short, only MediaWiki can properly render MediaWiki content, so getting the rendered content directly from a MediaWiki site via its API is the only viable option.
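For readers who want to see what "getting the rendered content via the API" can look like, here is a minimal sketch using only the Python standard library. It is not mwscrape's actual code: the function names, the `User-Agent` string, and the choice of the English Wikipedia endpoint are assumptions made for illustration. It relies on the MediaWiki Action API's `action=parse` module, which returns the page HTML as rendered by MediaWiki itself (templates, Lua modules, and infobox data already expanded).

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for illustration; any MediaWiki site exposes a similar api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, api_url=API_URL):
    """Build an action=parse request URL asking for the rendered HTML of a page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",          # only the rendered HTML is requested
        "format": "json",
        "formatversion": "2",    # with version 2, the HTML is a plain string
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def extract_html(response):
    """Pull the rendered HTML out of a decoded action=parse JSON response."""
    if "error" in response:
        raise RuntimeError(response["error"].get("info", "unknown API error"))
    return response["parse"]["text"]


def fetch_rendered_html(title, api_url=API_URL):
    """Fetch one page over the network (hypothetical helper, not part of mwscrape)."""
    request = urllib.request.Request(
        build_parse_url(title, api_url),
        headers={"User-Agent": "render-sketch/0.1 (example)"},  # placeholder value
    )
    with urllib.request.urlopen(request) as reply:
        return extract_html(json.load(reply))
```

Calling `fetch_rendered_html("Aard Dictionary")` would return the article body as HTML, assuming the page exists and the site is reachable. A real scraper would also need rate limiting, retries, and storage, which this sketch leaves out.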

drogga3 commented 4 years ago

Fair point. I guess that's the only way to do it now. Thank you for the thorough explanation.