itkach / slob

Data store for Aard 2
GNU General Public License v3.0

Wikimedia data dumps #38

Closed — opk12 closed this issue 2 years ago

opk12 commented 2 years ago

The Create from MediaWiki sites section of README.md does not mention https://meta.wikimedia.org/wiki/Data_dumps. Wikimedia publishes database dumps for all of its wikis, including Wikipedia and Wiktionary, updated monthly or twice a month. Importing the dumps would be faster and lighter on resources than crawling, and the crawlers appear to be rate-limited.

itkach commented 2 years ago

The README does not mention it because a tool to convert MediaWiki data dumps into slob hasn't been created, and likely never will be. Tools for slob's predecessor did work with MediaWiki data dumps (using mwlib), and the results were decent, but they could never fully match how MediaWiki renders the same content, and over time the gap only widened, especially with the introduction of Lua-based templates and the move of Infobox and other data elements to a separate database. The practical reality is that the only software that can render MediaWiki content properly is MediaWiki itself. Don't let the lists of alternative parsers fool you: I guarantee you none of them is even close to adequate. Yes, getting rendered articles via MediaWiki takes a while (mwscrape tries to be respectful and is deliberately rate-limited to minimize the burden it puts on Wikipedia servers), but it works well enough and is fast enough for all practical purposes.

itkach commented 2 years ago

While browsing https://meta.wikimedia.org/wiki/Data_dumps I stumbled upon a new type of dump, seemingly published only for the past few months: https://dumps.wikimedia.org/other/enterprise_html/. These include rendered article HTML. This is a different story! These may be a viable alternative; I will look into it.
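For a rough idea of what consuming these dumps might look like: the Enterprise HTML dumps are distributed as archives of NDJSON, one JSON object per article, with the rendered HTML nested inside each record. A minimal sketch of extracting title/HTML pairs, assuming field names like `name` and `article_body.html` (the exact field names here are assumptions and should be checked against a real dump file):

```python
import json

def iter_articles(ndjson_lines):
    """Yield (title, html) pairs from Enterprise HTML dump NDJSON lines.

    Assumes each record carries the page title under "name" and the
    rendered HTML under "article_body" -> "html"; verify these field
    names against an actual dump before relying on them.
    """
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines defensively
        record = json.loads(line)
        title = record.get("name")
        html = (record.get("article_body") or {}).get("html")
        if title and html:
            yield title, html

# Tiny in-memory stand-in for one line of a dump file:
sample = [json.dumps({
    "name": "Example",
    "article_body": {"html": "<p>Rendered article HTML</p>"},
})]
for title, html in iter_articles(sample):
    print(title, html)
```

In practice the lines would come from streaming the (very large) archive rather than from memory, but the per-record handling would be the same.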