khorneflaeks / Minipedia

Taking the first paragraph of every English Wikipedia article and storing it in a 5-bit file.
https://minipedia.xyz
GNU General Public License v3.0

Pulling the entirety of Wikipedia. #4

Closed. Webatron11 closed this issue 2 years ago.

Webatron11 commented 2 years ago

We can use the "all pages" function of the Wikipedia API to pull 500 page titles at once. It also helpfully gives us a continuation point to work from.

all pages call -> 500 page titles -> list of titles -> pull page (x10) -> return to 0 -> continue
      ^                                                                                       |
      '---------------------------------------------------------------------------------------'
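A minimal sketch of that loop, assuming Python with the requests library. The allpages list, its 500-title limit, and the continuation token are documented MediaWiki API behaviour; the function name and timeout are illustrative only.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def iter_all_titles(session: requests.Session):
    """Yield every article title, 500 at a time, following the continuation point."""
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "aplimit": 500,                  # maximum titles per request for anonymous clients
        "apfilterredir": "nonredirects",
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            break                        # no continuation token left: we've reached the end
        params.update(cont)              # carry apcontinue/continue into the next request
```

The titles yielded here can then be batched into the per-page pulls from the diagram.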

Webatron11 commented 2 years ago

Basic scraper is working. There may be issues with excontinue being returned while scraping, since TextExtracts caps how many extracts it returns per request; see https://www.mediawiki.org/wiki/Extension:TextExtracts

Example situation: https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&generator=allpages&exintro=1&explaintext=1&gapcontinue=Z-order&gapfilterredir=nonredirects&gaplimit=80
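One way to cope with that, sketched below rather than taken from the repo's scraper: echo the whole continue block (which may carry excontinue and/or gapcontinue) back into the next request, as the API documentation recommends, and only keep pages that actually arrived with an extract. The gaplimit of 20 is an assumption chosen to stay within TextExtracts' per-request extract cap; the example URL above uses 80.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def iter_intros(session: requests.Session, start_title: str | None = None):
    """Yield (title, intro) pairs, following excontinue/gapcontinue until exhausted."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "generator": "allpages",
        "exintro": 1,
        "explaintext": 1,
        "gapfilterredir": "nonredirects",
        "gaplimit": 20,  # assumption: keep each batch within TextExtracts' extract limit
    }
    if start_title:
        params["gapcontinue"] = start_title

    seen = set()
    while True:
        data = session.get(API, params=params, timeout=30).json()
        for page in data.get("query", {}).get("pages", {}).values():
            # Under excontinue the same batch can come back with extracts filled in
            # piecemeal, so only yield pages that have one and skip repeats.
            if "extract" in page and page["title"] not in seen:
                seen.add(page["title"])
                yield page["title"], page["extract"]
        if "continue" not in data:
            break                        # no continuation left: the whole run is done
        params.update(data["continue"])  # echo the full continue block back
```

Tracking seen titles is enough for a sketch; a real run over all of Wikipedia would want to bound that set or key off page IDs instead.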

Webatron11 commented 2 years ago

The scraper works; singlefile support is being worked on.