r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Feat/wiki scraper #51

Closed · blester125 closed this 7 months ago

blester125 commented 1 year ago

This PR adds scripts that can be used to get an XML export of MediaWiki sites that don't provide dumps. The resulting dump contains a list of `<page>` elements, one for each exported page. Each page has multiple `<revision>` elements, which can be used to create an author list. The most recent `<revision>`'s `<text>` can be used to get the MediaWiki markup representation of the page to use as the document text.
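For reference, a minimal sketch of how such an export could be consumed, assuming the standard MediaWiki export schema (the exact namespace version varies per wiki, and this is not the PR's actual code):

```python
# Minimal sketch: walk an export file, collect an author list from every
# revision, and take the last revision's text as the page's wikitext.
import xml.etree.ElementTree as ET


def parse_export(path):
    """Yield (title, authors, wikitext) for each <page> in an export file."""
    # For very large exports, ET.iterparse would be preferable to ET.parse.
    root = ET.parse(path).getroot()
    # `{*}` wildcards avoid hard-coding the export schema version
    # (e.g. export-0.10 vs export-0.11); requires Python 3.8+.
    for page in root.iterfind("{*}page"):
        title = page.findtext("{*}title")
        revisions = page.findall("{*}revision")
        authors = set()
        for rev in revisions:
            contributor = rev.find("{*}contributor")
            if contributor is None:
                continue
            name = contributor.findtext("{*}username") or contributor.findtext("{*}ip")
            if name:
                authors.add(name)
        # Assumption: revisions are listed oldest-to-newest, so the last
        # element holds the most recent wikitext.
        text = revisions[-1].findtext("{*}text", default="") if revisions else ""
        yield title, sorted(authors), text
```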

An index of pages is built using the Special:AllPages query URL, and then exports are made using Special:Export.
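As a rough illustration of the export step, the sketch below POSTs batches of titles to Special:Export once the Special:AllPages index exists. The parameter names (`pages`, `curonly`, `action=submit`) follow the standard MediaWiki export form, but the script path, batch size, and helper are assumptions, not the code in this PR:

```python
# Sketch: batched exports against a wiki's Special:Export endpoint.
import requests


def export_pages(wiki_base_url, titles, current_only=False, batch_size=50):
    """POST batches of page titles to Special:Export and yield the XML text."""
    export_url = f"{wiki_base_url}/index.php"
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        data = {
            # Special:Export expects newline-separated page titles.
            "pages": "\n".join(batch),
        }
        if current_only:
            data["curonly"] = "1"
        resp = requests.post(
            export_url,
            params={"title": "Special:Export", "action": "submit"},
            data=data,
            timeout=60,
        )
        resp.raise_for_status()
        yield resp.text
```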

StellaAthena commented 1 year ago

I think it's a good idea to replace Wikipedia's custom mathematics syntax with LaTeX. Does it make sense to do it at this stage of the pipeline, or later?

blester125 commented 1 year ago

@craffel Yeah, it should be easy to parallelize this. It runs off files that list page titles (one per line), so you can parallelize over the files (and we are already parallelizing over wikis), and we can split the inputs pretty easily for more parallelism.
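A rough sketch of that file-level parallelism; `export_titles_file` here is a hypothetical stand-in for whatever per-file export routine the scripts actually use:

```python
# Sketch: each worker processes whole title files, so splitting the title
# lists into more files directly increases the available parallelism.
from multiprocessing import Pool
from pathlib import Path


def export_titles_file(titles_path):
    """Export every page listed in a single titles file (placeholder body)."""
    titles = Path(titles_path).read_text().splitlines()
    # ... call Special:Export for these titles and write the XML out ...
    return titles_path, len(titles)


def export_all(title_files, workers=8):
    with Pool(processes=workers) as pool:
        for path, count in pool.imap_unordered(export_titles_file, title_files):
            print(f"exported {count} pages from {path}")
```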

@StellaAthena I think it would be best to have that happen later. I was thinking that after this export there would be a step that converts these XML files to the dolma format, which would have raw wiki markup as the text field. Then the next step would be converting wikitext to plaintext, and that is where the math conversion would happen.
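A hedged sketch of that XML-to-dolma step: one JSON record per page with the raw wikitext in `text` and the authors carried in `metadata`. The field names follow the common dolma document layout (`id`/`text`/`source`/`added`/`created`/`metadata`), but the `source` value and output naming are assumptions:

```python
# Sketch: write (title, authors, wikitext) tuples as gzipped dolma JSONL.
import datetime
import gzip
import json


def pages_to_dolma(pages, output_path, source="wiki/scrape"):
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with gzip.open(output_path, "wt", encoding="utf-8") as f:
        for title, authors, wikitext in pages:
            record = {
                "id": title,
                "text": wikitext,  # raw wiki markup at this stage
                "source": source,
                "added": now,
                "created": now,
                "metadata": {"title": title, "authors": authors},
            }
            f.write(json.dumps(record) + "\n")
```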

blester125 commented 7 months ago

Datasets have been uploaded to https://huggingface.co/datasets/blester125/wiki-dolma

The WikiMedia + Talk pages are cleaner and have 14.6 billion tokens. The WikiTeam3 wikis have 65.1 billion tokens; they are less clean, and various default/boilerplate pages pop up.
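For a quick look at the uploaded data, something like the following should work, assuming the repo's files auto-load with the `datasets` library and that a default configuration with a `train` split exists:

```python
# Sketch: stream a few examples from the uploaded dataset for inspection.
from datasets import load_dataset

ds = load_dataset("blester125/wiki-dolma", split="train", streaming=True)
for example in ds:
    print(example)
    break
```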