I think it's a good idea to replace Wikipedia's custom mathematics syntax with LaTeX. Does it make sense to do it at this stage of the pipeline, or later?
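If the goal is, say, turning `<math>` tags into standard delimiters, here is a minimal sketch (untested, and assuming the LaTeX body inside the tags can be kept mostly as-is):

```python
import re

# Wikitext <math>...</math> tags already wrap (mostly) LaTeX, so a first-pass
# conversion can just swap the tags for $...$ delimiters. The tag-attribute
# handling here is illustrative, not exhaustive.
MATH_TAG = re.compile(r"<math(?:\s[^>]*)?>(.*?)</math>", re.DOTALL)

def convert_math(wikitext: str) -> str:
    """Replace <math>...</math> spans with inline $...$ LaTeX delimiters."""
    return MATH_TAG.sub(lambda m: f"${m.group(1).strip()}$", wikitext)

print(convert_math("Einstein showed <math>E = mc^2</math> in 1905."))
```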
@craffel Yeah, it should be easy to parallelize this. It runs off files that list page titles (one per line), so you can parallelize over the files (and we are already parallelizing over wikis), and we can split the inputs pretty easily for even more parallelism.
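As a rough sketch of what I mean (the `export_titles` worker here is hypothetical, not the code in this PR), parallelizing over the title files could look like:

```python
import multiprocessing
from pathlib import Path

def export_titles(title_file: Path) -> None:
    """Hypothetical worker: export every page listed in one title file."""
    titles = title_file.read_text().splitlines()
    # ... fetch each title via Special:Export and write the XML dump here ...
    print(f"{title_file.name}: {len(titles)} titles")

def main() -> None:
    title_files = sorted(Path("titles").glob("*.txt"))  # illustrative layout
    # One worker per title file; the files themselves can be split further
    # if we want more parallelism than there are files.
    with multiprocessing.Pool() as pool:
        pool.map(export_titles, title_files)

if __name__ == "__main__":
    main()
```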
@StellaAthena I think it would be best to have that happen later. I was thinking that after this export there would be a step that converts these XML dumps to dolma, with the raw wiki markup as the text field. The next step would then convert the wikitext to plaintext, and that is where the math conversion would happen.
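To make that concrete, here is a minimal sketch of one converted record, assuming the usual dolma JSONL layout of `id`/`text`/`source`/`metadata` fields (the field contents here are illustrative):

```python
import json

# Sketch of a dolma-style JSONL record with raw wiki markup kept as `text`;
# plaintext/math conversion would happen in a later pipeline step.
record = {
    "id": "example-wiki/Some_Page",
    "text": "== Heading ==\nRaw '''wikitext''' markup, converted to plaintext later.",
    "source": "wiki-dolma",
    "metadata": {
        "title": "Some Page",
        # Author list built from the <revision> history in the XML export.
        "authors": ["ExampleUser1", "ExampleUser2"],
    },
}
print(json.dumps(record))
```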
Datasets have been uploaded to https://huggingface.co/datasets/blester125/wiki-dolma
Wikimedia + Talk pages are cleaner and have 14.6 billion tokens. The WikiTeam3 wikis have 65.1 billion tokens; they are less clean, and various default/boilerplate pages pop up.
This PR adds scripts that can be used to get an XML export of MediaWiki sites that don't provide dumps. The resulting dump contains a list of `<page>` elements, one for each exported page. Each page has multiple `<revision>` elements, which can be used to create an author list. The most recent `<revision>`'s `<text>` can be used to get the MediaWiki markup representation of the page to use as the document text. An index of pages is built using the `Special:AllPages` query URL, and then exports are made using `Special:Export`.
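As a rough illustration of the export half of that flow (the URL and parameter choices here are illustrative, not lifted from the scripts in this PR), a batch of titles can be exported by POSTing to `Special:Export`:

```python
import requests

def export_pages(base_url: str, titles: list[str]) -> str:
    """POST a newline-separated list of titles to Special:Export; returns XML."""
    resp = requests.post(
        f"{base_url}/Special:Export",
        data={
            "pages": "\n".join(titles),
            # Request full revision history so an author list can be built;
            # note that some wikis limit or disable full-history export.
            "history": "1",
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

# Usage sketch against a MediaWiki-style URL layout:
xml_dump = export_pages("https://en.wikipedia.org/wiki", ["Pet_door", "Cat"])
print(xml_dump[:200])
```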