Open blester125 opened 10 months ago
@adarob can you post any info on how to get the syntax out of a wikidot page and the little hacky parser I wrote?
10M pages will likely turn into less than 1B tokens. I recommend putting off work on this and instead focusing on larger or higher-value datasets.
I would consider this a highish-value dataset. 1B tokens doesn't seem that small to me either, given that we have so many sources. I agree that we shouldn't put a crazy amount of effort into this source though.
Based on a simple check ("wikidot" in m["metadata"]["identifier"] for m in metadata), none of the wikidot wikis are included in the wikiteam wikis on the Internet Archive.
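For anyone who wants to repeat that check, here is a minimal sketch. It assumes `metadata` is a list of dicts shaped like the expression above (each with `m["metadata"]["identifier"]`); the function name and the sample records are hypothetical, and lowercasing the identifier is an extra assumption not in the original check.

```python
def wikidot_identifiers(metadata):
    """Return the identifiers of entries whose identifier mentions wikidot."""
    return [
        m["metadata"]["identifier"]
        for m in metadata
        # Lowercasing is an added assumption so "Wikidot" would also match.
        if "wikidot" in m["metadata"]["identifier"].lower()
    ]


# Hypothetical records, only to show the expected shape of the input.
sample = [
    {"metadata": {"identifier": "wiki-examplefandomcom"}},
    {"metadata": {"identifier": "wiki-example.wikidot.com"}},
]

matches = wikidot_identifiers(sample)
print(f"{len(matches)} wikidot entries found")
```

An empty list on the real metadata corresponds to the "none are included" result reported above.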
Based on https://github.com/search?q=repo%3Asaveweb%2Fwikiteam3%20wikidot&type=code it looks like the wikiteam3 scraper is able to handle wikidot so we should be able to scrape them ourselves pretty easily. Edit: Oops, I lied, they do a bunch of stuff to detect other wiki engines but that just gets used to reject scraping of non-MediaWiki sites.
As the wiki markup is different from standard wikitext, we should probably try to keep the wikidot implementation separate from the rest of the wiki code.
Agreed!
Domain: Wiki farm, lots of random content, ~10M pages
Does not use wikitext (syntax reference: https://www.wikidot.com/doc-wiki-syntax:start). Can probably just remove all [[...]] blocks, ||, and @@.
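A rough sketch of that cleanup, assuming the page source is already available as a string. The function name `strip_wikidot` is hypothetical, and this is not the parser mentioned earlier in the thread. One added assumption: Wikidot writes links with triple brackets, so the pattern also consumes an optional third bracket on each side; like every other [[...]] block, the link text is dropped rather than kept.

```python
import re

# Matches [[...]] blocks, plus the triple-bracket [[[...]]] link form.
# DOTALL lets a block span several lines; the lazy quantifier stops at the
# first closing brackets so neighbouring blocks are removed separately.
BLOCK_RE = re.compile(r"\[\[\[?.*?\]\]\]?", re.DOTALL)


def strip_wikidot(text):
    """Remove [[...]] blocks, || table delimiters, and @@ markers."""
    text = BLOCK_RE.sub("", text)
    # Replace table cell delimiters with a space so cell contents stay apart.
    text = text.replace("||", " ")
    # Drop the @@ markers but keep the text between them.
    text = text.replace("@@", "")
    return text


page = '[[div class="box"]]\n||name||value||\nSee @@raw text@@ here.\n[[/div]]'
print(strip_wikidot(page))
```

This leaves other inline markup and extra whitespace untouched, so a real pipeline would likely need a few more rules on top.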