r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

WikiDot #16

Open blester125 opened 10 months ago

blester125 commented 10 months ago

Domain: Wiki farm, lots of random content, ~10M pages

Does not use wikitext; see https://www.wikidot.com/doc-wiki-syntax:start. Can probably just remove all [[...]] blocks, ||, and @@.
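A rough sketch of that stripping idea, in case it helps get started. The function name and the regex are assumptions for illustration, not code from this repo, and the approach is deliberately naive: it drops the [[...]] tags but keeps whatever text sits between an opening and closing tag, and it will leave a stray `]` behind for triple-bracket links like [[[page | label]]].

```python
import re

# Assumption: block tags look like [[...]] and do not nest.
# DOTALL lets a tag span multiple lines; the non-greedy match stops at the first "]]".
BLOCK_TAG = re.compile(r"\[\[.*?\]\]", re.DOTALL)


def strip_wikidot_markup(text: str) -> str:
    """Naively remove [[...]] tags, || table separators, and @@ raw-text markers."""
    text = BLOCK_TAG.sub("", text)
    text = text.replace("||", " ")  # table cell separators become spaces
    text = text.replace("@@", "")   # drop the markers, keep the enclosed text
    # Collapse runs of spaces/tabs left behind by the removals; newlines are kept.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = '[[div class="x"]]Hello[[/div]] ||a||b|| @@raw@@'
    print(strip_wikidot_markup(sample))
```

Anything beyond this (links, includes, code blocks) would need real handling of the syntax described in the doc linked above.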

craffel commented 10 months ago

@adarob can you post any info on how to get the syntax out of a wikidot page and the little hacky parser I wrote?

StellaAthena commented 8 months ago

10M pages will likely turn into fewer than 1B tokens. I recommend putting this off and focusing instead on larger or higher-value datasets.

craffel commented 8 months ago

I would consider this a highish-value dataset. 1B tokens doesn't seem that small to me either, given that we have so many sources. I agree that we shouldn't put a crazy amount of effort into this source, though.

blester125 commented 3 months ago

Based on a simple check (`"wikidot" in m["metadata"]["identifier"]` for each `m` in `metadata`), none of the wikidot wikis are included in the wikiteam wikis on the Internet Archive.
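For reference, the check can be written out roughly like this. The shape of `metadata` (a list of dicts with a nested "metadata" dict holding an "identifier" string) is assumed from the expression above, and the sample identifiers are made up for illustration.

```python
def wikidot_items(metadata):
    """Return the records whose archive identifier mentions "wikidot"."""
    return [m for m in metadata if "wikidot" in m["metadata"]["identifier"]]


if __name__ == "__main__":
    # Hypothetical records, only to show the expected structure.
    sample_metadata = [
        {"metadata": {"identifier": "wiki-example.org"}},
        {"metadata": {"identifier": "wiki-another-example.net"}},
    ]
    matches = wikidot_items(sample_metadata)
    print(f"{len(matches)} wikidot items found")
```

An empty result on the real metadata is what the statement above is based on.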

Based on https://github.com/search?q=repo%3Asaveweb%2Fwikiteam3%20wikidot&type=code it looks like the wikiteam3 scraper is able to handle wikidot, so we should be able to scrape them ourselves pretty easily. Edit: Oops, I lied. They do a bunch of stuff to detect other wiki engines, but that just gets used to reject scraping of non-MediaWiki sites.

As the wiki markup is different from standard wikitext, we should probably try to keep the wikidot implementation separate from the rest of the wiki code.

craffel commented 3 months ago

Agreed!