r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License
22 stars 6 forks

Wikis that have dumps (e.g. wikimedia ones) #1

Closed blester125 closed 3 months ago

blester125 commented 10 months ago

Several datasets come from Wikimedia sources, and they provide data dumps. These dumps contain wikitext markup, which can be parsed with libraries like wtf_wikipedia.
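As a rough illustration of the first step, here is a minimal sketch of pulling raw wikitext out of a MediaWiki XML export using only the Python standard library. It assumes the usual export layout (`<page>` elements containing a `<title>` and a `<revision>` with a `<text>` child). The `iter_pages` and `local_name` helpers and the `SAMPLE` document are made up for this example and are not code from this repo. It only extracts the markup; turning wikitext into plain text would still be left to a parser such as wtf_wikipedia.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    `source` is a filename or a binary file object. Pages are streamed
    with iterparse so the whole dump never has to fit in memory.
    """
    title, text = None, None
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # Empty pages have no text node, so fall back to "".
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children


# Tiny hand-written sample standing in for a real dump (assumption:
# real dumps follow the same page/title/revision/text nesting).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[page]].</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, "->", wikitext)
```

For a compressed dump, a file object from `bz2.open(path, "rb")` could be passed in place of the `BytesIO` object, where `path` is wherever the dump was downloaded.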

craffel commented 9 months ago

@conceptofmind was also asking me about mediawiki parsing, so maybe coordinate to make sure you aren't overlapping work. The main thing we discussed was that I suggested using spencermountain/wtf_wikipedia plus a little bit of postprocessing.

conceptofmind commented 9 months ago

> @conceptofmind was also asking me about mediawiki parsing, so maybe coordinate to make sure you aren't overlapping work. The main thing we discussed was that I suggested using spencermountain/wtf_wikipedia plus a little bit of postprocessing.

Will talk to Shayne and Luca to ensure that there is no overlap on this.

shayne-longpre commented 8 months ago

https://github.com/spencermountain/wtf_wikipedia

conceptofmind commented 7 months ago

https://github.com/spencermountain/dumpster-dip/issues/4

blester125 commented 3 months ago

Subsumed into https://github.com/r-three/common-pile/issues/82