blester125 closed this 3 months ago
@conceptofmind was also asking me about MediaWiki parsing, so maybe coordinate to make sure you aren't overlapping work. The main thing we discussed was my suggestion to use spencermountain/wtf_wikipedia plus a little bit of postprocessing.
Will talk to Shayne and Luca to ensure that there is no overlap on this.
Subsumed into https://github.com/r-three/common-pile/issues/82
Several datasets come from Wikimedia sources, which provide data dumps. These dumps contain wikitext markup, which can be parsed with libraries like wtf_wikipedia.
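
For reference, here is a rough sketch of the step that comes before wikitext parsing: pulling the raw wikitext for each page out of a MediaWiki XML export. It uses only the Python standard library. The names `iter_pages`, `local_name`, and `SAMPLE` are made up for illustration, and the sample document (including the export namespace version) is an assumption about what a dump looks like, not taken from a real one. Converting the extracted wikitext to plain text would still be handed off to a parser such as wtf_wikipedia, which is not shown here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    `dump` is a binary file-like object. Pages are streamed one at a
    time so a large dump does not have to fit in memory.
    """
    title, text = None, None
    for _, elem in ET.iterparse(dump, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # If a page carries several revisions, the last one wins.
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree


# Hypothetical one-page export, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[page]].</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, "->", wikitext)
```

For a real dump, the same function should work on an opened file, e.g. `bz2.open(path, "rb")` for a compressed export, with the wikitext of each page then passed to whichever parser ends up being used.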