Several well-known datasets are made up of corpora, a large percentage of which we want to include. This issue tracks our coverage of those datasets. A checked box means we are finished with a dataset, either because the code is done or because we have rejected it. See the attached issue or note for further details.
The Pile
- Books3: Skipping for licensing reasons
- OpenSubtitles: Skipping for licensing reasons https://github.com/r-three/licensed-pile/issues/23
- BookCorpus2: Skipping for licensing reasons
- Hacker News: https://github.com/r-three/licensed-pile/issues/6
- YouTube Subtitles: Skipping for licensing reasons
- PhilPapers: Skipping for licensing reasons

The Stack
Red Pajamas
- Books: Skipping for licensing reasons

Silo LM
- HackerNews: https://github.com/r-three/licensed-pile/issues/6