r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

Famous Datasets Tracker #46

Open StellaAthena opened 9 months ago

StellaAthena commented 9 months ago

A number of famous datasets consist largely of data that we want to include. The purpose of this issue is to make it easy to track our coverage of these datasets. A checked box means we're finished with that dataset, either because the code is done or because we've rejected it. See the issue or note attached to each dataset for further details.

The Pile

The Stack

RedPajama

Silo LM