togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Local data #90

Closed. mauriceweber closed this 7 months ago

mauriceweber commented 7 months ago

This PR allows the entire pipeline to run on local data, so no S3 setup is required. This fixes #83.
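For readers unfamiliar with the idea, here is a minimal sketch of how a pipeline might distinguish local inputs from S3 inputs. It is not taken from this PR's diff: the helper name `resolve_input` and the example paths are hypothetical and exist only for illustration.

```python
from typing import Tuple
from urllib.parse import urlparse


def resolve_input(uri: str) -> Tuple[str, str]:
    """Classify an input location as "local" or "s3".

    Hypothetical helper for illustration only; the repository's actual
    code may handle this differently.
    """
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        # For s3://bucket/key, the bucket is in netloc and the key in path.
        return "s3", parsed.netloc + parsed.path
    if parsed.scheme in ("", "file"):
        # Plain paths and file:// URIs are read from the local filesystem.
        return "local", parsed.path
    raise ValueError("unsupported input scheme: " + repr(parsed.scheme))


if __name__ == "__main__":
    # Example paths are made up.
    for location in ("s3://my-bucket/crawl/part-0.json.gz", "data/crawl/part-0.json.gz"):
        print(location, "->", resolve_input(location))
```

With a dispatch like this, a local run would only ever take the second branch, so no S3 credentials or client would be needed.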