togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Drive space to store #52

Closed: tstandley closed this issue 1 year ago

tstandley commented 1 year ago

Hey, could you list somewhere the total drive space needed to store the full dataset?

mauriceweber commented 1 year ago

Hi @tstandley! The total size is around 5 TB uncompressed and around 3 TB compressed. Hope this helps!
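
For anyone planning a download, a rough pre-flight check of free space against the figures above might look like the sketch below. It is not part of the RedPajama-Data repository: `has_room`, the `margin` parameter, and the reading of "T" as decimal terabytes (10**12 bytes) are all assumptions made for illustration, and the sizes are only the approximate numbers quoted in this thread.

```python
import shutil

# Approximate sizes quoted in this thread.
# Assumption: "T" means decimal terabytes (10**12 bytes).
UNCOMPRESSED_BYTES = 5 * 10**12
COMPRESSED_BYTES = 3 * 10**12


def has_room(path: str, required_bytes: int, margin: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` has at least
    `required_bytes` plus a fractional safety `margin` free.

    Hypothetical helper, not an API of the RedPajama-Data code.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + margin)


if __name__ == "__main__":
    target = "."  # directory on the drive you plan to download to
    print("room for compressed copy:  ", has_room(target, COMPRESSED_BYTES))
    print("room for uncompressed copy:", has_room(target, UNCOMPRESSED_BYTES))
```

If you intend to keep the compressed files while also extracting them, check against the sum of both sizes rather than either one alone.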