togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Expected finish time for processing one single index of commoncrawl? #35

Closed (kimcando closed this issue 3 weeks ago)

kimcando commented 1 year ago

One more question, please.

Using the provided command, how long does each step (e.g., quality filtering, deduplication, quality classifier) take to finish when processing a single Common Crawl index (e.g., 2023-06)?

Thank you!

mauriceweber commented 1 year ago

We used a machine with 64 cores and 512 GB of RAM, and processing one CC dump with the cc_net pipeline took about 2-3 days. You can expect another day for deduplication and applying the quality classifier.
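As a rough sketch of the arithmetic, the figures above can be turned into a wall-clock estimate for several dumps. The function name `estimate_days` and the idea of spreading dumps over several identical machines are assumptions for illustration only; the numbers are just the ones quoted here for a single 64-core, 512 GB machine, and real runtimes will vary with hardware and dump size.

```python
import math

# Figures quoted in this thread for one 64-core, 512 GB RAM machine.
CC_NET_DAYS = (2.0, 3.0)      # cc_net pipeline, per CC dump (low, high)
POST_PROCESSING_DAYS = 1.0    # deduplication + quality classifier, per dump


def estimate_days(num_dumps: int, num_machines: int = 1) -> tuple[float, float]:
    """Return a (low, high) wall-clock estimate in days.

    Assumption (not from the thread): dumps are processed independently,
    one dump per machine at a time, so machines work through dumps in rounds.
    """
    if num_dumps < 1 or num_machines < 1:
        raise ValueError("num_dumps and num_machines must be positive")
    rounds = math.ceil(num_dumps / num_machines)
    low = rounds * (CC_NET_DAYS[0] + POST_PROCESSING_DAYS)
    high = rounds * (CC_NET_DAYS[1] + POST_PROCESSING_DAYS)
    return low, high


if __name__ == "__main__":
    low, high = estimate_days(num_dumps=1)
    print(f"One dump on one machine: roughly {low:g}-{high:g} days")
```

For a single dump this gives about 3-4 days end to end, which is simply the 2-3 days for cc_net plus the extra day for deduplication and the quality classifier.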

You can use the quality classifier that we have trained, so you don't have to retrain it (this part of the README points you to the model).

newbietuan commented 1 year ago

Hello, how much disk space will be needed? About 100 TB?

kimcando commented 1 year ago

@mauriceweber A single machine with 64 cores and 512 GB of RAM? For a single index?

mauriceweber commented 3 weeks ago

Yes, this is correct.