huggingface / olm-datasets

Pipeline for pulling and processing online language model pretraining data from the web
Apache License 2.0

Resources used to produce this version of the dataset? #4

spate141 closed this issue 1 year ago

spate141 commented 1 year ago

Can you provide any details about resources (CPUs, memory, storage, time) used to produce this dataset?

From the OLM/CC GitHub readme, I can estimate that getting and processing 20% of the August 2022 CC snapshot, which is about 1.45 TB of data, requires roughly 15 TB of disk storage, and that full deduplication requires about 700 to 900 GB of memory. But I can't find any details about how many CPUs were used or how long the processing took. Was this data processed on a single machine with a single disk?

TristanThrush commented 1 year ago

Yes, a single machine and a single disk: specifically an n2d-standard-224 instance running Ubuntu 20.04 LTS, which has 224 vCPUs and about 800 GB of RAM. The disk I used is about 15 TB. The deduplication stage really pushes the RAM to its limit, so you would need a machine with more RAM to generate a dataset that is even a little larger. The whole pipeline takes about 2 days to run.
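For anyone sizing a machine for a different snapshot fraction, here is a minimal back-of-envelope sketch (plain Python, not part of the olm-datasets pipeline) that linearly scales the figures reported in this thread. The linear-scaling assumption is mine; in practice the deduplication stage can grow faster than linearly, so treat the RAM estimate as a lower bound.

```python
# Rough resource estimator based on the numbers in this thread:
# 20% of the August 2022 CC snapshot (~1.45 TB of text) took ~15 TB of disk,
# ~700-900 GB of RAM at the deduplication stage, and ~2 days on 224 vCPUs.
# Linear scaling with the snapshot fraction is an assumption, not a guarantee.

def estimate_resources(snapshot_fraction: float) -> dict:
    """Scale the reported requirements to another fraction of a CC snapshot."""
    baseline_fraction = 0.20
    scale = snapshot_fraction / baseline_fraction
    return {
        "disk_tb": 15 * scale,                     # scratch space + intermediate files
        "dedup_ram_gb": 800 * scale,               # midpoint of the reported 700-900 GB
        "wall_clock_days": 2 * scale,              # assumes the same 224-vCPU machine
    }

if __name__ == "__main__":
    print(estimate_resources(0.20))  # roughly the setup described above
    print(estimate_resources(0.40))  # a slightly larger run already exceeds ~800 GB of RAM
```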

spate141 commented 1 year ago

Thank you for the information!