huggingface / olm-datasets

Pipeline for pulling and processing online language model pretraining data from the web
Apache License 2.0

Resources used to produce this version of the dataset? #4

spate141 closed this issue 1 year ago

spate141 commented 1 year ago

Can you provide any details about resources (CPUs, memory, storage, time) used to produce this dataset?

From the OLM/CC GitHub readme, I can estimate that getting and processing 20% of the August 2022 CC snapshot, which is about 1.45 TB of data, requires roughly 15 TB of disk storage, and that full deduplication requires about 700 to 900 GB of memory. But I can't find any details about how many CPUs were used or how long the processing took. Was this data processed on a single machine with a single disk?

TristanThrush commented 1 year ago

Yes, a single machine and a single disk: specifically an n2d-standard-224 instance running Ubuntu 20.04 LTS, which has 224 vCPUs and about 800 GB of RAM. The disk I used is about 15 TB. The deduplication stage really pushes the RAM to its limit, so you would need a machine with more RAM to generate a dataset that is even a little larger. The whole pipeline takes about 2 days to run.
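For anyone sizing a machine for a different snapshot fraction, here is a minimal back-of-envelope sketch (plain Python, not part of the olm-datasets pipeline) that linearly scales the figures reported in this thread. The linear-scaling assumption is mine; in practice the deduplication stage can grow faster than linearly, so treat the RAM estimate as a lower bound.

```python
# Rough resource estimator based on the numbers in this thread:
# 20% of the August 2022 CC snapshot (~1.45 TB of text) took ~15 TB of disk,
# ~700-900 GB of RAM at the deduplication stage, and ~2 days on 224 vCPUs.
# Linear scaling with the snapshot fraction is an assumption, not a guarantee.

def estimate_resources(snapshot_fraction: float) -> dict:
    """Scale the reported requirements to another fraction of a CC snapshot."""
    baseline_fraction = 0.20
    scale = snapshot_fraction / baseline_fraction
    return {
        "disk_tb": 15 * scale,                     # scratch space + intermediate files
        "dedup_ram_gb": 800 * scale,               # midpoint of the reported 700-900 GB
        "wall_clock_days": 2 * scale,              # assumes the same 224-vCPU machine
    }

if __name__ == "__main__":
    print(estimate_resources(0.20))  # roughly the setup described above
    print(estimate_resources(0.40))  # a slightly larger run already exceeds ~800 GB of RAM
```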

spate141 commented 1 year ago

Thank you for the information!