Closed: spate141 closed this issue 1 year ago
Yes, a single machine and a single disk. Specifically, an n2d-standard-224 running Ubuntu 20.04 LTS. This machine has 224 CPUs and about 800GB of RAM. The disk I used is about 15TB. The deduplication stage really pushes the RAM to its limit, so you need a machine with more RAM if you want to generate a dataset that is even a little larger. It takes about 2 days for the whole pipeline to run.
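In case it helps anyone reproducing this, here is a minimal preflight sketch (not part of the pipeline) that checks a Linux host against the numbers reported above before kicking off a run. The `WORK_DIR` mount point and the thresholds are placeholders based on this thread, not values taken from the repo.

```python
import os
import shutil

# Assumed thresholds, taken from the numbers discussed in this thread:
# ~800 GB RAM (deduplication is the memory peak) and ~15 TB of disk.
MIN_RAM_GB = 800
MIN_DISK_TB = 15
WORK_DIR = "/data"  # hypothetical mount point for the working disk

def available_ram_gb() -> float:
    """Read MemTotal from /proc/meminfo (Linux only, e.g. Ubuntu 20.04)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / 1024**2  # kB -> GB
    raise RuntimeError("MemTotal not found in /proc/meminfo")

def free_disk_tb(path: str) -> float:
    """Free space on the filesystem holding `path`, in TB."""
    return shutil.disk_usage(path).free / 1024**4

if __name__ == "__main__":
    ram = available_ram_gb()
    disk = free_disk_tb(WORK_DIR)
    print(f"CPUs: {os.cpu_count()}, RAM: {ram:.0f} GB, free disk: {disk:.1f} TB")
    if ram < MIN_RAM_GB or disk < MIN_DISK_TB:
        print("Warning: below the resources reported for this run; "
              "deduplication may run out of memory or disk.")
```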
Thank you for the information!
Can you provide any details about the resources (CPUs, memory, storage, time) used to produce this dataset?
From the OLD/CC GitHub readme, I can estimate that getting and processing 20% of the August 2022 CC snapshot, which is about 1.45 TB of data, requires roughly 15TB of disk storage, and that deduplication will need about 700 to 900 GB of memory. But I can't find any details about how many CPUs were used or how long the processing took. Was this data processed on a single machine with a single disk?
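For reference, this is the rough arithmetic behind my estimates, assuming resource use scales roughly linearly with the amount of data processed (that linear scaling is my own assumption, not something stated in the readme):

```python
# Reference point from the readme/thread: 20% of a snapshot -> ~1.45 TB of text,
# ~15 TB of disk, and ~700-900 GB of RAM for deduplication.
REF_FRACTION = 0.20
REF_TEXT_TB = 1.45
REF_DISK_TB = 15.0
REF_DEDUP_RAM_GB = (700, 900)

def estimate(fraction: float) -> dict:
    """Scale the reference numbers linearly to a different snapshot fraction."""
    scale = fraction / REF_FRACTION
    return {
        "text_tb": REF_TEXT_TB * scale,
        "disk_tb": REF_DISK_TB * scale,
        "dedup_ram_gb": tuple(round(r * scale) for r in REF_DEDUP_RAM_GB),
    }

if __name__ == "__main__":
    for frac in (0.20, 0.30, 0.50):
        est = estimate(frac)
        print(f"{frac:.0%} of snapshot -> ~{est['text_tb']:.2f} TB text, "
              f"~{est['disk_tb']:.0f} TB disk, "
              f"~{est['dedup_ram_gb'][0]}-{est['dedup_ram_gb'][1]} GB dedup RAM")
```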