newbietuan opened this issue 1 year ago
Hi @newbietuan -- the ccnet pipeline processes the warc files on the fly, so you won't need to store an entire cc dump on disk. I cannot say how much space the minified zh output will be, but as a guideline: for en, the output of the mined 2023-06 cc dump is around 800G.
I hope this helps!
Thank you very much, @mauriceweber, and sorry for the late reply.
During the pipeline run, is the wet_cache deleted automatically? When I ran a test, it did not seem to be deleted. So 800G is the final output -- how much disk space is needed for the whole run? Does it depend on the snapshot, something around 60-100T? Could you share your config.json and the machine configuration (memory, CPU, disk, runtime, etc.)? I have no idea what configuration I should plan for to get the data.
hi, @mauriceweber
When I run `python -m cc_net --config config/my_config.json` with the following config (note: `cleanup_after_regroup` should be the JSON boolean `true`, not the string `"True"`):

```json
{
  "hash_in_mem": 50,
  "dump": "2023-06",
  "num_shards": 8,
  "task_parallelism": 48,
  "num_segments_per_shard": -1,
  "mine_num_processes": 48,
  "cleanup_after_regroup": true,
  "lang_whitelist": ["zh"],
  "pipeline": [
    "dedup",
    "lid",
    "keep_lang",
    "sp",
    "lm",
    "pp_bucket",
    "minify",
    "split_by_segment"
  ],
  "execution": "debug",
  "output_dir": "zh_data",
  "mined_dir": "zh_mined_by_segment",
  "target_size": "1GB",
  "cache_dir": "zh_data/wet_cache"
}
```
The first shard contains 11,000 `.warc.wet.gz` files. My download speed is about 12 MB/s, so it seems downloading all 8 shards will take about 800 hours.
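One way to sanity-check a download-time estimate like this is simple arithmetic. The average compressed WET file size below (150 MB) is an assumption, not a measured value; plug in real sizes from your own cache to get a usable number:

```python
# Back-of-the-envelope estimate of sequential WET download time.
# ASSUMPTION: avg_file_mb = 150 is a guess; actual .warc.wet.gz sizes vary.
def estimate_download_hours(num_shards, files_per_shard, avg_file_mb, speed_mb_s):
    total_mb = num_shards * files_per_shard * avg_file_mb
    return total_mb / speed_mb_s / 3600  # seconds -> hours

hours = estimate_download_hours(
    num_shards=8, files_per_shard=11_000, avg_file_mb=150, speed_mb_s=12
)
print(f"~{hours:.0f} hours")  # ~306 hours under these assumptions
```

If the observed wall-clock time is much larger than this estimate, the gap usually comes from per-request overhead or server-side throttling rather than raw bandwidth.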
I noticed the `cleanup_after_regroup` parameter: during my test, after the output files were copied from zh_mined_by_segment_split to zh_mined_by_segment, nothing was deleted. Does this have something to do with the `target_size` setting? Does `target_size` refer to the size of the final JSON files?
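For intuition on what a size target does in a regroup step, here is a minimal sketch of grouping files into chunks of roughly `target_size` bytes. The function name and structure are illustrative only, not cc_net's actual implementation:

```python
# Illustrative sketch: pack files into groups of roughly `target_size` bytes.
# HYPOTHETICAL helper, not cc_net internals.
def regroup_by_size(file_sizes, target_size):
    """file_sizes: list of (name, size_in_bytes); returns list of name groups."""
    groups, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Start a new group once adding this file would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

files = [("a.json.gz", 400), ("b.json.gz", 700), ("c.json.gz", 300), ("d.json.gz", 500)]
print(regroup_by_size(files, target_size=1000))
# [['a.json.gz'], ['b.json.gz', 'c.json.gz'], ['d.json.gz']]
```

Under this reading, a `"target_size": "1GB"` setting would mean output files are split so each is roughly 1 GB, not that the total output is 1 GB.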
I am also confused about the `task_parallelism` and `mine_num_processes` parameters. After the first shard is downloaded, can subsequent downloads and processing run in parallel?
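Whether cc_net overlaps downloading and mining depends on its executor, but the general pattern being asked about (prefetch the next shard while mining the current one) can be sketched with a thread pool. This is an illustration of the scheduling idea only, not cc_net's scheduler; `download` and `mine` are stand-ins:

```python
import concurrent.futures
import time

def download(shard):
    # Stand-in for fetching one shard's WET files.
    time.sleep(0.05)
    return f"shard{shard}-data"

def mine(data):
    # Stand-in for the dedup/lid/lm processing of one shard.
    time.sleep(0.05)
    return f"mined({data})"

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    next_dl = pool.submit(download, 0)
    for shard in range(1, 4):
        data = next_dl.result()                 # wait for the current shard
        next_dl = pool.submit(download, shard)  # prefetch the next shard...
        results.append(mine(data))              # ...while mining this one
    results.append(mine(next_dl.result()))
print(results)
```

With this pattern the download of shard N+1 happens concurrently with the mining of shard N, so the pipeline is bound by whichever stage is slower rather than by their sum.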
I now have a machine with 64 CPUs, 512 GB RAM, and 1 TB of disk, and a download speed of about 12 MB/s. Can this configuration complete the processing of one snapshot?
Hello there. I want to get the zh data of one dump. How much disk space will be occupied during data download and processing, and how large will the final data be?