togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

How much disk space will be used? #47

Open newbietuan opened 1 year ago

newbietuan commented 1 year ago

Hello there. I want to get the zh data of one dump. How much disk space will be occupied during the download and processing, and how large will the final data be?

mauriceweber commented 1 year ago

Hi @newbietuan -- the ccnet pipeline processes the warc files on the fly, so you won't need to store an entire cc dump on disk. I cannot say how much space the minified zh output will be, but as a guideline: for en, the output of the mined 2023-06 cc dump is around 800G.
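
If you want a very rough zh number, one way is to scale the ~800 GB en figure by the relative share of zh pages in the crawl. The shares in the sketch below are placeholder assumptions, not measured values, so treat the result as an order-of-magnitude guess only:

```python
# Very rough scaling of the ~800 GB English output to another language.
# en_share and zh_share are placeholder assumptions -- substitute the
# actual language distribution of the dump you are processing.
en_output_gb = 800    # mined en output for the 2023-06 dump (see above)
en_share = 0.45       # assumed fraction of the crawl that is English
zh_share = 0.05       # assumed fraction of the crawl that is Chinese

zh_output_gb = en_output_gb * zh_share / en_share
print(f"very rough zh output estimate: ~{zh_output_gb:.0f} GB")
```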

I hope this helps!

newbietuan commented 1 year ago

Thank you very much. @mauriceweber

Sorry for replying late.

During the pipeline run, is the wet_cache supposed to be deleted automatically? When I ran a test it did not seem to be deleted. So 800 GB is the final output; how much disk space is needed for the whole process? Does it depend on the snapshot -- around 60-100 TB? Could you share your config.json and the configuration of your machine (memory, CPU, disk, runtime, etc.)? I have no idea what configuration I should plan for to get the data.
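
For now I reclaim the space by hand after each test run with something like the sketch below. This is not a cc_net feature, just a manual workaround, and the path is whatever my cache_dir points at:

```python
from pathlib import Path

# Manual cleanup of the downloaded WET cache after a finished test run.
# Not part of cc_net -- just a workaround. Point cache_dir at the same
# directory as the cache_dir entry in your config.
cache_dir = Path("zh_data/wet_cache")

freed = 0
for wet in cache_dir.glob("**/*.warc.wet.gz"):
    freed += wet.stat().st_size
    wet.unlink()  # delete the cached segment to reclaim disk space
print(f"freed ~{freed / 1e9:.1f} GB from {cache_dir}")
```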

newbietuan commented 1 year ago

Hi @mauriceweber, when I run python -m cc_net --config config/my_config.json with the following config:

    {
      "hash_in_mem": 50,
      "dump": "2023-06",
      "num_shards": 8,
      "task_parallelism": 48,
      "num_segments_per_shard": -1,
      "mine_num_processes": 48,
      "cleanup_after_regroup": "True",
      "lang_whitelist": ["zh"],
      "pipeline": ["dedup", "lid", "keep_lang", "sp", "lm", "pp_bucket", "minify", "split_by_segment"],
      "execution": "debug",
      "output_dir": "zh_data",
      "mined_dir": "zh_mined_by_segment",
      "target_size": "1GB",
      "cache_dir": "zh_data/wet_cache"
    }

the first shard contains 11000 .warc.wet.gz files. The download speed is about 12 MB/s, so it seems downloading all 8 shards will take about 800 hours. I also noticed the cleanup_after_regroup parameter: during my test, after the output files were copied from zh_mined_by_segment_split to zh_mined_by_segment, nothing was deleted. Does this have something to do with the value of target_size? Does target_size refer to the size of the final json files?
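
A quick way to sanity-check the download time is total bytes divided by bandwidth. In the sketch below the average compressed WET size is an assumption (it varies per crawl and per segment), so plug in the sizes you actually observe; the other numbers come from the config and the observed speed above:

```python
# Back-of-envelope check of the download time: total bytes / bandwidth.
# avg_wet_gz_mb is an assumption -- substitute the sizes you actually see.
num_shards = 8
files_per_shard = 11_000        # observed for the first shard
avg_wet_gz_mb = 150             # assumed average size of one .warc.wet.gz
download_speed_mb_s = 12        # observed download speed

total_mb = num_shards * files_per_shard * avg_wet_gz_mb
hours = total_mb / download_speed_mb_s / 3600
print(f"~{total_mb / 1e6:.1f} TB to fetch, ~{hours:.0f} hours at {download_speed_mb_s} MB/s")
```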

I am also a bit confused about the task_parallelism and mine_num_processes parameters. After the first shard is downloaded, can subsequent downloads and processing be executed in parallel?
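
To be concrete about the overlap I am hoping for -- downloading the next segments while the already downloaded ones are being mined -- here is a small illustration. It only shows the pattern; download_segment and process_segment are stand-ins, not cc_net functions, and cc_net's own scheduling may work differently:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

# Illustration only: an I/O-bound thread pool keeps downloading segments
# while a CPU-bound process pool mines the ones that have already arrived.

def download_segment(segment_id: int) -> str:
    time.sleep(0.1)                      # stand-in for the 12 MB/s download
    return f"segment_{segment_id}.warc.wet.gz"

def process_segment(path: str) -> str:
    time.sleep(0.2)                      # stand-in for dedup / lid / lm scoring
    return f"mined:{path}"

if __name__ == "__main__":
    segments = range(16)
    with ThreadPoolExecutor(max_workers=4) as downloads, \
         ProcessPoolExecutor(max_workers=4) as miners:
        # as each download finishes, hand the file to the process pool
        futures = [miners.submit(process_segment, path)
                   for path in downloads.map(download_segment, segments)]
        for fut in futures:
            print(fut.result())
```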

Right now I have a machine with 64 CPUs, 512 GB of RAM, and 1 TB of disk, and the download speed is about 12 MB/s. Can this configuration complete the processing of a snapshot?