togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Invalid argument when running cc_net #82

Open Practicinginhell opened 7 months ago

Practicinginhell commented 7 months ago

Hi everyone, I am trying to run cc_net with this command:

python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1

It fails with an invalid argument value error for a sequence type on the -l argument. Thank you in advance for any help.

hicotton02 commented 7 months ago

The -l flag is for the language. It was for an older version of CC Net. The original project has been archived, but you can remove the "-l en" part and edit the file here: https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/cc/cc_net/cc_net/mine.py#L88C37-L88C37

Then add the languages you want. For example, to keep only en, you would write:

lang_whitelist: Sequence[str] = [ "en" ]
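
For context, here is a minimal sketch of what that edit amounts to. It assumes the config in mine.py is a NamedTuple with a lang_whitelist field; the class below is a hypothetical stand-in with only two fields, not the actual cc_net Config.

```python
from typing import NamedTuple, Sequence


class Config(NamedTuple):
    """Hypothetical stand-in for the config in cc_net/mine.py (two fields only)."""

    dump: str = "2023-06"
    # Hard-code the languages to keep instead of passing -l on the command line.
    lang_whitelist: Sequence[str] = ["en"]


conf = Config()
print(conf.lang_whitelist)
```

With the default set in the file, the command can be run without "-l en" and the whitelist still contains only en.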

Practicinginhell commented 7 months ago

Thank you! I fixed it the way you described above. But I wonder why the README in the cc_net module has not been updated. I think this is an issue in func_argparse, which does not seem to accept subsequent arguments as a Sequence, because the error still happened even when I used the original cc_net repo.
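
To illustrate the kind of behavior in question, here is a sketch using the standard library's argparse, not func_argparse (whose internals are not shown in this thread). With plain argparse, a flag only collects several values into a list when it is declared with nargs; the parser and function names below are made up for the example.

```python
import argparse
from typing import List, Sequence


def parse_langs(argv: Sequence[str]) -> List[str]:
    """Parse a list-valued -l flag with plain argparse (illustrative only)."""
    parser = argparse.ArgumentParser(prog="demo")
    # nargs="*" makes argparse gather all following values into a list.
    parser.add_argument("-l", "--lang_whitelist", nargs="*", default=[])
    args = parser.parse_args(list(argv))
    return args.lang_whitelist


print(parse_langs(["-l", "en", "fr"]))
print(parse_langs([]))
```

If a wrapper library maps a Sequence[str] annotation to a flag without an equivalent of nargs, a value such as "-l en" may be rejected, which would be consistent with the error described above.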