Open Practicinginhell opened 7 months ago
The -l flag sets the language, but that usage was for an older version of cc_net. The original project has been archived, but you can remove the "-l en" part from the command and edit the file here: https://github.com/togethercomputer/RedPajama-Data/blob/rp_v1/data_prep/cc/cc_net/cc_net/mine.py#L88C37-L88C37
and add the languages you want. For example, to keep only en, you would write:
lang_whitelist: Sequence[str] = [ "en" ]
Thank you! I fixed it the same way you described above. But I wonder why they don't update the README in the cc_net module. I think this is an issue in func_argparse, which doesn't seem to accept command-line arguments as a Sequence, because the error still happened even when I used the original cc_net repo.
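For anyone curious what sequence-valued parsing of a flag like -l looks like in general, here is a minimal sketch using the standard library's argparse. It is not func_argparse and not how cc_net actually parses its arguments; the names parse_languages and lang_demo are made up for illustration.

```python
import argparse
from typing import List, Optional, Sequence


def parse_languages(argv: Optional[Sequence[str]] = None) -> List[str]:
    """Parse a -l flag that accepts one or more language codes.

    Hypothetical helper for illustration only; cc_net does not define it.
    """
    parser = argparse.ArgumentParser(prog="lang_demo")
    parser.add_argument(
        "-l",
        "--lang_whitelist",
        nargs="+",      # collect one or more values into a list
        default=[],     # no flag given means an empty whitelist
        metavar="LANG",
        help="language codes to keep, e.g. -l en fr",
    )
    args = parser.parse_args(argv)
    return list(args.lang_whitelist)


if __name__ == "__main__":
    # With no command-line arguments this prints an empty list.
    print(parse_languages())
```

With nargs="+", a call such as parse_languages(["-l", "en", "fr"]) collects both codes into one list, which is the behavior the -l flag appears to expect.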
Hi everyone, I am trying to run cc_net using this command:
python -m cc_net --dump 2023-06 --task_parallelism 20 --num_shards 5000 -l en --mine_num_processes 20 --hash_in_mem 1
However, an "invalid argument value for sequence type" error occurs for the -l argument. Thank you in advance for any help.