togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Unavailable Parameters #102

Open zhentingqi opened 4 months ago


Hi! I am trying to download the crawl split 2023-50 by running the command python -m cc_net --dump 2023-50, but it raises the following error:

Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2023-50', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=1600, min_shard=-1, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=50, lang_whitelist=[], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=False, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (1600 jobs)
Traceback (most recent call last):
  File "/n/sw/Mambaforge-23.3.1-1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/n/sw/Mambaforge-23.3.1-1/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/n/home06/zhentingqi/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 638, in main
    all_files = mine(conf)
  File "/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 340, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 265, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/n/home06/zhentingqi/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
    jobs = ex.map_array(function, *args)
  File "/n/home06/zhentingqi/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
    return self._internal_process_submissions(submissions)
  File "/n/home06/zhentingqi/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/n/home06/zhentingqi/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
    array_ex.update_parameters(**self.parameters)
  File "/n/home06/zhentingqi/.local/lib/python3.10/site-packages/submitit/core/core.py", line 810, in update_parameters
    self._internal_update_parameters(**kwargs)
  File "/n/home06/zhentingqi/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 306, in _internal_update_parameters
    raise ValueError(
ValueError: Unavailable parameter(s): ['slurm_time']
Valid parameters are:
  - account (default: None)
  - additional_parameters (default: None)
  - array_parallelism (default: 256)
  - comment (default: None)
  - constraint (default: None)
  - cpus_per_gpu (default: None)
  - cpus_per_task (default: None)
  - dependency (default: None)
  - exclude (default: None)
  - exclusive (default: None)
  - gpus_per_node (default: None)
  - gpus_per_task (default: None)
  - gres (default: None)
  - job_name (default: 'submitit')
  - mail_type (default: None)
  - mail_user (default: None)
  - mem (default: None)
  - mem_per_cpu (default: None)
  - mem_per_gpu (default: None)
  - nodelist (default: None)
  - nodes (default: 1)
  - ntasks_per_node (default: None)
  - num_gpus (default: None)
  - partition (default: None)
  - qos (default: None)
  - setup (default: None)
  - signal_delay_s (default: 90)
  - srun_args (default: None)
  - stderr_to_stdout (default: False)
  - time (default: 5)
  - use_srun (default: True)
  - wckey (default: 'submitit')

Can someone please help me solve the problem? Thanks!
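
For reference, the message above rejects the name slurm_time while listing the unprefixed name time as valid, which suggests the prefixed key reached a check that only knows unprefixed names. The sketch below is a simplified illustration of that kind of check, not submitit's actual code: VALID_PARAMETERS is a small hand-picked subset of the list in the error, and check_parameters and strip_prefix are hypothetical helpers. Whether renaming the key is the right fix inside cc_net is not something the traceback confirms.

```python
# Illustrative only: a hand-picked subset of the valid names printed in the error above.
VALID_PARAMETERS = {"account", "job_name", "mem", "partition", "time"}


def check_parameters(**kwargs):
    """Hypothetical validator: reject any keyword that is not a known parameter name."""
    unavailable = sorted(set(kwargs) - VALID_PARAMETERS)
    if unavailable:
        raise ValueError(f"Unavailable parameter(s): {unavailable}")
    return kwargs


def strip_prefix(params, prefix="slurm_"):
    """Hypothetical helper: return a copy of params with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in params.items()
    }


if __name__ == "__main__":
    requested = {"slurm_time": 60, "mem": "4G"}
    try:
        check_parameters(**requested)  # the prefixed key is rejected
    except ValueError as err:
        print(err)
    print(check_parameters(**strip_prefix(requested)))  # the unprefixed keys pass
```

In this sketch the same settings are accepted once the slurm_ prefix is removed, which matches the shape of the error but says nothing about which component should do the renaming.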