togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

cc-net failure on slurm cluster #72

Closed hicotton02 closed 8 months ago

hicotton02 commented 9 months ago

I went from doing the cc-net pulls locally to using Slurm. When I try to execute the following:

theskaz@c4140:/nfs/slow/RedPajama-Data/data_prep/cc/cc_net$ python -m cc_net --dump 2020-05 --num_shards 5000 --hash_in_mem 1
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2020-05', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=5000, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['en'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (3983 jobs)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/home/theskaz/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 631, in main
    all_files = mine(conf)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 334, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 263, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
    jobs = ex.map_array(function, *args)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 328, in _internal_process_submissions
    array_ex.update_parameters(**self.parameters)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 810, in update_parameters
    self._internal_update_parameters(**kwargs)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 302, in _internal_update_parameters
    raise ValueError(
ValueError: Unavailable parameter(s): ['slurm_time']
Valid parameters are:
  - account (default: None)
  - additional_parameters (default: None)
  - array_parallelism (default: 256)
  - comment (default: None)
  - constraint (default: None)
  - cpus_per_gpu (default: None)
  - cpus_per_task (default: None)
  - exclude (default: None)
  - exclusive (default: None)
  - gpus_per_node (default: None)
  - gpus_per_task (default: None)
  - gres (default: None)
  - job_name (default: 'submitit')
  - mem (default: None)
  - mem_per_cpu (default: None)
  - mem_per_gpu (default: None)
  - nodes (default: 1)
  - ntasks_per_node (default: None)
  - num_gpus (default: None)
  - partition (default: None)
  - qos (default: None)
  - setup (default: None)
  - signal_delay_s (default: 90)
  - srun_args (default: None)
  - stderr_to_stdout (default: False)
  - time (default: 5)
  - wckey (default: 'submitit')
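Reading that parameter list, it looks like the unprefixed names (time, partition, mem, ...) belong to submitit's plain SlurmExecutor, while the slurm_-prefixed variants such as slurm_time are only understood by the AutoExecutor, so execution.py may be handing AutoExecutor-style names to a SlurmExecutor here. A minimal sketch of the difference as I understand it (the folder and values are illustrative, not taken from the repo):

```python
import submitit

# AutoExecutor accepts cluster-prefixed overrides such as slurm_time /
# slurm_partition and translates them for whichever backend it picks.
auto_ex = submitit.AutoExecutor(folder="logs")
auto_ex.update_parameters(slurm_time=720, slurm_partition="main")

# A plain SlurmExecutor only accepts the unprefixed names listed in the
# ValueError above (time is in minutes, like a bare number to sbatch --time).
slurm_ex = submitit.SlurmExecutor(folder="logs")
slurm_ex.update_parameters(time=720, partition="main")
```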

I then went into execution.py, commented out the slurm_time parameter (lines 58-61), and tried again, which returns this error:

theskaz@c4140:/nfs/slow/RedPajama-Data/data_prep/cc/cc_net$ python -m cc_net --dump 2020-05 --num_shards 5000 --hash_in_mem 1
Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2020-05', output_dir=PosixPath('data'), mined_dir='mined', execution='auto', num_shards=5000, num_segments_per_shard=-1, metadata=None, min_len=300, hash_in_mem=1, lang_whitelist=['en'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=[], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/data/cutoff.csv'), lm_languages=None, mine_num_processes=16, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=None)
Submitting _hashes_shard in a job array (3983 jobs)
sbatch: error: Batch job submission failed: Invalid job array specification
subprocess.CalledProcessError: Command '['sbatch', '/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/data/logs/submission_file_4751207924ea4dde903eace6afeb2a38.sh']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 18, in <module>
    main()
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/__main__.py", line 14, in main
    func_argparse.parse_and_call(cc_net.mine.get_main_parser())
  File "/home/theskaz/.local/lib/python3.10/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
    return command(**parsed_args)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 631, in main
    all_files = mine(conf)
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 334, in mine
    hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/mine.py", line 263, in hashes
    ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
  File "/nfs/slow/RedPajama-Data/data_prep/cc/cc_net/cc_net/execution.py", line 106, in map_array_and_wait
    jobs = ex.map_array(function, *args)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 771, in map_array
    return self._internal_process_submissions(submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
    return self._executor._internal_process_submissions(delayed_submissions)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
    first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/core.py", line 934, in _submit_command
    output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
  File "/home/theskaz/.local/lib/python3.10/site-packages/submitit/core/utils.py", line 352, in __call__
    raise FailedJobError(stderr) from subprocess_error
submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification

I'm not sure where to go from here. I can verify that Slurm is working and all compute nodes are in the idle state.
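If it helps narrow things down, my working guess (unconfirmed) is that the 3983-task array exceeds the cluster's MaxArraySize limit, since sbatch rejects oversized arrays with this "Invalid job array specification" message. A quick check, with the task count copied from the log above:

```python
import re
import subprocess

# Compare the submitted array size (3983 tasks in the log above) against
# Slurm's MaxArraySize; sbatch rejects arrays larger than this limit.
num_array_tasks = 3983

config = subprocess.run(
    ["scontrol", "show", "config"], capture_output=True, text=True, check=True
).stdout
match = re.search(r"MaxArraySize\s*=\s*(\d+)", config)
if match:
    max_array_size = int(match.group(1))
    print(f"MaxArraySize = {max_array_size}")
    if num_array_tasks > max_array_size:
        print("Array too large: submit fewer jobs or raise MaxArraySize in slurm.conf.")
```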