huggingface / cosmopedia

Apache License 2.0

python deduplicate_dataset.py #12

Status: Open. simplew2011 opened this issue 8 months ago

simplew2011 commented 8 months ago

https://github.com/huggingface/cosmopedia/blob/main/deduplication/deduplicate_dataset.py

2024-02-22 14:17:57.759 | INFO     | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh3"
2024-02-22 14:17:57.759 | INFO     | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh2"
2024-02-22 14:17:57.759 | INFO     | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh1"
2024-02-22 14:17:57.763 | INFO     | datatrove.executor.slurm:launch_job:249 - Launching Slurm job mh1 (120 tasks) with launch script "/home/wzp/code/LLMData/open_source/datatrove/data/minhash_logs/signatures/launch_script.slurm"
Traceback (most recent call last):
  File "/home/wzp/code/LLMData/open_source/datatrove/demo.py", line 110, in <module>
    stage4.run()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 169, in run
    self.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 217, in launch_job
    self.depends.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 217, in launch_job
    self.depends.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 217, in launch_job
    self.depends.launch_job()
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 262, in launch_job
    self.job_id = launch_slurm_job(launch_file_contents, *args)
  File "/home/wzp/code/LLMData/open_source/datatrove/src/datatrove/executor/slurm.py", line 349, in launch_slurm_job
    return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/wzp/anaconda3/envs/3.10/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)

loubnabnl commented 7 months ago

Can you provide more details about your setup? For example, did you run it on a Slurm cluster?
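
One quick way to check, as a minimal sketch: the traceback above shows the Slurm executor launching jobs by calling `sbatch` through `subprocess.check_output`, so the snippet below tests whether an `sbatch` executable can be found on `PATH`. The helper name `slurm_available` is hypothetical and not part of datatrove, and a missing `sbatch` is only one possible cause, since the traceback is cut off before the final error message.

```python
import shutil


def slurm_available(command: str = "sbatch") -> bool:
    """Return True if `command` resolves to an executable on PATH.

    Hypothetical helper for illustration; not part of datatrove.
    """
    return shutil.which(command) is not None


if __name__ == "__main__":
    if slurm_available():
        # shutil.which returns the full path of the executable it found.
        print("sbatch found at", shutil.which("sbatch"))
    else:
        print("sbatch not found on PATH; this machine may not be a Slurm login node.")
```

If this reports that `sbatch` is missing, the script is probably being run outside a Slurm cluster, and sharing that detail would help narrow down the issue.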