Closed: solene-evain closed this issue 2 months ago
Hi, can you try adding a print here https://github.com/huggingface/datatrove/blob/main/src/datatrove/executor/slurm.py#L367 so that we can see the contents of the generated sbatch script?
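For reference, a minimal sketch of what that debug print could look like. The function body paraphrases `launch_slurm_job` from `slurm.py` (the tempfile/sbatch structure matches the traceback below, but the exact code may differ between versions); the added line is just the `print` at the top.

```python
# Sketch: launch_slurm_job with a debug print added, so the generated
# sbatch script is echoed to stdout before submission.
import subprocess
import tempfile


def launch_slurm_job_debug(launch_file_contents, *args):
    print(launch_file_contents)  # <-- the suggested debug print
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        # submit the script and return the job id (last token of sbatch's output)
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
```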
Hi @guipenedo,
here is the content of the generated sbatch script:
```bash
#!/bin/bash
echo "Starting data processing job mh1"
conda init bash
conda activate datatrove
source ~/.bashrc
set -xe
export PYTHONUNBUFFERED=TRUE
srun -l launch_pickled_pipeline /lus/work/try_datatrove//signatures/executor.pik
```
It seems that indeed no --qos is being set, but maybe you actually have to set one on your cluster? Also, it seems the actual total number of tasks isn't being set; did you also comment out that part? (Ok, I just saw --array.)
According to the server documentation, "do not try to specify any qos, it's done automatically". So, just like me, you can't see any mention of qos other than what I already commented out in the two scripts I mentioned? I also had a look at the imported libraries just in case, but I couldn't find anything.
(many thanks for the help)
I think your error message can also mean the specific combination of resources you are requesting is not allowed, I suggest you send the cluster admins your sbatch script and ask them if they can spot any issues
Thank you for the advice. I contacted them yesterday, I'm waiting for an answer!
Hi everyone, I want to do deduplication, so for now I'm running tests using minhash_deduplication.py. I'm using a server where I need to add account and constraint info, so I added it in the script (also modifying slurm.py). My problem now is that I cannot specify any qos for that server; it is set automatically... I tried commenting out everything related to qos in those two scripts, but I still get this error:
```
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh3"
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh2"
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh1"
2024-07-22 21:46:08.591 | INFO | datatrove.executor.slurm:launch_job:270 - Launching Slurm job mh1 (1 tasks) with launch script "/lus/work/CT10/lig3801/sevain/try_datatrove//signatures/launch_script.slurm"
sbatch: error: INFO : As you didn't ask threads_per_core in your request: 2 was taken as default
sbatch: error: INFO : As you didn't ask ntasks or ntasks_per-node in your request, 1 task was taken as default
sbatch: error: Batch job submission failed: Invalid qos specification
Traceback (most recent call last):
  File "/lus/work/CT10/lig3801/sevain/try_datatrove/./minhash_deduplication.py", line 116, in <module>
    stage4.run()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 188, in run
    self.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 283, in launch_job
    self.job_id = launch_slurm_job(launch_file_contents, *args)
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 375, in launch_slurm_job
    return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--export=NONE,RUN_OFFSET=0', '/tmp/tmpnif55bvt']' returned non-zero exit status 1.
```
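As a side note, `subprocess.check_output` only captures stdout, so sbatch's own error message is easy to miss. A hedged debugging sketch (not datatrove code; the `submit` function and its `sbatch_cmd` parameter are made up for illustration) that captures stderr and surfaces it in the exception:

```python
# Debugging sketch: submit a script and include the scheduler's stderr
# (e.g. "Invalid qos specification") in the raised error.
import subprocess


def submit(script_path, *args, sbatch_cmd="sbatch"):
    result = subprocess.run(
        [sbatch_cmd, *args, script_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # surface the scheduler's own message instead of a bare CalledProcessError
        raise RuntimeError(f"sbatch failed: {result.stderr.strip()}")
    return result.stdout.split()[-1]  # sbatch prints the job id last
```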
How can I have a qos problem when I'm not supposed to specify one? It's driving me insane. If anyone could provide any help, I would be grateful! Thanks
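To illustrate what "commenting out everything related to qos" should achieve, here is a hypothetical sketch (not datatrove's actual header-generation code) of building an sbatch header from a dict and simply skipping empty options, so a disabled qos emits no `#SBATCH --qos` line at all:

```python
# Hypothetical sketch: build an sbatch header, dropping unset options
# so that e.g. qos=None produces no "#SBATCH --qos=..." line.
def build_sbatch_header(options: dict) -> str:
    lines = ["#!/bin/bash"]
    for key, value in options.items():
        if value:  # skip None/empty entries such as a disabled qos
            lines.append(f"#SBATCH --{key}={value}")
    return "\n".join(lines)
```

If the generated script contains no `--qos` line (as in the script pasted above) and sbatch still reports "Invalid qos specification", the rejection is coming from the cluster's automatic qos assignment rather than from the script itself.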