huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

no qos driving to invalid qos specification #258

Closed solene-evain closed 2 months ago

solene-evain commented 3 months ago

Hi everyone, I want to do deduplication, so for now I'm running tests with minhash_deduplication.py. I'm using a server where I need to add account and constraint info, so I added it in the script (I modified slurm.py as well; see the sketch below the traceback for a possible way to avoid patching it). My problem is that I cannot specify any qos for that server: it is set automatically. I tried commenting out everything related to qos in those two scripts, but I still get this error:

```
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh3"
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh2"
2024-07-22 21:46:08.585 | INFO | datatrove.executor.slurm:launch_job:235 - Launching dependency job "mh1"
2024-07-22 21:46:08.591 | INFO | datatrove.executor.slurm:launch_job:270 - Launching Slurm job mh1 (1 tasks) with launch script "/lus/work/CT10/lig3801/sevain/try_datatrove//signatures/launch_script.slurm"
sbatch: error: INFO : As you didn't ask threads_per_core in your request: 2 was taken as default
sbatch: error: INFO : As you didn't ask ntasks or ntasks_per-node in your request, 1 task was taken as default
sbatch: error: Batch job submission failed: Invalid qos specification
Traceback (most recent call last):
  File "/lus/work/CT10/lig3801/sevain/try_datatrove/./minhash_deduplication.py", line 116, in <module>
    stage4.run()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 188, in run
    self.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 236, in launch_job
    self.depends.launch_job()
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 283, in launch_job
    self.job_id = launch_slurm_job(launch_file_contents, *args)
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 375, in launch_slurm_job
    return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/lus/home/CT10/lig3801/sevain/.conda/envs/datatrove/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sbatch', '--export=NONE,RUN_OFFSET=0', '/tmp/tmpnif55bvt']' returned non-zero exit status 1.
```

How can I have a qos problem when I'm not supposed to specify one? It's driving me insane. If anyone could provide any help, I would be grateful! Thanks
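For reference, a rough sketch of how the account and constraint info might be passed through the executor configuration rather than by patching slurm.py, assuming the installed datatrove version exposes an `sbatch_args` dict on `SlurmPipelineExecutor`; all account/constraint values below are placeholders:

```python
# Rough sketch, not verified against this datatrove version: extra #SBATCH
# directives are passed through `sbatch_args` instead of editing slurm.py.
# Note: depending on the version, `partition` may also be a required argument.
from datatrove.executor.slurm import SlurmPipelineExecutor

stage1 = SlurmPipelineExecutor(
    job_name="mh1",
    pipeline=[],                        # the minhash signature stage would go here
    tasks=1,
    time="00:20:00",
    logging_dir="./signatures",
    sbatch_args={                       # assumed parameter: extra raw sbatch directives
        "account": "my_account",        # placeholder
        "constraint": "my_constraint",  # placeholder
    },
)
```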

guipenedo commented 3 months ago

Hi, can you try adding a print here https://github.com/huggingface/datatrove/blob/main/src/datatrove/executor/slurm.py#L367 so that we can see the contents of the generated sbatch script?
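For context, a minimal sketch of what that debug print could look like; the surrounding function body is approximated from the traceback above (slurm.py line 375):

```python
import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args):
    # debug: dump the generated sbatch script before submitting it
    print(launch_file_contents)
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()  # make sure sbatch sees the full script on disk
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]
```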

solene-evain commented 3 months ago

Hi @guipenedo,

here is the content of the generated sbatch script:

```bash
#!/bin/bash
#SBATCH --account=XXX(hiddenAccount)XXX
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --constraint=XXX(hiddenPartitionName)XXX
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --job-name=mh1
#SBATCH --time=00:20:00
#SBATCH --output=.//signatures/slurmlogs/%A%a.out
#SBATCH --error=.//signatures/slurmlogs/%A%a.out
#SBATCH --array=0-0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=XXX(hiddenMail)XXX
echo "Starting data processing job mh1"
conda init bash
conda activate datatrove
source ~/.bashrc
set -xe
export PYTHONUNBUFFERED=TRUE
srun -l launch_pickled_pipeline /lus/work/try_datatrove//signatures/executor.pik
```

guipenedo commented 3 months ago

It seems that indeed no --qos is being set, but maybe you actually have to set one on your cluster? Also, it seems the total number of tasks isn't being set, did you also comment out that part? Ok, just saw --array.

solene-evain commented 3 months ago

According to the server documentation, "do not try to specify any qos, it's done automatically". So, just like me, you can't see any mention of qos other than what I already commented out in the two scripts I mentioned? I also had a look at the imported libraries just in case, but I couldn't find anything.

(many thanks for the help)

guipenedo commented 3 months ago

I think your error message can also mean that the specific combination of resources you are requesting is not allowed. I suggest you send your sbatch script to the cluster admins and ask them if they can spot any issues.
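As a possible self-check in the meantime, the standard Slurm accounting tool `sacctmgr` can list which account/QOS associations exist for a user; a sketch (field availability and output format can vary between Slurm setups):

```python
# Sketch: list the Slurm associations (account / partition / QOS) visible for
# the current user, to see which QOS values the cluster actually allows.
import getpass
import subprocess

out = subprocess.check_output(
    ["sacctmgr", "--noheader", "--parsable2", "show", "assoc",
     f"user={getpass.getuser()}", "format=Account,Partition,QOS"],
    text=True,
)
print(out)
```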

solene-evain commented 3 months ago

Thank you for the advice. I contacted them yesterday; I'm waiting for an answer!