open-mmlab / mim

MIM Installs OpenMMLab Packages
https://openmim.readthedocs.io/en/latest/
Apache License 2.0
346 stars, 64 forks

Problems with using --srun-args "--async -o ${work_dir} " on Slurm #194

Open GhaSiKey opened 1 year ago

GhaSiKey commented 1 year ago

I tried to use mim to submit a training task asynchronously on Slurm with the following command:

mim train mmcls resnet101_b16x8_cifar10.py --launcher slurm --gpus 1 --gpus-per-node 1 --partition aide_dev --work-dir tmp --srun-args "--async -o /mnt/petrelfs/gaoshiqi/"

To submit asynchronously on Slurm and redirect the log to /mnt/petrelfs/gaoshiqi/, I added the parameter --srun-args "--async -o /mnt/petrelfs/gaoshiqi/". However, although the command runs successfully, the task is not submitted to the Slurm cluster, and I can't find my log /mnt/petrelfs/gaoshiqi/phoenix-slurm-5181985.out.

The output is as follows: [image]

I tried to find my log, but it does not exist: [image]

GhaSiKey commented 1 year ago

I found the cause of the problem: with asynchronous submission, a batch script is generated automatically, and its content is wrong. [image] [image]

If no job-name parameter is given, mim generates one automatically, which misplaces the content of the batch script and causes the task submission to fail.
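Based on that finding, one possible workaround is to set the job name explicitly so that mim does not generate one. The sketch below only assembles such a command line and does not run it. It assumes the job name can be passed inside --srun-args as the standard srun option --job-name; the helper build_mim_train_cmd and the job name resnet101_cifar10 are made up for illustration, and whether this avoids the misplaced batch script on a given cluster has not been verified.

```python
import shlex
from typing import List


def build_mim_train_cmd(job_name: str, log_dir: str) -> List[str]:
    """Hypothetical helper: assemble the mim train command from this issue,
    with an explicit job name added to the srun arguments."""
    # Assumption: --job-name is forwarded to srun through --srun-args.
    # --async is taken from the issue as-is; it may be specific to this cluster.
    srun_args = f"--async -o {log_dir} --job-name={job_name}"
    return [
        "mim", "train", "mmcls", "resnet101_b16x8_cifar10.py",
        "--launcher", "slurm",
        "--gpus", "1",
        "--gpus-per-node", "1",
        "--partition", "aide_dev",
        "--work-dir", "tmp",
        "--srun-args", srun_args,
    ]


# Example values; the job name is an arbitrary placeholder.
cmd = build_mim_train_cmd("resnet101_cifar10", "/mnt/petrelfs/gaoshiqi/")

# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

Building the command as a list and quoting it with shlex.join keeps the whole --srun-args value as a single argument, so the quoting cannot be broken when the string is copied into a shell.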

ice-tong commented 1 year ago

This seems to be a bug in srun: [image]