Open GhaSiKey opened 1 year ago
I found the cause of the problem. With asynchronous submission, a batch script is generated automatically, and its content is wrong. If the job-name parameter is not supplied, mim generates one automatically, which misplaces the content of the batch script and causes the task submission to fail.
It seems to be a bug in srun:
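As a possible workaround based on the cause described above, the job name can be passed explicitly inside --srun-args. The sketch below only builds the command line from the original report. The helper name `build_mim_train_command` and the job name `cifar10_async` are hypothetical, and I have not verified that this avoids the misplaced batch script on every cluster.

```python
import shlex


def build_mim_train_command(srun_args, job_name):
    """Build the `mim train` argv, adding an explicit job name to --srun-args.

    Hypothetical helper for illustration; it is not part of mim.
    """
    tokens = shlex.split(srun_args)
    # Add a job name only if the caller has not already supplied one
    # (srun accepts both --job-name and the short form -J).
    has_job_name = any(
        t.startswith("--job-name") or t.startswith("-J") for t in tokens
    )
    if not has_job_name:
        tokens = [f"--job-name={job_name}"] + tokens
    return [
        "mim", "train", "mmcls", "resnet101_b16x8_cifar10.py",
        "--launcher", "slurm",
        "--gpus", "1",
        "--gpus-per-node", "1",
        "--partition", "aide_dev",
        "--work-dir", "tmp",
        # mim receives all extra srun options as one quoted string.
        "--srun-args", shlex.join(tokens),
    ]


# Print a shell-ready command line using the srun arguments from the report.
cmd = build_mim_train_command("--async -o /mnt/petrelfs/gaoshiqi/", "cifar10_async")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell. `shlex.join` quotes the --srun-args value so that it reaches mim as a single argument.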
I tried to use mim to submit training tasks asynchronously on Slurm, using the following command:
mim train mmcls resnet101_b16x8_cifar10.py --launcher slurm --gpus 1 --gpus-per-node 1 --partition aide_dev --work-dir tmp --srun-args "--async -o /mnt/petrelfs/gaoshiqi/"
To submit asynchronously on Slurm and redirect the log to /mnt/petrelfs/gaoshiqi/, I added the parameter --srun-args "--async -o /mnt/petrelfs/gaoshiqi/".
However, the command executes successfully but the task is not submitted to the Slurm cluster, and I can't find my log /mnt/petrelfs/gaoshiqi/phoenix-slurm-5181985.out. The log is as follows: I tried to find my log, but it does not exist: