mamiller615 closed this 1 week ago
Hi,
Thanks for reporting the issue. I will take a look and get back to you.
Note: there is an unmasked e-mail address in your last log file snippet.
Edit: I have edited your message to remove it.
Bug confirmed - I have created a new integration test case that demonstrates the bug in the current version of Slurm-Mail.
I am working on a fix.
Interestingly, the Slurm job ID for a cancelled job array that never dispatched is of the form X_[Y-Z], e.g. 1_[1-5]
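For anyone hitting the same format elsewhere, here is a minimal sketch of how such IDs could be told apart. It is not the code used in Slurm-Mail: the helper name parse_job_id and the regular expression are assumptions made for illustration, and the sketch only covers plain IDs (X), single array tasks (X_Y) and the bracketed form (X_[...]) mentioned above.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration only; not part of Slurm-Mail.
_JOB_ID_RE = re.compile(
    r"(?P<job>\d+)"               # base job ID, e.g. "1"
    r"(?:_(?:(?P<task>\d+)"       # a single array task, e.g. "_3"
    r"|\[(?P<spec>[^\]]+)\]))?"   # a bracketed task spec, e.g. "_[1-5]"
)


def parse_job_id(raw: str) -> Tuple[int, Optional[int], Optional[str]]:
    """Split a job ID string into (job_id, task_id, task_spec).

    task_id is set for IDs like "42_7"; task_spec is set for IDs like
    "1_[1-5]"; both are None for a plain ID like "42".
    """
    match = _JOB_ID_RE.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"unrecognised job ID: {raw!r}")
    task = match.group("task")
    return (
        int(match.group("job")),
        int(task) if task is not None else None,
        match.group("spec"),
    )
```

With this sketch, a caller can check whether the third element is set to recognise an array whose tasks were never dispatched.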
Issue fixed in release 4.21.
Thanks again for reporting the issue.
Thanks for the quick response!
Many thanks for the sponsorship - that's my first one ever!
Versions
OS version: Rocky Linux 9.4
Slurm version: 22.05.9-1
Slurm Mail version: 4.20
Describe the bug
We have seen that if a user submits an array job and cancels it before any tasks start, the slurm-send-mail program generates an error message in the /var/log/slurm-mail/slurm-send-mail.log log file, and the slurm-mail file for the job is not deleted. In our case, thousands of files accumulated over several months, and slurm-send-mail was continually trying to reprocess them. We ended up just deleting these older slurm-mail files.
To replicate this, I submitted a simple shell script as an array job and immediately canceled the job:
and the following messages were seen in the slurm-send-mail.log file.
It seemed like the "jobs" Python list was empty in this situation, so I was able to fix (or at least avoid) the issue by modifying line 360 in /usr/lib/python3.9/site-packages/slurmmail/cli.p from:
to:
With this change in place, the problematic slurm-mail file was processed, no errors arose, and no email was sent, which I think is fine.
Further testing shows that Slurm-Mail is working as it should.
While the change I made works, there may be a more intelligent way to deal with the situation.
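To illustrate the general idea of guarding against an empty job list, here is a rough sketch. It is hypothetical and is not the actual Slurm-Mail code or the exact change I made: process_spool_file, the list-of-dicts shape of jobs, and the send_email callback are all names assumed for the example.

```python
import logging
from pathlib import Path
from typing import Callable, List


def process_spool_file(
    spool_file: Path,
    jobs: List[dict],
    send_email: Callable[[dict], None],
) -> int:
    """Send one e-mail per job record, then delete the spool file.

    Hypothetical sketch: an empty jobs list is treated as "nothing to
    send" rather than as an error, so the spool file is still removed
    and is not picked up again on the next run.
    """
    if not jobs:
        logging.warning("no job records for %s; skipping e-mail", spool_file.name)
    for job in jobs:
        send_email(job)
    # Remove the spool file whether or not any e-mail was sent.
    spool_file.unlink(missing_ok=True)
    return len(jobs)
```

The point of the sketch is only that the clean-up step runs even when there is nothing to send, which is what stops the files from accumulating.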
Logs
Same as above example...
Thanks for all of your work in putting out a great Email tool for SLURM!!