neilmunday / slurm-mail

Slurm-Mail is a drop-in replacement for Slurm's e-mails that gives users much more information about their jobs than the standard Slurm e-mails.
GNU General Public License v3.0

slurm-send-mail causes error if array job cancelled before any tasks start #141

Closed mamiller615 closed 1 week ago

mamiller615 commented 2 weeks ago

Versions

OS version: Rocky Linux 9.4
Slurm version: 22.05.9-1
Slurm Mail version: 4.20

Describe the bug

We have seen that if a user submits an array job and cancels it before any tasks start, slurm-send-mail logs an error in /var/log/slurm-mail/slurm-send-mail.log and the Slurm-Mail spool file for the job is not deleted. In our case, thousands of these files accumulated over several months, and slurm-send-mail kept trying to reprocess them. We ended up just deleting the older spool files.
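For anyone checking whether they are affected, a rough Python sketch for listing old spool files follows. It assumes the file names follow the `<jobid>_<timestamp>.mail` pattern seen in the log lines in this report and that the second part is a Unix timestamp. `find_stale_spool_files` and the 7-day cutoff are hypothetical and not part of Slurm-Mail.

```python
import re
import time
from pathlib import Path

# Assumed name pattern, taken from the log lines in this report:
# <jobid>_<unix timestamp>.mail, e.g. 9575212_1724862516.499688.mail
SPOOL_NAME = re.compile(r"^(?P<job_id>\d+)_(?P<stamp>\d+(?:\.\d+)?)\.mail$")


def find_stale_spool_files(spool_dir, max_age_days=7.0, now=None):
    """Return (path, job_id, age_in_days) for spool files older than the cutoff."""
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(spool_dir).glob("*.mail")):
        match = SPOOL_NAME.match(path.name)
        if match is None:
            continue  # not a name we recognise, leave it alone
        age_days = (now - float(match.group("stamp"))) / 86400.0
        if age_days > max_age_days:
            stale.append((path, int(match.group("job_id")), age_days))
    return stale


if __name__ == "__main__":
    # Only lists candidates; nothing is deleted.
    for path, job_id, age_days in find_stale_spool_files("/var/spool/slurm-mail"):
        print(f"{path} (job {job_id}) is {age_days:.1f} days old")
```

The sketch only reports files, so any clean-up remains a manual decision.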

To replicate this, I submitted a simple shell script as an array job and immediately canceled the job:

[USER@jhpce01 class-scripts]$ sbatch --array=1-5 --mail-type=FAIL,END --mail-user=MYEMAIL script1
Submitted batch job 9575212
[USER@jhpce01 class-scripts]$ scancel 9575212
[USER@jhpce01 class-scripts]$ squeue --me
[USER@jhpce01 class-scripts]$

and the following messages were seen in the slurm-send-mail.log file.

. . .
2024/08/28 12:29:00:INFO: processing: /var/spool/slurm-mail/9575210_1724862485.197255.mail
2024/08/28 12:29:00:INFO: Sending e-mail to: ANOTHERUSER using ANOTHERUSER-EMAIL for job 9575210 (Ended) via SMTP server localhost:25
2024/08/28 12:29:00:INFO: Deleting: /var/spool/slurm-mail/9575210_1724862485.197255.mail
2024/08/28 12:29:00:INFO: processing: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:29:00:ERROR: Failed to process: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:29:00:ERROR: list index out of range
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/slurmmail/cli.py", line 971, in send_mail_main
    __process_spool_file(f, smtp_conn, options)
  File "/usr/lib/python3.9/site-packages/slurmmail/cli.py", line 361, in __process_spool_file
    jobs = [jobs[0]]
IndexError: list index out of range

It seemed like the "jobs" Python list was empty in this situation, so I was able to fix (or at least avoid) the issue by modifying line 360 in /usr/lib/python3.9/site-packages/slurmmail/cli.py from:

    if array_summary or len(jobs) == 1:

to:

    if (array_summary and len(jobs) != 0) or len(jobs) == 1:

With this change in place, the problematic spool file was processed, no errors arose, and no e-mail was sent, which I think is fine.

2024/08/28 12:31:49:INFO: processing: /var/spool/slurm-mail/9575212_1724862516.499688.mail
2024/08/28 12:31:49:INFO: Deleting: /var/spool/slurm-mail/9575212_1724862516.499688.mail
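To show why the extra check helps, here is a stand-alone sketch of the two conditions. It is not the actual Slurm-Mail code: `pick_jobs_original` and `pick_jobs_guarded` are hypothetical helpers, and only the `jobs` and `array_summary` names and the two `if` conditions come from the lines quoted above.

```python
def pick_jobs_original(jobs, array_summary):
    """Condition as quoted from line 360: fails when jobs is empty."""
    if array_summary or len(jobs) == 1:
        jobs = [jobs[0]]  # raises IndexError when jobs == []
    return jobs


def pick_jobs_guarded(jobs, array_summary):
    """Same condition with the extra emptiness check from this report."""
    if (array_summary and len(jobs) != 0) or len(jobs) == 1:
        jobs = [jobs[0]]
    return jobs


if __name__ == "__main__":
    try:
        pick_jobs_original([], array_summary=True)
    except IndexError as exc:
        print(f"original: IndexError: {exc}")
    print(f"guarded: {pick_jobs_guarded([], array_summary=True)}")
```

With an empty list the guarded version returns the list unchanged, so there is no job left to report on. That would be consistent with the spool file being processed and deleted without an e-mail.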

Further testing shows that Slurm-Mail is working as it should.

While the change I made works, there may be a more intelligent way to deal with the situation.

Logs

Same as the example above.

Thanks for all of your work in putting out a great e-mail tool for Slurm!!

neilmunday commented 2 weeks ago

Hi,

Thanks for reporting the issue. I will take a look and get back to you.

neilmunday commented 2 weeks ago

Note: there is an unmasked e-mail address in your last log file snippet.

Edit: I have edited your message to remove it.

neilmunday commented 2 weeks ago

Bug confirmed - I have created a new integration test case that demonstrates the bug in the current version of Slurm-Mail.

I am working on a fix.

neilmunday commented 2 weeks ago

Interestingly, the Slurm job ID for a cancelled job array that never dispatched is of the form X_[Y-Z], e.g.:

1_[1-5]
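
As an illustration only, here is a small Python sketch that recognises this form. The regular expression covers just the simple `X_[Y-Z]` range shown here, and `parse_pending_array_id` is a hypothetical helper; this is an assumption for discussion, not how the fix is implemented.

```python
import re

# Matches only the simple range form shown above, e.g. "1_[1-5]".
PENDING_ARRAY_ID = re.compile(r"^(?P<job>\d+)_\[(?P<first>\d+)-(?P<last>\d+)\]$")


def parse_pending_array_id(job_id):
    """Return (job, first_task, last_task), or None if job_id is not of that form."""
    match = PENDING_ARRAY_ID.match(job_id)
    if match is None:
        return None
    return int(match.group("job")), int(match.group("first")), int(match.group("last"))


if __name__ == "__main__":
    print(parse_pending_array_id("1_[1-5]"))    # (1, 1, 5)
    print(parse_pending_array_id("9575212_3"))  # None
```

An ID such as `9575212_3`, which names a single array task, does not match and yields `None`.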

neilmunday commented 1 week ago

Issue fixed in release 4.21.

Thanks again for reporting the issue.

mamiller615 commented 1 week ago

Thanks for the quick response!

neilmunday commented 1 week ago

Many thanks for the sponsorship - that's my first one ever!