natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

Fails with Slurm 18.08.8 #32

Closed: pdblood closed this issue 4 years ago

pdblood commented 5 years ago

Testing with the drmaa-run utility, I find that slurm-drmaa fails with the 18.08.8 release of Slurm, but the exact same procedure works fine with 18.08.7. With 18.08.8 it fails at the job run step:

E #2af1 [     0.77]  * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [     0.77] -> slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [     0.77]  * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error

Corresponding to this part of the drmaa-run code:

        /* run */
        if (api.run_job(jobid, sizeof(jobid) - 1, jt, errbuf, sizeof(errbuf) - 1) != DRMAA_ERRNO_SUCCESS) {
                fsd_log_fatal(("Failed to submit a job: %s ", errbuf));
                exit(2); /* TODO exception */
        }

Slurm 18.08.8 addresses a security vulnerability that exists in prior versions of Slurm.

natefoo commented 4 years ago

I've just tried reproducing this with 18.08.8 and it worked for me. Can you include the debug log leading up to the exception?

natefoo commented 4 years ago

Never mind, I see I have a bit more in email from you:

d #2af1 [     0.00]  * # Setting defaults for tasks and processors
d #2af1 [     0.00]  * # Native specification: -A pscstaff -p RM-small
t #2af1 [     0.00] -> slurmdrmaa_parse_native
d #2af1 [     0.00]  * # account = pscstaff
d #2af1 [     0.00]  * # partition = RM-small
d #2af1 [     0.00]  * finalizing job constraints
d #2af1 [     0.00]  * set min_cpus to ntasks: 1
t #2af1 [     0.00] <- slurmdrmaa_parse_native
E #2af1 [     0.77]  * fsd_exc_new(1001,slurm_submit_batch_job error (-1): Unspecified error,1)
t #2af1 [     0.77] -> slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- slurmdrmaa_free_job_desc
t #2af1 [     0.77] <- drmaa_run_job=1: slurm_submit_batch_job error (-1): Unspecified error
F #2af1 [     0.77]  * Failed to submit a job: slurm_submit_batch_job error (-1): Unspecified error

This could be an issue with the native spec; I'll have a look at that.
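
For readers following the log, the native specification `-A pscstaff -p RM-small` ends up as an account and a partition. The snippet below is only a minimal sketch of that kind of parsing, written to illustrate the idea; it is not the code of `slurmdrmaa_parse_native`, and the names `native_opts`, `OPT_LEN`, and `parse_native_sketch` are hypothetical. It assumes only the short flags `-A` and `-p`, separated by whitespace.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define OPT_LEN 64

/* Hypothetical container for the two options seen in the log. */
struct native_opts {
    char account[OPT_LEN];
    char partition[OPT_LEN];
};

/* Parse whitespace-separated "-A <account> -p <partition>" tokens.
 * Returns 0 on success, -1 on an unknown flag, a flag without a
 * value, or an input that does not fit the local buffer. */
int parse_native_sketch(const char *spec, struct native_opts *out)
{
    char buf[256];
    char *tok;

    assert(spec != NULL && out != NULL);
    out->account[0] = '\0';
    out->partition[0] = '\0';

    if (strlen(spec) >= sizeof(buf))
        return -1;
    strcpy(buf, spec); /* strtok modifies its input, so work on a copy */

    for (tok = strtok(buf, " \t"); tok != NULL; tok = strtok(NULL, " \t")) {
        char *dest;
        char *val;

        if (strcmp(tok, "-A") == 0)
            dest = out->account;
        else if (strcmp(tok, "-p") == 0)
            dest = out->partition;
        else
            return -1; /* flag not handled by this sketch */

        val = strtok(NULL, " \t");
        if (val == NULL)
            return -1; /* flag given without a value */
        snprintf(dest, OPT_LEN, "%s", val); /* truncates overly long values */
    }
    return 0;
}
```

Calling `parse_native_sketch("-A pscstaff -p RM-small", &opts)` fills `opts.account` and `opts.partition` with the same values the debug log reports. Nothing in such a spec sets a job name, which is worth keeping in mind when a site requires one.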

pdblood commented 4 years ago

It turns out this error was caused by a site configuration that requires a job name to be specified. For jobs submitted via sbatch, the name of the script is used when no job name is specified. Once the admin changed job_script.lua to handle nil values for the job name, the tests with drmaa-run started working with Slurm 18.08.8. This did not fix my related issue with submitting jobs from Galaxy using slurm-drmaa, but drmaa-run now works as expected with Slurm 18.08.8.
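
On a site with such a requirement, a client can also avoid the problem by always supplying a job name. The following is a minimal sketch of a fallback that mirrors the sbatch behaviour described above (use the script's base name when no name is given). The helper `job_name_or_default` is hypothetical and is not part of slurm-drmaa or drmaa-run.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return job_name if it is non-empty, otherwise
 * the base name of script_path (the part after the last '/').
 * The returned pointer aliases one of the arguments; nothing is copied. */
const char *job_name_or_default(const char *job_name, const char *script_path)
{
    const char *slash;

    assert(script_path != NULL);

    if (job_name != NULL && job_name[0] != '\0')
        return job_name;

    slash = strrchr(script_path, '/');
    return (slash != NULL) ? slash + 1 : script_path;
}
```

With the DRMAA C bindings, the resulting string would typically be set on the job template through `drmaa_set_attribute` with the `DRMAA_JOB_NAME` attribute before the job is run, so the submission never reaches the site's Lua script with a nil name.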

pdblood commented 4 years ago

Closing this issue, since the failure appears to have been caused by a specific configuration detail in job_script.lua on the system running Slurm 18.08.8. That detail differed from the system where I tested Slurm 18.08.7, which led me to believe there was an incompatibility with Slurm 18.08.8. After further testing with drmaa-run, Slurm 18.08.8 appears to work as expected.

natefoo commented 4 years ago

Thanks for the update. I'd tried with Python drmaa and couldn't get it to fail, so it's good to know what the issue was.