
slurm errors #1

Open tazend opened 1 year ago

tazend commented 1 year ago

@tazend, many thanks for this and the ingenious bug-finding....

First of all, my test and CORE clusters are set up completely identically. The only difference between the two is that the test cluster consists of real machines and the CORE cluster of cloud instances.

Whether this can really be ignored is the question here, which I am still trying to answer through error analysis. On the CORE system I see the following error messages from Slurm:

slurmctld:

[2023-02-13T14:28:34.802] JobId=370421 nhosts:1 ncpus:1 node_req:64000 nodes=CompNode01
[2023-02-13T14:28:34.802] Node[0]:
[2023-02-13T14:28:34.802] Mem(MB):15998:0 Sockets:1 Cores:6 CPUs:6:0
[2023-02-13T14:28:34.802] Socket[0] Core[0] is allocated
[2023-02-13T14:28:34.802] Socket[0] Core[1] is allocated
[2023-02-13T14:28:34.802] Socket[0] Core[2] is allocated
[2023-02-13T14:28:34.802] Socket[0] Core[3] is allocated
[2023-02-13T14:28:34.802] Socket[0] Core[4] is allocated
[2023-02-13T14:28:34.802] Socket[0] Core[5] is allocated
[2023-02-13T14:28:34.802] --------------------
[2023-02-13T14:28:34.802] cpu_array_value[0]:6 reps:1
[2023-02-13T14:28:34.802] ====================
[2023-02-13T14:28:34.803] sched/backfill: _start_job: Started JobId=370421 in Artificial on CompNode01
[2023-02-13T14:28:34.910] _slurm_rpc_requeue: Requeue of JobId=370421 returned an error: Only batch jobs are accepted or processed
[2023-02-13T14:28:34.914] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=370421 uid 0
[2023-02-13T14:28:34.915] job_signal: 9 of running JobId=370421 successful 0x8004
[2023-02-13T14:28:35.917] _slurm_rpc_complete_job_allocation: JobId=370421 error Job/step already completing or completed

slurmd:

[370420.extern] fatal: Could not create domain socket: Operation not permitted
[2023-02-13T14:13:12.412] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2023-02-13T14:13:12.417] Could not launch job 370420 and not able to requeue it, cancelling job

With this, the slurmd process aborts and reports back to the slurmctld that the job cannot be executed, and I find absolutely no explanation for it. On both sides, slurmctld and slurmd, I only see the "Unauthorized credential for client ..." message. How did you solve the problem in the end? With the flag under munge, or rather under Slurm? Best regards from Berlin

Z. Matthias

Originally posted by @ZXRobotum in https://github.com/dun/munge/issues/130#issuecomment-1430150026

tazend commented 1 year ago

Hi @ZXRobotum

I just created this new issue in a personal repo so we don't further hijack the issue in the munge repository with an error that is probably unrelated to munge; we can discuss it here.

Looking at the logs, it becomes more evident that the Unauthorized credential ... message is not the real problem in your case, especially if you use Slurm 22.05 onwards. As mentioned, that message is just emitted by munged because of Slurm's safety check to see whether munge was configured to allow root to decode any credential.

There is no direct fix for this log message itself, unless the Slurm devs choose to change the code that triggers the message in munged.
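As an aside, if you want to rule out munge itself, a quick sanity check is to encode a credential on one host and decode it on another. This is just a rough sketch, using CompNode01 from your logs as the example node and assuming the munge CLI tools are installed everywhere:

    # create a credential with no payload and decode it locally
    munge -n | unmunge

    # encode on the controller, decode on a compute node
    munge -n | ssh CompNode01 unmunge

If both report STATUS: Success, munge itself is working across the nodes.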

So, looking at the logs, the problem in your case is that you are not able to launch any jobs at all?

ZXRobotum commented 1 year ago

Hello @tazend,

Yes, I think it's great that we're going to continue exploring this here. And yes, I find this really fascinating because, as already said, my test cluster runs flawlessly with identical compile options, source code, etc.

Yes, you saw that right: my official cluster currently can't execute jobs, and that is of course rather bad, because the R-Studio and A.I. users are once again urgently waiting for this service...

What could I try? What do you think would be useful? I have also already posted this problem to the general Slurm mailing list... How can we exchange more on this? Best regards from Berlin

Z. Matthias

tazend commented 1 year ago

Hi @ZXRobotum

Judging from the logs again, it looks like the users are using srun and want to start their programs interactively on the compute nodes, right? The slurmd output also mentions an extern step, which somehow fails with Operation not permitted. I assume you have something set for PrologFlags in the Slurm config, for example PrologFlags=contain? (You can check with scontrol show conf | grep -i prologflags.) If something is set for PrologFlags, I'd try changing the Slurm config to remove anything set for PrologFlags and test whether that works.
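Roughly like this, as a sketch (the path /etc/slurm/slurm.conf is just the common default, yours may differ):

    # see what is currently configured
    scontrol show conf | grep -i prologflags

    # in /etc/slurm/slurm.conf, comment out or remove the PrologFlags line, e.g.
    #PrologFlags=Contain

    # to be safe, restart the daemons afterwards instead of relying on
    # "scontrol reconfigure" alone (assuming systemd units)
    systemctl restart slurmctld   # on the controller
    systemctl restart slurmd      # on each compute node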

You can also try to start slurmd manually as root in debug mode to perhaps get more verbose output. You can do that with slurmd -D -vvvv, which makes slurmd run in the foreground and print everything to the terminal.
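For example, on the affected compute node (assuming slurmd runs as a systemd service there):

    # stop the service so the manual instance can bind its ports
    systemctl stop slurmd

    # run slurmd in the foreground with maximum verbosity
    slurmd -D -vvvv

    # stop it with Ctrl+C when done and bring the service back up
    systemctl start slurmd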

ZXRobotum commented 1 year ago

Hi @tazend, Thank you very much for the help and enquiry.

I also suspected the "PrologFlags" and have already set up my "slurm.conf" in a really minimalist way. Now I have these two prolog flags in the system: Alloc,Contain

Everything I try at the moment is stuck with this error:

slurmd: debug2: Start processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug2: Processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: CPU_BIND: _convert_job_mem: Memory extracted from credential for StepId=370437.extern job_mem_limit= 15998
slurmd: debug3: _spawn_prolog_stepd: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank 0 (AI-CompNode01), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
slurmd: debug3: _spawn_prolog_stepd: return from _forkexec_slurmstepd -1
slurmd: Could not launch job 370437 and not able to requeue it, cancelling job
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job: uid = 403 JobId=370437
slurmd: debug2: Finish processing RPC: REQUEST_LAUNCH_PROLOG
slurmd: debug: credential for job 370437 revoked
slurmd: debug2: No steps in jobid 370437 to send signal 18
slurmd: debug2: No steps in jobid 370437 to send signal 15
slurmd: debug4: sent ALREADY_COMPLETE
slurmd: debug2: set revoke expiration for jobid 370437 to 1676812821 UTS
slurmd: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB

I have also started the slurmd process by hand, as you recommended. The loading of all the individual modules beforehand runs flawlessly and without errors. I really haven't discovered anything here that could be wrong.

In the "slurmd.log" I only see the following with "DebugLevel=9":

[2023-02-19T13:36:59.349] [370440.extern] fatal: Could not create domain socket: Operation not permitted
[2023-02-19T13:36:59.356] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable
[2023-02-19T13:36:59.356] debug3: _spawn_prolog_stepd: return from _forkexec_slurmstepd -1
[2023-02-19T13:36:59.360] Could not launch job 370440 and not able to requeue it, cancelling job

I really don't know where else to start. Another operating system? Not Debian but Ubuntu? No idea... Best regards from Berlin

Z. Matthias

tazend commented 1 year ago

Hi @ZXRobotum

My guess is that it has something to do with the "Alloc,Contain" prolog flags. Maybe you could try to remove the PrologFlags altogether from the slurm.conf and then try to run a job? Just to see if that works.

ZXRobotum commented 1 year ago

Hello @tazend,

I would like to thank you again for your help and support, and also apologise again for taking so long to reply.

I took your suggestions on board and tested them thoroughly, but that wasn't the cause. So I continued to search and sank into the depths of the system...

In the end, the problem was caused by SSH itself in connection with one of the latest Linux updates. SSH has had a sub-process and some other changes for some time, which caused slurmstepd to abort in certain constellations. I was able to fix the whole thing, and now the service/cluster runs perfectly again.

Kind regards

Z. Matthias