uwefladrich / scriptengine-tasks-hpc

ScriptEngine Task set for HPC systems
GNU General Public License v3.0

Unresolved reference to se command (without full path) in sbatch task #31

Open jmrgonza opened 1 week ago

jmrgonza commented 1 week ago

Hi, I ran into a problem executing the slurm.sbatch task on ECMWF's hpc-2020 platform. This is the error message I get:

2024-10-21 12:35:13 INFO [se.task:hpc.slurm.sbatch <cbc31b40cd>] Submitting job to SLURM queue
[ECMWF-ERROR-sbatch] - [Errno 2] No such file or directory: 'se'
2024-10-21 12:35:13 ERROR [se.task:hpc.slurm.sbatch <cbc31b40cd>] SLURM sbatch error: Command '['sbatch', '--parsable', '--export', 'ALL', '--qos', 'np', '--time', '5', '--output', 'U48F.log', '--job-name', 'ECE4_U48F', '--ntasks', '128', '--nodes', '1', '--', 'se', 'user-config-example.yml', '../platforms/ecmwf-hpc2020-intel+openmpi.yml', 'experiment-config-example.yml', 'scriptlib/main.yml']' returned non-zero exit status 1.
2024-10-21 12:35:13 ERROR [se.instance.simplescriptengine] STOPPING SimpleScriptEngine due to task error in hpc.slurm.sbatch <cbc31b40cd>
2024-10-21 12:35:13 ERROR [se.instance.simplescriptengine] For more debugging info, re-run with loglevel DEBUG

I think the problem was introduced after #29 was merged. I reverted those changes in my fork and the error disappeared.

There is a related issue in the ec-earth gitlab portal (login required): https://git.smhi.se/ec-earth/ecearth4/-/issues/82

uwefladrich commented 1 week ago

Hi @jmrgonza,

This will be a bit difficult for me because I cannot easily reproduce the issue, but let's try. As a first attempt, could you try this modification:

diff --git a/src/hpctasks/slurm.py b/src/hpctasks/slurm.py
index 30309db..b2ca0fe 100644
--- a/src/hpctasks/slurm.py
+++ b/src/hpctasks/slurm.py
@@ -68,7 +68,7 @@ class Sbatch(Task):

         sbatch_cmd_line.append("--")  # make sure further opts go to se command

-        sbatch_cmd_line.append("se")
+        sbatch_cmd_line.append(sys.argv[0])

         scripts = self.getarg("scripts", context, default=None)
         if scripts:

jmrgonza commented 1 week ago

Your fix worked for me.

uwefladrich commented 6 days ago

Okay, good to hear! There is a certain logic to the change. The open question is whether it breaks things in other configurations, and I suspect it could.

Could you please extend the test as follows

diff --git a/src/hpctasks/slurm.py b/src/hpctasks/slurm.py
index 30309db..833e0b5 100644
--- a/src/hpctasks/slurm.py
+++ b/src/hpctasks/slurm.py
@@ -68,7 +68,8 @@ class Sbatch(Task):

         sbatch_cmd_line.append("--")  # make sure further opts go to se command

-        sbatch_cmd_line.append("se")
+        sbatch_cmd_line.append(sys.argv[0])
+        self.log_debug(f"Actual se command: {sys.argv[0]}")

         scripts = self.getarg("scripts", context, default=None)
         if scripts:

and report what the log line says on your system?

jmrgonza commented 6 days ago

This is what the log says:

2024-10-29 14:06:25 DEBUG [se.task:hpc.slurm.sbatch <d016456683>] Actual se command: /perm/spk/virtual-envs/ece4-new/bin/se

I executed scriptengine without the full path:

se --loglevel=debug user-config-example.yml ../platforms/ecmwf-hpc2020-intel+openmpi.yml experiment-config-example.yml scriptlib/main.yml

uwefladrich commented 6 days ago

Okay, thanks! Need to think about it and also test the change in other setups...
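
One possible way to guard against other setups, where sys.argv[0] might not point at an executable file, is sketched below. This is only an illustration and not the project's implementation: the helper name resolve_se_command and its fallback order are assumptions made for this example. It prefers sys.argv[0] when that is an executable file, then searches PATH, and finally falls back to the bare command name, which was the behaviour before the change. Note also that using sys.argv[0] in slurm.py requires `import sys` in that module.

```python
import os
import shutil
import sys


def resolve_se_command(argv0=None, fallback="se"):
    """Return a command for `se` to pass to sbatch (hypothetical helper).

    Order of preference (an assumption, not confirmed by the project):
      1. the script that started this process, if it is an executable file
      2. the first match for `fallback` on PATH
      3. the bare `fallback` name, as before the change
    """
    if argv0 is None:
        argv0 = sys.argv[0]

    # Case 1: sys.argv[0] is an existing, executable file.
    if argv0 and os.path.isfile(argv0) and os.access(argv0, os.X_OK):
        return os.path.abspath(argv0)

    # Case 2: look the command up on PATH.
    found = shutil.which(fallback)
    if found:
        return found

    # Case 3: nothing better is known, keep the bare name.
    return fallback
```

In the Sbatch task, the modified line would then read `sbatch_cmd_line.append(resolve_se_command())`, assuming the helper is defined or imported in slurm.py.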