interTwin-eu / interlink-slurm-plugin


Long Json passed in env variable breaks singularity call #27

Closed. antoinetran closed this issue 4 days ago.

antoinetran commented 1 week ago

Short Description of the issue

Running an Argo workflow with a pod step whose affinity targets the InterLink Slurm node breaks with this error:

XXX:~/.interlink$ kubectl -n argo-workflows logs XXX  -f -c init
FATAL:   While checking container encryption: could not open image /home/username/.interlink/ARGO_PROGRESS_PATCH_TICK_DURATION=1m0s,ARGO_PROGRESS_FILE_TICK_DURATION=3s,ARGO_POD_NAME=containerXXXname,ARGO_CONTAINER_NAME=init,ARGO_TEMPLATE=inputs:parameters:[name:input_data,ARGO_INCLUDE_SCRIPT_OUTPUT=false,ARGO_DEADLINE=0001-01-01T00:00:00Z,ARGO_PROGRESS_FILE=/var/run/argo/progress,ARGO_POD_UID=6c9f1e72-862c-4795-ae25-930919fd9438,GODEBUG=x509ignoreCN=0,ARGO_WORKFLOW_NAME=globalfitmock-lsdsharedfs-template-n7ch4,ARGO_WORKFLOW_UID=8b0c0b74-bf78-4916-babc-42ae4ffc6899,ARGO_NODE_ID=containerYYYYYname: failed to retrieve path for /home/username/.interlink/ARGO_PROGRESS_PATCH_TICK_DURATION=1m0s,ARGO_PROGRESS_FILE_TICK_DURATION=3s,ARGO_POD_NAME=containerXXXname,ARGO_CONTAINER_NAME=init,ARGO_TEMPLATE=inputs:parameters:[name:input_data,ARGO_INCLUDE_SCRIPT_OUTPUT=false,ARGO_DEADLINE=0001-01-01T00:00:00Z,ARGO_PROGRESS_FILE=/var/run/argo/progress,ARGO_POD_UID=6c9f1e72-862c-4795-ae25-930919fd9438,GODEBUG=x509ignoreCN=0,ARGO_WORKFLOW_NAME=globalfitmock-lsdsharedfs-template-n7ch4,ARGO_WORKFLOW_UID=8b0c0b74-bf78-4916-babc-42ae4ffc6899,ARGO_NODE_ID=containerYYYYYname: lstat /home/username/.interlink/ARGO_PROGRESS_PATCH_TICK_DURATION=1m0s,ARGO_PROGRESS_FILE_TICK_DURATION=3s,ARGO_POD_NAME=containerXXXname,ARGO_CONTAINER_NAME=init,ARGO_TEMPLATE=inputs:parameters:[name:input_data,ARGO_INCLUDE_SCRIPT_OUTPUT=false,ARGO_DEADLINE=0001-01-01T00:00:00Z,ARGO_PROGRESS_FILE=: file name too long

Environment

Steps to reproduce

Run an Argo workflow and place a helloworld step on the InterLink Slurm node via taint and affinity:

        "affinity": {
          "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
              {
                "weight": 1,
                "preference": {
                  "matchExpressions": [
                    {
                      "key": "kubernetes.io/hostname",
                      "operator": "In",
                      "values": [
                        "interlink-slurm"
                      ]
                    }
                  ]
                }
              }
            ]
          }
        }

Logs, stacktrace, or other symptoms

The Argo Workflow pod spec when editing it:

  - command:
    - argoexec
    - init
    - --loglevel
    - info
    - --log-format
    - text
    env:
    - name: ARGO_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: ARGO_POD_UID
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.uid
    - name: GODEBUG
      value: x509ignoreCN=0
    - name: ARGO_WORKFLOW_NAME
      value: XXXXXXXXXXXXXx
    - name: ARGO_WORKFLOW_UID
      value: 8b0c0b74-bf78-4916-babc-42ae4ffc6899
    - name: ARGO_CONTAINER_NAME
      value: init
    - name: ARGO_TEMPLATE
      value: '{"name":"group-block","inputs":{"parameters":[{"name":"input_data","value":"stub1_output_preprocessed_signal.h5"},{"name":"pipeline_run_id","value":"XXXXXXXXXXXXXXXXX"},{"name":"block_label","value":"282"},{"name":"group_label","value":"stub1"},{"name":"block_static_configuration_reference","value":"dummy_config.json"},{"name":"workdir","value":"/work"},{"name":"bus_uri","value":"nats://my-nats.nats:4222"},{"name":"n_steps_per_iteration","value":"1000"},{"name":"current_iteration","value":"0"}]},"outputs":{},"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":1,"preference":{"matchExpressions":[{"key":"kubernetes.io/hostname","operator":"In","values":["interlink-slurm"]}]}}]}},"metadata":{},"container":{"name":"","image":"imaaaaaaaaaagge","command":["XXXXXXXXXXX"],"args":["--workdir=/work","--pipeline-run-id=XXXXXXXXXXXXXXXX","--servers=nats://my-nats.nats:4222","start-block","--input-data=stub1_output_preprocessed_signal.h5","--input-group-configuration-path=/work/stub1_group_configuration.json","--block-label=282","--group-label=stub1","--n-steps=1000","--current-iteration=0","--output-checkpoint=/work/XXXXXXXXXXXX"],"resources":{"limits":{"cpu":"500m","memory":"256Mi"},"requests":{"cpu":"500m","memory":"256Mi"}},"volumeMounts":[{"name":"working-volume","mountPath":"/work"}]},"tolerations":[{"key":"virtual-node.interlink/no-schedule","operator":"Exists"}]}'
    - name: ARGO_NODE_ID
      value: XXXXXXXXXXXXXXXXXXX
    - name: ARGO_INCLUDE_SCRIPT_OUTPUT
      value: "false"
    - name: ARGO_DEADLINE
      value: "0001-01-01T00:00:00Z"
    - name: ARGO_PROGRESS_FILE
      value: /var/run/argo/progress
    - name: ARGO_PROGRESS_PATCH_TICK_DURATION
      value: 1m0s
    - name: ARGO_PROGRESS_FILE_TICK_DURATION
      value: 3s
    image: quay.io/argoproj/argoexec:v3.5.4
    imagePullPolicy: Always
    name: init
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 50m
        memory: 16Mi

The job.sh runs 3 singularity containers, because Argo creates an init, a wait, and a main container for each pod. Here is an extract for the init container:

singularity exec --containall --nv   --env ARGO_CONTAINER_NAME=init,ARGO_TEMPLATE={"name":"group-block","inputs":{"parameters":[{"name":"input_data","value":"stub2_output_preprocessed_signal.h5"},{"name":"pipeline_run_id","value":"XXXXXXXXXXXXX"},{"name":"block_label","value":"282"},{"name":"group_label","value":"stub2"},{"name":"block_static_configuration_reference","value":"dummy_config.json"},{"name":"workdir","value":"/work"},{"name":"bus_uri","value":"nats://my-nats.nats:4222"},{"name":"n_steps_per_iteration","value":"1000"},{"name":"current_iteration","value":"0"}]},"outputs":{},"affinity":{"nodeAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"weight":1,"preference":{"matchExpressions":[{"key":"kubernetes.io/hostname","operator":"In","values":["interlink-slurm"]}]}}]}},"metadata":{},"container":{"name":"","image":"XXXXXXXXXXXXX","command":["lgfk-cli"],"args":["--workdir=/work","--pipeline-run-id=XXXXXXXXXX","--servers=nats://my-nats.nats:4222","start-block","--input-data=XXXXXXXXXXX","--input-group-configuration-path=/work/stub2_group_configuration.json","--block-label=282","--group-label=stub2","--n-steps=1000","--current-iteration=0","--output-checkpoint=/work/XXXXXXXXXXXXX.dump"],"resources":{"limits":{"cpu":"500m","memory":"256Mi"},"requests":{"cpu":"500m","memory":"256Mi"}},"volumeMounts":[{"name":"working-volume","mountPath":"/work"}]},"tolerations":[{"key":"virtual-node.interlink/no-schedule","operator":"Exists"}]},ARGO_INCLUDE_SCRIPT_OUTPUT=false,ARGO_PROGRESS_PATCH_TICK_DURATION=1m0s,ARGO_PROGRESS_FILE_TICK_DURATION=3s,ARGO_POD_UID=216626ae-df79-441d-a58a-9a1f657ecee6,GODEBUG=x509ignoreCN=0,ARGO_WORKFLOW_NAME=XXXXXXXXXXXXXX,ARGO_DEADLINE=0001-01-01T00:00:00Z,ARGO_PROGRESS_FILE=/var/run/argo/progress,ARGO_POD_NAME=XXXXXXXXXXXXXn7ch4-group-block-4006580661,ARGO_WORKFLOW_UID=8b0c0b74-bf78-4916-babc-42ae4ffc6899,ARGO_NODE_ID=XXXXXXXXXXXXn7ch4-4006580661  --bind /home/XXXX/.interlink/argo-workflows-216626ae-df79-441d-a58a-9a1f657ecee6/emptyDirs/var-run-argo:/var/run/argo:rw  --bind /home/XXXXXX/.interlink/argo-workflows-216626ae-df79-441d-a58a-9a1f657ecee6/command_init.sh:/tmp/command_init.sh --bind /home/XXXX/.interlink/argo-workflows-216626ae-df79-441d-a58a-9a1f657ecee6/args_init.sh:/tmp/args_init.sh quay.io/argoproj/argoexec:v3.5.4 /bin/sh /tmp/command_init.sh &> /home/XXXXXXXXX/.interlink/argo-workflows-216626ae-df79-441d-a58a-9a1f657ecee6/init.out; echo $? > /home/XXXX/.interlink/argo-workflows-216626ae-df79-441d-a58a-9a1f657ecee6/init.status

Analysis

Multiple issues here:

  1. the command line is too long: it exceeds the 4096-character Linux limit, resulting in the file name too long error
  2. the ARGO_TEMPLATE environment variable contains the JSON of the Kubernetes pod to be run by Argo. When the InterLink Slurm plugin runs singularity, the double quotes are evaluated by bash and lost
  3. the singularity --env option (https://docs.sylabs.io/guides/3.7/user-guide/environment_and_metadata.html#env-option) relies on "," as a separator, and ARGO_TEMPLATE contains many "," characters because it is JSON, resulting in bad variable parsing (see the sketch below)
  4. less important: Argo refers to the docker image as quay.io/..., but singularity will fail if this is not prefixed by docker:// => out of scope of this ticket, moved to https://github.com/interTwin-eu/interlink-slurm-plugin/issues/30
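
The comma splitting alone is enough to mangle ARGO_TEMPLATE. A minimal Go sketch (illustrative only, not plugin code) of what any comma-separated parser does to such a value:

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // Shortened version of the --env string built by the plugin; the
        // JSON value of ARGO_TEMPLATE itself contains commas.
        envArg := `ARGO_CONTAINER_NAME=init,ARGO_TEMPLATE={"name":"group-block","inputs":{}},ARGO_DEADLINE=0001-01-01T00:00:00Z`

        // Splitting on "," the way a comma-separated parser must:
        for _, kv := range strings.Split(envArg, ",") {
            fmt.Println(kv)
        }
        // ARGO_TEMPLATE is cut at the first comma inside its JSON value,
        // and leftover fragments such as `"inputs":{}}` are not valid
        // KEY=VALUE pairs at all.
    }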

Summary of proposed changes

antoinetran commented 1 week ago

Suggested solution: rework the plugin to pass all environment variables either through an env-file (https://docs.sylabs.io/guides/3.7/user-guide/environment_and_metadata.html#env-file-option), or as env variables prefixed with SINGULARITYENV_ (https://docs.sylabs.io/guides/3.7/user-guide/environment_and_metadata.html#singularityenv-prefix).

Both worked in my test; however, both require some form of escaping, at least single quotes. But if there is a single quote inside the env variable (we cannot be certain what the user wants to pass), it would fail. To be completely reliable, we should parse the quotes and escape them for each env var, as in the sketch below.
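
For illustration, a minimal Go sketch of such env-file generation, assuming the env-file is shell-interpreted as the linked docs describe; the names writeEnvFile and shellQuote are hypothetical, not the plugin's actual code:

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // shellQuote wraps a value in single quotes and escapes each embedded
    // single quote as '\'' so the shell reads the value back verbatim,
    // commas, double quotes and all.
    func shellQuote(v string) string {
        return "'" + strings.ReplaceAll(v, "'", `'\''`) + "'"
    }

    // writeEnvFile renders one KEY='VALUE' line per variable, in a file
    // that singularity can consume via --env-file. (Hypothetical helper.)
    func writeEnvFile(path string, env map[string]string) error {
        var b strings.Builder
        for k, v := range env {
            fmt.Fprintf(&b, "%s=%s\n", k, shellQuote(v))
        }
        return os.WriteFile(path, []byte(b.String()), 0o600)
    }

    func main() {
        env := map[string]string{
            "ARGO_TEMPLATE": `{"name":"group-block","inputs":{"parameters":[]}}`,
            "TRICKY":        `it's a "mixed" value, with commas`,
        }
        if err := writeEnvFile("/tmp/pod.env", env); err != nil {
            panic(err)
        }
        // The generated job script would then replace the --env flag with:
        //   singularity exec --env-file /tmp/pod.env <image> ...
    }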

antoinetran commented 1 week ago

Testing with the SINGULARITYENV_ prefix results in a warning:

INFO:    Environment variable SINGULARITYENV_ARGO_TEMPLATE is set, but APPTAINERENV_ARGO_TEMPLATE is preferred

We don't know whether singularity or apptainer is installed on the target site, so I recommend using the env-file instead.

dciangot commented 1 week ago

@antoinetran +1 for envfile; regarding escaping, I propose to target it after the envfile fix is in place and working (opening a dedicated issue now).

Do you have a patch that you want to merge already for the envfile?

antoinetran commented 1 week ago

> @antoinetran +1 for envfile; regarding escaping, I propose to target it after the envfile fix is in place and working (opening a dedicated issue now).
>
> Do you have a patch that you want to merge already for the envfile?

No patch yet, I am starting to work on it now. It is my first time programming in Go, so it will take some time :) especially if I want to include unit tests for the weird cases of single and double quotes inside env vars.
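
Table-driven tests would fit those edge cases; a sketch, reusing the hypothetical shellQuote helper from the earlier sketch (copied in so the file stands alone):

    package main

    import (
        "strings"
        "testing"
    )

    // Copy of the shellQuote sketch above, so this test file is
    // self-contained. (Hypothetical helper, not plugin code.)
    func shellQuote(v string) string {
        return "'" + strings.ReplaceAll(v, "'", `'\''`) + "'"
    }

    func TestShellQuote(t *testing.T) {
        cases := []struct{ in, want string }{
            {`plain`, `'plain'`},
            {`with spaces`, `'with spaces'`},
            {`single ' quote`, `'single '\'' quote'`},
            {`double " quote`, `'double " quote'`},
            {`json {"a":"b","c":1}`, `'json {"a":"b","c":1}'`},
        }
        for _, c := range cases {
            if got := shellQuote(c.in); got != c.want {
                t.Errorf("shellQuote(%q) = %q, want %q", c.in, got, c.want)
            }
        }
    }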

dciangot commented 1 week ago

Right, for the moment we could rely on introducing a case into the e2e integration suite. I'll open an issue there.

Executive summary: you can concentrate on making the changes; tests are going to be handled together with interlink core (at least the functional ones).