interTwin-eu / interLink

InterLink aims to provide an abstraction for the execution of a Kubernetes pod on any remote resource capable of managing a Container execution lifecycle
https://intertwin-eu.github.io/interLink/
Apache License 2.0

Using secrets in env results in pod pending forever #263

Closed matbun closed 2 months ago

matbun commented 3 months ago

Short Description of the issue

When using secrets, the pod stays in the PENDING state forever. Removing them, the pod runs correctly.

Environment

InterTwin env on Vega.

Steps to reproduce

Create secrets:

kubectl create secret generic mlflow-server --from-literal=username=XXX --from-literal=password=XXX
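
To double-check that the secret actually exists in the namespace where the pod is created, a standard kubectl query can be used (not part of the original report, shown only for completeness):

kubectl get secret mlflow-server -o yaml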

Pod I am using:

apiVersion: v1
kind: Pod
metadata:
  name: 3dgan-train
  annotations:
    slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --ntasks-per-node=1 --nodes=1 --time=00:55:00"
    slurm-job.vk.io/singularity-mounts: "--bind /ceph/hpc/data/st2301-itwin-users/egarciagarcia:/exp_data"
    # slurm-job.vk.io/pre-exec: "singularity pull /ceph/hpc/data/st2301-itwin-users/itwinai_v9.5.sif docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.4"
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - " cd /usr/src/app && itwinai exec-pipeline --print-config \
          --config $CERN_CODE_ROOT/config.yaml \ 
          --pipe-key training_pipeline \
          -o dataset_location=$CERN_DATA_ROOT \ 
          -o pipeline.init_args.steps.training_step.init_args.exp_root=$TMP_DATA_ROOT \ 
          -o logs_dir=$TMP_DATA_ROOT/ml_logs \ 
          -o distributed_strategy=$STRATEGY \ 
          -o devices=$DEVICES \ 
          -o hw_accelerators=$ACCELERATOR \ 
          -o checkpoints_path=$TMP_DATA_ROOT/checkpoints \
          -o max_samples=$MAX_DATA_SAMPLES \ 
          -o batch_size=$BATCH_SIZE \ 
          -o max_dataset_size=$NUM_WORKERS_DL "
    command:
    - /bin/sh
    - -c
    env:
    - name: CERN_DATA_ROOT
      value: "/exp_data"
    - name: CERN_CODE_ROOT
      value: "/usr/src/app"
    - name: TMP_DATA_ROOT
      value: "/exp_data"
    - name: MAX_DATA_SAMPLES
      value: "1000"
    - name: BATCH_SIZE
      value: "512"
    - name: NUM_WORKERS_DL
      value: "4"
    - name: ACCELERATOR
      value: "gpu"
    - name: STRATEGY
      value: "auto"
    - name: DEVICES
      value: "auto"

    - name: MLFLOW_TRACKING_USERNAME
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: username
    - name: MLFLOW_TRACKING_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: password

    image: /ceph/hpc/data/st2301-itwin-users/itwinai_v9.5.sif
    imagePullPolicy: Always
    name: 3dgan-container
    resources:
      limits:
        cpu: "48"
        memory: 150Gi
      requests:
        cpu: "4"
        memory: 20Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  nodeSelector:
    kubernetes.io/hostname: vega-new-vk
  tolerations:
  - key: virtual-node.interlink/no-schedule
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

Logs, stacktrace, or other symptoms

NAME          READY   STATUS    RESTARTS   AGE
3dgan-train   0/1     Pending   0          12m
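
For a pod stuck in Pending, the scheduler and kubelet events usually explain why; a sketch of how one could inspect them with standard kubectl commands (these commands are not part of the original report):

kubectl describe pod 3dgan-train
kubectl get events --field-selector involvedObject.name=3dgan-train
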
dciangot commented 3 months ago

@Surax98 any idea on how to tackle this?

Surax98 commented 2 months ago

@matbun the investigation took a while because I had to dig into the Virtual Kubelet repository: the problem seems related to their code more than to the interLink provider itself. Let me explain what's going on, taking this snippet from your pod as an example:

    - name: MLFLOW_TRACKING_USERNAME
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: username
    - name: MLFLOW_TRACKING_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mlflow-server
          key: password

When Secrets or ConfigMaps are used to set environment variables, a specific code path in the Virtual Kubelet package is executed to retrieve the referenced resource. During this phase the resource lookup fails, for a reason still under investigation, probably a permission issue (bad cluster role?). Using ConfigMaps and Secrets as volumes works as expected, so for the moment you can work around the issue by mounting them as volumes (see the sketch below). Oddly, when I use a ClientSet within the interLink provider I can retrieve everything without problems, which makes the failure even more puzzling until I get a clearer understanding of the issue. For reference, you can see the executed Virtual Kubelet code at this link; it is the function used to populate the container's environment variables from both ConfigMaps and Secrets, even though the link points to the Secret case only. Digging into the code takes time, so feel free to help if you want!
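
As a rough illustration of the volume workaround described above, the snippet below mounts the existing mlflow-server Secret as files instead of injecting it through env. The mount path /etc/mlflow-creds and the idea of reading the credentials from files are assumptions made for this example, not taken from the original pod:

spec:
  containers:
  - name: 3dgan-container
    # image, args, resources, etc. as in the original pod
    volumeMounts:
    - name: mlflow-creds
      mountPath: /etc/mlflow-creds   # illustrative path, not from the report
      readOnly: true
  volumes:
  - name: mlflow-creds
    secret:
      secretName: mlflow-server      # the keys "username" and "password" become files

Inside the container the credentials would then be read from /etc/mlflow-creds/username and /etc/mlflow-creds/password instead of the MLFLOW_TRACKING_* environment variables.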

dciangot commented 2 months ago

Merged and fixed