project-akri / akri

A Kubernetes Resource Interface for the Edge
https://docs.akri.sh/
Apache License 2.0

Pods generated for jobs contain two consecutive dashes #447

Closed · jiria closed this issue 2 years ago

jiria commented 2 years ago

**Describe the bug**
Pods generated on behalf of the job broker type contain two consecutive dashes in their names, as if a token were missing.

**Output of `kubectl get pods,akrii,akric -o wide`**

NAME                                    CAPACITY   AGE
configuration.akri.sh/akri-debug-echo   2          5m1s

NAME                                      CONFIG            SHARED   NODES                       AGE
instance.akri.sh/akri-debug-echo-a19705   akri-debug-echo   true     ["jirka-virtual-machine"]   4m45s
instance.akri.sh/akri-debug-echo-8120fe   akri-debug-echo   true     ["jirka-virtual-machine"]   4m46s

NAME                                              READY   STATUS      RESTARTS   AGE
pod/akri-controller-deployment-5465b887d7-jfd7q   1/1     Running     0          5m1s
pod/akri-debug-echo-discovery-daemonset-fplhn     1/1     Running     0          5m1s
pod/akri-agent-daemonset-8d55d                    1/1     Running     0          5m1s
pod/akri-debug-echo-8120fe-1-job--1-khzg7         0/1     Completed   0          4m46s
pod/akri-debug-echo-a19705-1-job--1-48s8q         0/1     Completed   0          4m45s

**Kubernetes Version:**

k3s version v1.22.6+k3s1 (3228d9cb)
go version go1.16.10

**To Reproduce**
Steps to reproduce the behavior:

  1. Create a cluster on top of Ubuntu 20.04 using the latest K3s.
  2. Install Akri with the following Helm command (a quick way to verify the resulting Configuration is sketched after these steps):
    helm install akri akri-helm-charts/akri-dev \
      $AKRI_HELM_CRICTL_CONFIGURATION \
      --set agent.allowDebugEcho=true \
      --set debugEcho.discovery.enabled=true \
      --set debugEcho.configuration.brokerJob.image.repository=busybox \
      --set debugEcho.configuration.brokerJob.command[0]="sh" \
      --set debugEcho.configuration.brokerJob.command[1]="-c" \
      --set debugEcho.configuration.brokerJob.command[2]="echo 'Hello World'" \
      --set debugEcho.configuration.brokerJob.command[3]="sleep 5" \
      --set debugEcho.configuration.enabled=true
  3. Look at pods:
    watch kubectl get akric,akrii,pod  
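
To verify the Helm values landed in the Configuration, the rendered object can be dumped. This is a hedged check, not part of the original report; `akric` is the short name for `configuration.akri.sh` used in the watch above, and the exact field layout of the Configuration spec is assumed, not confirmed here:

    kubectl get akric akri-debug-echo -o yaml
    # The brokerJob settings from the Helm install (busybox image,
    # sh -c command) should appear somewhere under the spec.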

**Expected behavior**
Expected the generated Pod name to be either `akri-debug-echo-a19705-1-job-1-48s8q` or `akri-debug-echo-a19705-1-job-something-1-48s8q`, but not `akri-debug-echo-a19705-1-job--1-48s8q`.

**Logs**
Snippet from the controller log:

[2022-02-18T22:43:11Z INFO  controller::util::instance_action] handle_instance - added or modified Akri Instance Some("akri-debug-echo-8120fe"): InstanceSpec { configuration_name: "akri-debug-echo", broker_properties: {"DEBUG_ECHO_DESCRIPTION": "foo0"}, shared: true, nodes: ["jirka-virtual-machine"], device_usage: {"akri-debug-echo-8120fe-1": "", "akri-debug-echo-8120fe-0": ""} }
[2022-02-18T22:43:11Z TRACE controller::util::instance_action] handle_instance_change - enter Add
[2022-02-18T22:43:11Z TRACE akri_shared::akri::configuration] find_configuration enter
[2022-02-18T22:43:11Z TRACE akri_shared::akri::configuration] find_configuration getting instance with name akri-debug-echo
[2022-02-18T22:43:11Z TRACE akri_shared::akri::configuration] find_configuration return
[2022-02-18T22:43:11Z TRACE controller::util::instance_action] handle_instance_change_job - enter Add
[2022-02-18T22:43:11Z TRACE controller::util::instance_action] handle_instance_change_job - instance added
[2022-02-18T22:43:11Z TRACE akri_shared::k8s::job] create_new_job_from_spec enter
[2022-02-18T22:43:11Z TRACE akri_shared::k8s::job] create_new_job_from_spec return
[2022-02-18T22:43:11Z TRACE akri_shared::k8s::job] create_job enter
[2022-02-18T22:43:11Z TRACE controller::util::pod_watcher] handle_pod - enter [event: Applied(Pod { metadata: ObjectMeta { annotations: None, cluster_name: None, creation_timestamp: Some(Time(2022-02-18T22:43:11Z)), deletion_grace_period_seconds: None, deletion_timestamp: None, finalizers: Some(["batch.kubernetes.io/job-tracking"]), generate_name: Some("akri-debug-echo-8120fe-1-job--1-"), generation: None, labels: Some({"akri.sh/configuration": "akri-debug-echo", "akri.sh/instance": "akri-debug-echo-8120fe", "controller-uid": "9ccc2d33-6e0e-48fa-b295-322e5c519e41", "job-name": "akri-debug-echo-8120fe-1-job"}), managed_fields: Some([ManagedFieldsEntry { api_version: Some("v1"), fields_type: Some("FieldsV1"), fields_v1: Some(FieldsV1(Object({"f:metadata": Object({"f:finalizers": Object({".": Object({}), "v:\"batch.kubernetes.io/job-tracking\"": Object({})}), "f:generateName": Object({}), "f:labels": Object({".": Object({}), "f:akri.sh/configuration": Object({}), "f:akri.sh/instance": Object({}), "f:controller-uid": Object({}), "f:job-name": Object({})}), "f:ownerReferences": Object({".": Object({}), "k:{\"uid\":\"9ccc2d33-6e0e-48fa-b295-322e5c519e41\"}": Object({})})}), "f:spec": Object({"f:containers": Object({"k:{\"name\":\"akri-debug-echo-broker\"}": Object({".": Object({}), "f:command": Object({}), "f:image": Object({}), "f:imagePullPolicy": Object({}), "f:name": Object({}), "f:resources": Object({".": Object({}), "f:limits": Object({".": Object({}), "f:akri.sh/akri-debug-echo-8120fe": Object({}), "f:cpu": Object({}), "f:memory": Object({})}), "f:requests": Object({".": Object({}), "f:akri.sh/akri-debug-echo-8120fe": Object({}), "f:cpu": Object({}), "f:memory": Object({})})}), "f:terminationMessagePath": Object({}), "f:terminationMessagePolicy": Object({})})}), "f:dnsPolicy": Object({}), "f:enableServiceLinks": Object({}), "f:restartPolicy": Object({}), "f:schedulerName": Object({}), "f:securityContext": Object({}), "f:terminationGracePeriodSeconds": Object({})})}))), manager: Some("k3s"), operation: Some("Update"), time: Some(Time(2022-02-18T22:43:11Z)) }]), name: Some("akri-debug-echo-8120fe-1-job--1-khzg7"), namespace: Some("default"), owner_references: Some([OwnerReference { api_version: "batch/v1", block_owner_deletion: Some(true), controller: Some(true), kind: "Job", name: "akri-debug-echo-8120fe-1-job", uid: "9ccc2d33-6e0e-48fa-b295-322e5c519e41" }]), resource_version: Some("1433"), self_link: None, uid: Some("b4b80749-a803-44f1-aace-060ddf9cd86f") }, spec: Some(PodSpec { active_deadline_seconds: None, affinity: None, automount_service_account_token: None, containers: [Container { args: None, command: Some(["sh", "-c", "echo 'Hello World'", "sleep 5"]), env: None, env_from: None, image: Some("busybox:latest"), image_pull_policy: Some("Always"), lifecycle: None, liveness_probe: None, name: "akri-debug-echo-broker", ports: None, readiness_probe: None, resources: Some(ResourceRequirements { limits: Some({"akri.sh/akri-debug-echo-8120fe": Quantity("1"), "cpu": Quantity("29m"), "memory": Quantity("30Mi")}), requests: Some({"akri.sh/akri-debug-echo-8120fe": Quantity("1"), "cpu": Quantity("10m"), "memory": Quantity("10Mi")}) }), security_context: None, startup_probe: None, stdin: None, stdin_once: None, termination_message_path: Some("/dev/termination-log"), termination_message_policy: Some("File"), tty: None, volume_devices: None, volume_mounts: Some([VolumeMount { mount_path: "/var/run/secrets/kubernetes.io/serviceaccount", mount_propagation: None, name: "kube-api-access-vfsmm", 
read_only: Some(true), sub_path: None, sub_path_expr: None }]), working_dir: None }], dns_config: None, dns_policy: Some("ClusterFirst"), enable_service_links: Some(true), ephemeral_containers: None, host_aliases: None, host_ipc: None, host_network: None, host_pid: None, hostname: None, image_pull_secrets: None, init_containers: None, node_name: None, node_selector: None, overhead: None, preemption_policy: Some("PreemptLowerPriority"), priority: Some(0), priority_class_name: None, readiness_gates: None, restart_policy: Some("OnFailure"), runtime_class_name: None, scheduler_name: Some("default-scheduler"), security_context: Some(PodSecurityContext { fs_group: None, run_as_group: None, run_as_non_root: None, run_as_user: None, se_linux_options: None, supplemental_groups: None, sysctls: None, windows_options: None }), service_account: Some("default"), service_account_name: Some("default"), share_process_namespace: None, subdomain: None, termination_grace_period_seconds: Some(30), tolerations: Some([Toleration { effect: Some("NoExecute"), key: Some("node.kubernetes.io/not-ready"), operator: Some("Exists"), toleration_seconds: Some(300), value: None }, Toleration { effect: Some("NoExecute"), key: Some("node.kubernetes.io/unreachable"), operator: Some("Exists"), toleration_seconds: Some(300), value: None }]), topology_spread_constraints: None, volumes: Some([Volume { aws_elastic_block_store: None, azure_disk: None, azure_file: None, cephfs: None, cinder: None, config_map: None, csi: None, downward_api: None, empty_dir: None, fc: None, flex_volume: None, flocker: None, gce_persistent_disk: None, git_repo: None, glusterfs: None, host_path: None, iscsi: None, name: "kube-api-access-vfsmm", nfs: None, persistent_volume_claim: None, photon_persistent_disk: None, portworx_volume: None, projected: Some(ProjectedVolumeSource { default_mode: Some(420), sources: [VolumeProjection { config_map: None, downward_api: None, secret: None, service_account_token: Some(ServiceAccountTokenProjection { audience: None, expiration_seconds: Some(3607), path: "token" }) }, VolumeProjection { config_map: Some(ConfigMapProjection { items: Some([KeyToPath { key: "ca.crt", mode: None, path: "ca.crt" }]), name: Some("kube-root-ca.crt"), optional: None }), downward_api: None, secret: None, service_account_token: None }, VolumeProjection { config_map: None, downward_api: Some(DownwardAPIProjection { items: Some([DownwardAPIVolumeFile { field_ref: Some(ObjectFieldSelector { api_version: Some("v1"), field_path: "metadata.namespace" }), mode: None, path: "namespace", resource_field_ref: None }]) }), secret: None, service_account_token: None }] }), quobyte: None, rbd: None, scale_io: None, secret: None, storageos: None, vsphere_volume: None }]) }), status: Some(PodStatus { conditions: None, container_statuses: None, ephemeral_container_statuses: None, host_ip: None, init_container_statuses: None, message: None, nominated_node_name: None, phase: Some("Pending"), pod_ip: None, pod_ips: None, qos_class: Some("Burstable"), reason: None, start_time: None }) })]
[2022-02-18T22:43:11Z INFO  controller::util::pod_watcher] handle_pod - pod Some("akri-debug-echo-8120fe-1-job--1-khzg7") added or modified
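
Note the `generate_name` field in the Pod metadata above, `Some("akri-debug-echo-8120fe-1-job--1-")`: the double dash is already present in the name stem the Job controller submits, before the API server appends the random suffix. It can be read back directly with standard kubectl JSONPath (a hedged example reusing the pod name from the output above):

    kubectl get pod akri-debug-echo-8120fe-1-job--1-khzg7 \
      -o jsonpath='{.metadata.generateName}{"\n"}'
    # Should print: akri-debug-echo-8120fe-1-job--1-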
kate-goldenring commented 2 years ago

**Reproduce**

I was able to reproduce this by installing the same version of K3s (v1.22.6+k3s1), which is their latest stable release. While the Jobs (which are created by Akri) are named correctly, the Pods spun up by the Kubernetes [Job controller](https://kubernetes.io/docs/concepts/architecture/controller/) are not named as expected. It looks like this K3s version is adding an extra `-1`:

NAME                                              READY   STATUS      RESTARTS   AGE
pod/akri-debug-echo-discovery-daemonset-6ggfk     1/1     Running     0          6m22s
pod/akri-agent-daemonset-qx58w                    1/1     Running     0          6m22s
pod/akri-controller-deployment-5465b887d7-swwbf   1/1     Running     0          6m22s
pod/akri-debug-echo-a19705-1-job--1-ghx6b         0/1     Completed   0          6m1s
pod/akri-debug-echo-8120fe-1-job--1-j72x5         0/1     Completed   0          6m2s

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/akri-debug-echo-8120fe-1-job   1/1           9s         6m2s
job.batch/akri-debug-echo-a19705-1-job   1/1           8s         6m1s
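
Since the Jobs themselves are named correctly, the fault should be reproducible with a plain Job that has no Akri involvement. A minimal sketch, assuming a v1.22.6 cluster; the job name `dash-test` and the busybox image are illustrative, not taken from Akri:

    kubectl apply -f - <<EOF
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: dash-test
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: echo
            image: busybox
            command: ["sh", "-c", "echo 'Hello World'"]
    EOF
    # On v1.22.6 the Pod is expected to show up as dash-test--1-<hash>;
    # on v1.21 and v1.23 it should be dash-test-<hash>.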

I then installed MicroK8s with the same Kubernetes version (v1.22.6-3+7ab10db7034594) and was able to reproduce the same unexpected naming:

NAME                                              READY   STATUS      RESTARTS   AGE
pod/akri-debug-echo-discovery-daemonset-6f5m4     1/1     Running     0          40s
pod/akri-agent-daemonset-bxzgw                    1/1     Running     0          40s
pod/akri-controller-deployment-5465b887d7-g9tvl   1/1     Running     0          40s
pod/akri-debug-echo-a19705-1-job--1-ts6n8         0/1     Completed   0          19s
pod/akri-debug-echo-8120fe-1-job--1-fld7p         0/1     Completed   0          20s

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/akri-debug-echo-a19705-1-job   1/1           11s        19s
job.batch/akri-debug-echo-8120fe-1-job   1/1           13s        20s

**Checking bookend versions**

I was not able to reproduce this with the version of K3s that we run our end-to-end tests on, namely v1.21.5+k3s2. Pods are named as expected (e.g. `akri-debug-echo-8120fe-1-job-khzg7`). Output on a single-node v1.21.5+k3s2 cluster:

NAME                                              READY   STATUS      RESTARTS   AGE
pod/akri-debug-echo-discovery-daemonset-p9x2f     1/1     Running     0          52s
pod/akri-controller-deployment-5465b887d7-xmgx2   1/1     Running     0          52s
pod/akri-agent-daemonset-frg65                    1/1     Running     0          52s
pod/akri-debug-echo-a19705-1-job-brz64            0/1     Completed   0          37s
pod/akri-debug-echo-8120fe-1-job-nn5mr            0/1     Completed   0          37s

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/akri-debug-echo-a19705-1-job   1/1           13s        37s
job.batch/akri-debug-echo-8120fe-1-job   1/1           14s        37s

I then tried out MicroK8s 1.23 (v1.23.3-2+d441060727c463) and got the expected non-buggy behavior:

NAME                                             READY   STATUS      RESTARTS   AGE
pod/akri-agent-daemonset-49m5z                   1/1     Running     0          30s
pod/akri-controller-deployment-7675bdddf-7jnwj   1/1     Running     0          30s
pod/akri-debug-echo-discovery-daemonset-t46tk    1/1     Running     0          30s
pod/akri-debug-echo-8120fe-1-job-2w2bd           0/1     Completed   0          13s
pod/akri-debug-echo-a19705-1-job-djq7j           0/1     Completed   0          14s

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/akri-debug-echo-a19705-1-job   1/1           8s         14s
job.batch/akri-debug-echo-8120fe-1-job   1/1           7s         13s

I also tried out the latest K3s version v1.23.3+k3s1 and got the expected non-buggy behavior:

NAME                                             READY   STATUS      RESTARTS   AGE
pod/akri-debug-echo-discovery-daemonset-pcnks    1/1     Running     0          53s
pod/akri-agent-daemonset-58sts                   1/1     Running     0          53s
pod/akri-controller-deployment-7675bdddf-mw8vg   1/1     Running     0          53s
pod/akri-debug-echo-8120fe-1-job-dg5p2           0/1     Completed   0          28s
pod/akri-debug-echo-a19705-1-job-7n2kh           0/1     Completed   0          29s

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/akri-debug-echo-8120fe-1-job   1/1           7s         28s
job.batch/akri-debug-echo-a19705-1-job   1/1           8s         29s

**Conclusion**

It looks like this is a bug in Kubernetes 1.22.6: the Jobs are correct, but the Pods the upstream Job controller creates for them are misnamed, and later versions have fixed it. It might be interesting to see whether this was a known upstream issue; the `batch.kubernetes.io/job-tracking` finalizer in the Pod metadata above suggests the then-new Job tracking feature was active on the affected version, which may be related.

I think it is safe to close this bug against Akri because (1) Akri names only the Jobs, which are named appropriately, not the Pods; (2) naming behaves as expected in the versions before and after 1.22; and (3) this naming quirk does not affect Akri's behavior.

romoh commented 2 years ago

Closing. See Kate's earlier comment.