virtual-kubelet / azure-aci

Things related to Azure Container Instances for Virtual Kubelet
Apache License 2.0

I notice some confusion around the creation of Pods #377

Closed eugen-nw closed 1 year ago

eugen-nw commented 1 year ago

Describe the Issue
After deployments, kubectl get pods displays the Pods indefinitely in the 'Pending' state. In the Azure Portal, ACI shows them 'Running' after a few minutes, and I do see activity being logged there. The Containers do produce the work we expect them to do.

Steps To Reproduce
Windows containers, running on the proper --set nodeOsType="Windows" VK

Expected behavior
To display the Pods as Running once they're started up.

Virtual-kubelet version
helm-chart/aks/virtual-kubelet-azure-aci/1.4.7

azure-aci plugin version
Not certain what this is or how to obtain the information.

Kubernetes version
AKS 1.21.9

Additional context
I've used the VK since at least 2019 but have never had this experience before.

helayoty commented 1 year ago

@eugen-nw Thanks for contacting us, would you please share with us a sample for your pod spec and the virtual kubelet log?

eugen-nw commented 1 year ago

What is a pod spec, the Dockerfile?

Could you please explain how I can obtain the virtual kubelet log?

helayoty commented 1 year ago

What is a pod spec, the Dockerfile?

The pod specification yaml file that you used to deploy the workload on virtual kubelet.

Could you please explain how I can obtain the virtual kubelet log?

  • If you are using virtual kubelet as an addon via AKS, you will find a pod starting with aci-connector-linux running in the kube-system namespace:

kubectl get pod -n kube-system

kubectl logs <aci-connector-linux-POD_NAME> -n kube-system

  • If you installed virtual kubelet via the helm chart, the pod runs in the default namespace:

kubectl get pod -n default

kubectl logs <virtual-kubelet-POD_NAME> -n default
eugen-nw commented 1 year ago

Certainly, please find the files attached. container-deployment.yaml.txt VK.zip

helayoty commented 1 year ago

Certainly, please find the files attached. container-deployment.yaml.txt VK.zip

Would you please upload the required info to https://gist.github.com/ and share the link with us?

eugen-nw commented 1 year ago

Please see if you can use this gist: https://gist.github.com/eugen-nw/fd4526514a4ab6c3b9d995b0e76e9475 The vk.log files should have 50478 lines.

Fei-Guo commented 1 year ago

@eugen-nw I'd like to understand your workload a little more, since you mentioned that it was working before. Do you need an IP for the workload running inside the Windows container? If yes, how was the IP shown before? ACI currently does not support vNet private IPs for Windows containers, so only a public IP works there.

We recently added a check such that, from VK's perspective, if ACI does not return an IP for the container instance, it treats the Pod as not ready, because in native K8s a Pod has to have an IP when its state becomes Ready. That being said, to be K8s API compatible, VK would have to set the public IP request for Windows containers explicitly (which is not done today).

While thinking about the solution, I am wondering about the IP requirement of your use case.
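The check described above, which is what keeps the Pods stuck in 'Pending', can be sketched roughly as follows. This is an illustration of the rule, not the actual virtual-kubelet source; the function and names are hypothetical:

```go
package main

import "fmt"

// podPhase is a hypothetical sketch of the rule described above: even if the
// ACI container group is running, a Pod without an IP is never reported as
// Running, so kubectl keeps showing it as Pending.
func podPhase(containerRunning bool, podIP string) string {
	if !containerRunning {
		return "Pending"
	}
	if podIP == "" {
		// Native Kubernetes expects a Ready Pod to carry a Pod IP, so the
		// status update is skipped when ACI returns no IP.
		return "Pending"
	}
	return "Running"
}

func main() {
	fmt.Println(podPhase(true, ""))            // Windows ACI instance, no IP
	fmt.Println(podPhase(true, "10.240.0.14")) // instance with an IP assigned
}
```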

eugen-nw commented 1 year ago

@Fei-Guo Below is the behavior I am experiencing across two AKS instances, which shows that it is possible for VK to obtain the "Running" status of my Container from ACI. Can't you guys repro this "Pending" status issue on your side?

I have little experience with VK, AKS, Containers, etc. I have set up only 6 AKS environments over the past 3 years; all use VK to run in ACI. This is the first AKS where I'm seeing my running Pods display this "Pending" status. Another AKS instance that I set up on May 20, 2022 displays the "Running" status. The Container code is identical. The two AKS instances run within the same Azure Subscription, so they share ACI.

(screenshot)

The characteristics of the VK instance that runs there are below:


Name:         virtual-kubelet-virtual-kubelet-aci-for-aks-5f9b8ccbcf-5tn4g
Namespace:    default
Priority:     0
Node:         aks-agentpool-13538704-vmss000000/10.240.0.4
Start Time:   Fri, 20 May 2022 15:41:39 -0700
Labels:       app=virtual-kubelet-virtual-kubelet-aci-for-aks
              pod-template-hash=5f9b8ccbcf
Annotations:  checksum/secret: a9105fe650e6dea3914921605d3181a60b310e4d6bfa58f3625a1aa97742fe9f
Status:       Running
IP:           10.240.0.14
IPs:
  IP:           10.240.0.14
Controlled By:  ReplicaSet/virtual-kubelet-virtual-kubelet-aci-for-aks-5f9b8ccbcf
Containers:
  virtual-kubelet-virtual-kubelet-aci-for-aks:
    Container ID:  containerd://444785f9b5575a9eaa39e38f65709b2947ab6d17ec586da206d5af14cbad72a1
    Image:         mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.2
    Image ID:      mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet@sha256:04761be99f594b109825e50b3fd324bf3f7820f28c1b09c916b64d122ecd29bc
    Port:          <none>
    Host Port:     <none>
    Command:
      virtual-kubelet
    Args:
      --provider
      azure
      --namespace

      --nodename
      virtual-kubelet
      --authentication-token-webhook=true
      --client-verify-ca
      /etc/kubernetes/certs/ca.crt
      --no-verify-clients=false
      --os
      Windows
    State:          Running
      Started:      Fri, 20 May 2022 15:41:44 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      KUBELET_PORT:             10250
      APISERVER_CERT_LOCATION:  /etc/virtual-kubelet/cert.pem
      APISERVER_KEY_LOCATION:   /etc/virtual-kubelet/key.pem
      VKUBELET_POD_IP:           (v1:status.podIP)
      VKUBELET_TAINT_KEY:       virtual-kubelet.io/provider
      VKUBELET_TAINT_VALUE:     azure
      VKUBELET_TAINT_EFFECT:    NoSchedule
      ACS_CREDENTIAL_LOCATION:  /etc/acs/azure.json
      AZURE_TENANT_ID:
      AZURE_SUBSCRIPTION_ID:
      AZURE_CLIENT_ID:
      AZURE_CLIENT_SECRET:      <set to the key 'clientSecret' in secret 'virtual-kubelet-virtual-kubelet-aci-for-aks'>  Optional: false
      ACI_RESOURCE_GROUP:
      ACI_REGION:
      ACI_EXTRA_USER_AGENT:     helm-chart/aks/virtual-kubelet-aci-for-aks/1.4.0
      MASTER_URI:               https://aks-logistics-dns-ee15e054.hcp.westus.azmk8s.io:443
    Mounts:
      /etc/acs/azure.json from acs-credential (rw)
      /etc/kubernetes/certs from certificates (ro)
      /etc/virtual-kubelet from credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4mrtj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  virtual-kubelet-virtual-kubelet-aci-for-aks
    Optional:    false
  certificates:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/certs
    HostPathType:
  acs-credential:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/azure.json
    HostPathType:  File
  kube-api-access-4mrtj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              beta.kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:                      <none>
Fei-Guo commented 1 year ago

@eugen-nw Thanks for the reply. Actually, my question is what type of workload is running in the ACI Windows instance, given that the ACI instance does not have any vNIC with an IP configured. Does your workload need any inbound/outbound network traffic?

I rarely see a workload that does not need networking in a container deployment. Hence the question.

eugen-nw commented 1 year ago

Our Container gets its input from a Service Bus Queue.

Fei-Guo commented 1 year ago

Our Container gets its input from a Service Bus Queue.

How can the application get its input from a Service Bus Queue when the Windows ACI instance does not have any networking configured? Am I missing something?

eugen-nw commented 1 year ago

I'm very sorry, but I just do not know. Could it access storage if it did not have any networking?

It works fine on 5 other AKS clusters; only on this one does it not. It is the exact same code and has been operating this way since day 1.

eugen-nw commented 1 year ago

I'm starting to see where your questions come from. I see in VK's log messages like time="2022-11-30T20:58:39Z" level=error msg="failed to retrieve pod aks-aci-boldiq-workforce-gozen-678d4cf57f-c8hjv status from provider" error="IPAddress cannot be nil for container group default-aks-aci-boldiq-workforce-gozen-678d4cf57f-c8hjv" method=PodsTracker.processPodUpdates node=virtual-kubelet operatingSystem=Windows provider=azure watchedNamespace=

Maybe the status retrieval functionality changed in this version of VK? Would you consider looking at that log in the other AKS where Helm chart 1.4.0 works properly? As I said, it is the exact same Container running there as well.

One idea is for you to reach out to the ACI folks, but the error is not on their side since all other Containers display their status properly.

For a twist: it is the VK that starts up each Pod in ACI, right? Is there a slight possibility that in the past whenever VK was encountering Pods that did not have a public IP address, it was assigning each new Pod instance one, so the VK could maintain communications with it? Maybe that feature was eliminated from VK in order to save on the costs of public IP addresses?

eugen-nw commented 1 year ago

Another guess: since the VK instance cannot communicate with the CG running in ACI, could it be that that fact is causing the behavior I reported in #378? VK has no way of telling the CG to die, so it leaves it behind running.

Fei-Guo commented 1 year ago

@eugen-nw Yes, the behavior change is due to a recent change in VK to skip the pod update if the pod does not have an IP. In your case it seems to be OK for a running ACI instance to have no IP, but for broader K8s use cases we cannot mark a Pod Ready if it does not have an IP, because this will confuse other components such as the service/endpoint controllers. An empty Pod IP is not a problem for ACI, since the container group state only depends on the state of the running container.

The fix could be one of the following:

  1. Force allocation of a public IP for Windows containers (it seems that ACI would not charge more for a public IP).

  2. Go back to the old behavior without checking for IP existence.

I was leaning toward 1, but I am starting to wonder how your use case can work without an IP. Can you please check other running Windows ACI instances with an older VK and see if there is a public IP configured there? Note that without an IP, you cannot even do kubectl logs to retrieve logs from your ACI instance.

eugen-nw commented 1 year ago

I checked our other Container instances running in ACI, and none of them has a public IP address. We never needed one. Service Bus is not calling us; rather, its client library running in our software establishes a long-lived connection with Service Bus through which all communications flow.

I think that option 2 would be the better scenario. The Producer-Consumer pattern that we're implementing (the Container is the Consumer that picks messages from a Queue) is well known, so VK will encounter other Container instances with no public IP.

We scale out several times per day to hundreds of Containers running on top of VK. It would be totally wasteful for us to maintain such a large number of public IPs that we do not need for communication purposes.

Fei-Guo commented 1 year ago

We can temporarily move back to the old behavior for Windows containers, and add the check back once ACI supports vNet private IPs.

eugen-nw commented 1 year ago

Yes, PLEASE move back to the old behavior. Is there an ETA please?

Fei-Guo commented 1 year ago

All regions are reverted to 1.4.5, which should not have this issue. This problem will be addressed in 1.4.8, which will be released Jan next year.

eugen-nw commented 1 year ago

Does this mean that if I helm-deploy VK again it will deploy 1.4.5 instead of 1.4.7?

Fei-Guo commented 1 year ago

No. It means if you create an AKS cluster with VK enabled, the VK version is 1.4.5. For any existing clusters that enable the VK addon, the VK version is going to be 1.4.5. This does not affect any VK that you installed manually.

eugen-nw commented 1 year ago

The 1.4.7 Windows VK that I installed manually does not work - hence this ticket - so there's no use for me to keep it around. I gather that if I uninstall 1.4.7 and I helm-install again the Windows VK, I will get the 1.4.5 right?

Fei-Guo commented 1 year ago

No, you have to use the 1.4.5 helm chart to install 1.4.5 manually, like the instructions mentioned here: https://github.com/virtual-kubelet/azure-aci/blob/master/docs/DOWNGRADE-README.md

mishnz commented 1 year ago

@Fei-Guo Rolling back from 1.4.7 to 1.4.5 has removed a change that added a lot of regions for the ACI service: https://github.com/virtual-kubelet/azure-aci/commit/2263e89bbfcf3de4ab06b9976aa58683a419beb1 I can no longer use the ACI service in those regions since this rollback... Will I have to wait until January for this to be resolved as you indicated?

helayoty commented 1 year ago

@Fei-Guo Rolling back from 1.4.7 to 1.4.5 has removed a change that added a lot of regions for the ACI service: 2263e89 I can no longer use the ACI service in those regions since this rollback... Will I have to wait until January for this to be resolved as you indicated?

@mishnz you can follow this document to use 1.4.7 for now.

eugen-nw commented 1 year ago

MANY THANKS for 1. providing instructions on how to install a version that does not have this problem 2. removing this functionality in 1.4.8!

I made an attempt to use the Option 1 command line from https://github.com/virtual-kubelet/azure-aci/blob/master/docs/DOWNGRADE-README.md to install the Windows VK. My command line is:

helm install $CHART_NAME $CHART_URL \
  --set provider=azure \
  --set providers.azure.masterUri=$MASTER_URI \
  --set nodeName=$NODE_NAME \
  --set image.repository=$IMG_URL \
  --set image.name=$IMG_REPO \
  --set nodeOsType="Windows" \
  --set image.tag=$IMG_TAG \
  --set nodeOsType="Windows" \
  --set providers.azure.masterUri=$MASTER_URI \
  --set providers.azure.vnet.enabled=$ENABLE_VNET \
  --set providers.azure.vnet.subnetName=$VIRTUAL_NODE_SUBNET_NAME \
  --set providers.azure.vnet.subnetCidr=$VIRTUAL_NODE_SUBNET_RANGE \
  --set providers.azure.vnet.clusterCidr=$CLUSTER_SUBNET_RANGE \
  --set providers.azure.vnet.kubeDnsIp=$KUBE_DNS_IP \
  --set providers.azure.managedIdentityID=$VIRTUALNODE_USER_IDENTITY_CLIENTID

The Pod does not start. It appears that the image tag is not resolving correctly; the "image.name" value seems to be missing. What should I do differently in order to install the 1.4.5 Windows VK? Is the --set image.name=$IMG_REPO option setting IMG_REPO on the right field of "image"?

Also, could the Error: InvalidImageName message say that the Windows VK is not available?

kdp virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-5m528
Name:         virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-5m528
Namespace:    default
Priority:     0
Node:         aks-agentpool-30560331-vmss000000/10.240.0.4
Start Time:   Wed, 07 Dec 2022 17:06:19 -0800
Labels:       app=virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci
              pod-template-hash=7f84fd5c59
Annotations:  checksum/secret: cbc42ea74f18df2651df948690ceaab397825bdee31ab8760581014bd93ea5e2
Status:       Pending
IP:           10.240.0.110
IPs:
  IP:           10.240.0.110
Controlled By:  ReplicaSet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci-7f84fd5c59
Containers:
  virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci:
    Container ID:
    Image:         mcr.microsoft.com/:1.4.5
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      virtual-kubelet
    Args:
      --provider
      azure
      --namespace

      --nodename
      virtual-kubelet
      --authentication-token-webhook=true
      --client-verify-ca
      /etc/kubernetes/certs/ca.crt
      --no-verify-clients=false
      --os
      Windows
    State:          Waiting
      Reason:       InvalidImageName
    Ready:          False
    Restart Count:  0
    Environment:
      KUBELET_PORT:                        10250
      APISERVER_CERT_LOCATION:             /etc/virtual-kubelet/cert.pem
      APISERVER_KEY_LOCATION:              /etc/virtual-kubelet/key.pem
      VKUBELET_POD_IP:                      (v1:status.podIP)
      VKUBELET_TAINT_KEY:                  virtual-kubelet.io/provider
      VKUBELET_TAINT_VALUE:                azure
      VKUBELET_TAINT_EFFECT:               NoSchedule
      VIRTUALNODE_USER_IDENTITY_CLIENTID:  e3f86d26-b2a5-4f9e-a4c3-b10c08cb4235
      AKS_CREDENTIAL_LOCATION:             /etc/aks/azure.json
      AZURE_TENANT_ID:
      AZURE_SUBSCRIPTION_ID:
      AZURE_CLIENT_ID:
      AZURE_CLIENT_SECRET:                 <set to the key 'clientSecret' in secret 'virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci'>  Optional: false
      ACI_RESOURCE_GROUP:
      ACI_REGION:
      ACI_EXTRA_USER_AGENT:                helm-chart/aks/virtual-kubelet-azure-aci/1.4.5
      ACI_VNET_SUBSCRIPTION_ID:
      ACI_VNET_RESOURCE_GROUP:
      ACI_VNET_NAME:
      ACI_SUBNET_NAME:                     virtual-node-aci
      ACI_SUBNET_CIDR:                     10.241.0.0/16
      MASTER_URI:                          https://aks-workforce-dns-fe064e62.hcp.westus.azmk8s.io:443
      CLUSTER_CIDR:                        10.0.0.0/16
      KUBE_DNS_IP:                         10.0.0.10
      ENABLE_REAL_TIME_METRICS:            true
      USE_VK_VERSION_2:                    true
    Mounts:
      /etc/aks/azure.json from aks-credential (rw)
      /etc/kubernetes/certs from certificates (ro)
      /etc/virtual-kubelet from credentials (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-phfxk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci
    Optional:    false
  certificates:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/certs
    HostPathType:
  aks-credential:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/azure.json
    HostPathType:  File
  kube-api-access-phfxk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              beta.kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason         Age               From                                        Message
  ----     ------         ----              ----                                        -------
  Normal   Scheduled      <unknown>                                                     Successfully assigned default/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-5m528 to aks-agentpool-30560331-vmss000000
  Warning  InspectFailed  7s (x7 over 82s)  kubelet, aks-agentpool-30560331-vmss000000  Failed to apply default image tag "mcr.microsoft.com/:1.4.5": couldn't parse image reference "mcr.microsoft.com/:1.4.5": invalid reference format
  Warning  Failed         7s (x7 over 82s)  kubelet, aks-agentpool-30560331-vmss000000  Error: InvalidImageName
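The InspectFailed event above shows what went wrong: the chart joins its image values into a single reference of the form <repository>/<name>:<tag>, and with the name component empty the result is "mcr.microsoft.com/:1.4.5", which is not a valid image reference. A minimal sketch of that composition (a hypothetical helper, not the chart's actual template logic):

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef hypothetically mimics how a Helm template might join the values
// into <repository>/<name>:<tag>. When name is empty, the result is the
// malformed "mcr.microsoft.com/:1.4.5" seen in the events above.
func imageRef(repository, name, tag string) (string, error) {
	ref := fmt.Sprintf("%s/%s:%s", repository, name, tag)
	// A valid reference needs a non-empty name after the registry/path.
	if name == "" || strings.Contains(ref, "/:") {
		return ref, fmt.Errorf("invalid reference format: %q", ref)
	}
	return ref, nil
}

func main() {
	// Empty image.name reproduces the InvalidImageName failure.
	if _, err := imageRef("mcr.microsoft.com", "", "1.4.5"); err != nil {
		fmt.Println(err)
	}
	// A complete name yields a well-formed reference.
	ref, _ := imageRef("mcr.microsoft.com/oss/virtual-kubelet", "virtual-kubelet", "1.4.5")
	fmt.Println(ref)
}
```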
eugen-nw commented 1 year ago

Could someone please help me install 1.4.5? As I mentioned above a week ago, the installation of 1.4.5 does not work using the instructions provided.

Fei-Guo commented 1 year ago

You can simply edit the deployment

kubectl edit deployment virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure

and fix the image url to

image: virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5

eugen-nw commented 1 year ago

Thanks very much! I did not know that one can edit a deployment. However, I still could not get it to work. What should I do now?

  1. image: virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5
     Failed to pull image "virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": failed to resolve reference "docker.io/library/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": pull access denied, repository does not exist

  2. image: mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5
     Failed to pull image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": failed to resolve reference "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5: not found

  3. image: mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5
     Failed to pull image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": rpc error: code = NotFound desc = failed to pull and unpack image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": failed to resolve reference "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5": mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure:1.4.5: not found

  4. This attempt found the image but it cannot run it:
     image: mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
Events:
  Type     Reason     Age                  From                                        Message
  ----     ------     ----                 ----                                        -------
  Normal   Scheduled  <unknown>                                                        Successfully assigned default/virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-qzv5w to aks-agentpool-30560331-vmss000000
  Normal   Pulled     10m                  kubelet, aks-agentpool-30560331-vmss000000  Successfully pulled image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5" in 144.967687ms
  Normal   Pulled     10m                  kubelet, aks-agentpool-30560331-vmss000000  Successfully pulled image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5" in 332.131485ms
  Normal   Pulled     10m                  kubelet, aks-agentpool-30560331-vmss000000  Successfully pulled image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5" in 139.368742ms
  Normal   Created    9m50s (x4 over 10m)  kubelet, aks-agentpool-30560331-vmss000000  Created container virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci
  Normal   Started    9m50s (x4 over 10m)  kubelet, aks-agentpool-30560331-vmss000000  Started container virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-aci
  Normal   Pulled     9m50s                kubelet, aks-agentpool-30560331-vmss000000  Successfully pulled image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5" in 167.126528ms
  Normal   Pulling    9m1s (x5 over 10m)   kubelet, aks-agentpool-30560331-vmss000000  Pulling image "mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5"
  Warning  BackOff    26s (x48 over 10m)   kubelet, aks-agentpool-30560331-vmss000000  Back-off restarting failed container
helayoty commented 1 year ago

@eugen-nw the image should be mcr.microsoft.com/oss/virtual-kubelet/virtual-kubelet:1.4.5

eugen-nw commented 1 year ago

@helayoty I just showed above that that particular 1.4.5 image is not working. For extra info:

kubectl logs virtual-kubelet-azure-aci-downgrade-virtual-kubelet-azure-bxwqh
WARNING: Package "github.com/golang/protobuf/protoc-gen-go/generator" is deprecated.
        A future release of golang/protobuf will delete this package,
        which has long been excluded from the compatibility promise.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xb0 pc=0x14d8591]

goroutine 1 [running]:
github.com/virtual-kubelet/azure-aci/provider.NewACIProvider(0x0, 0x0, 0xc00007cba0, 0x7ffe8ae23a22, 0xf, 0x7ffe8ae23aa5, 0x7, 0xc000042050, 0xb, 0x280a, ...)
        /go/src/github.com/virtual-kubelet/azure-aci/provider/aci.go:241 +0x1f1
main.main.func1(0x0, 0x0, 0x7ffe8ae23a22, 0xf, 0x7ffe8ae23aa5, 0x7, 0xc000042050, 0xb, 0x280a, 0x1abf00e, ...)
        /go/src/github.com/virtual-kubelet/azure-aci/cmd/virtual-kubelet/main.go:67 +0xc5
github.com/virtual-kubelet/node-cli/internal/commands/root.runRootCommandWithProviderAndClient(0x1d12358, 0xc0000b83c0, 0x1b7d808, 0x1d35bb8, 0xc0000782c0, 0xc0003b6780, 0x0, 0x0)
        /go/pkg/mod/github.com/virtual-kubelet/node-cli@v0.7.0/internal/commands/root/root.go:163 +0x8f8
github.com/virtual-kubelet/node-cli/internal/commands/root.runRootCommand(0x1d12358, 0xc0000b83c0, 0xc0004ce830, 0xc0003b6780, 0x0, 0x0)
        /go/pkg/mod/github.com/virtual-kubelet/node-cli@v0.7.0/internal/commands/root/root.go:81 +0xfe
github.com/virtual-kubelet/node-cli/internal/commands/root.NewCommand.func1(0xc00013c000, 0xc0002fa000, 0x0, 0xc, 0x0, 0x0)
        /go/pkg/mod/github.com/virtual-kubelet/node-cli@v0.7.0/internal/commands/root/root.go:56 +0x50
github.com/spf13/cobra.(*Command).execute(0xc00013c000, 0xc00009ab70, 0xc, 0xc, 0xc00013c000, 0xc00009ab70)
        /go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:842 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc00013c000, 0xc00021e1b0, 0xc00013c2c0, 0xc00013cdc0)
        /go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        /go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:880
github.com/virtual-kubelet/node-cli.(*Command).Run(0xc00021e1b0, 0x1d12358, 0xc0000b83c0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/virtual-kubelet/node-cli@v0.7.0/cli.go:170 +0x85
main.main()
        /go/src/github.com/virtual-kubelet/azure-aci/cmd/virtual-kubelet/main.go:83 +0x5ff
eugen-nw commented 1 year ago

I "helm uninstall"-ed both 1.4.7 and 1.4.5, installed the 1.4.2 Windows VK. Using the VK version 1.4.2, kubectl get pods shows the correct state of the Pods that run on the virtual Node that the VK creates.

For anyone wanting to install 1.4.2, please see below the commands I'd used.

$CHART_NAME="virtual-kubelet-azure-aci"
$NODE_NAME="virtual-kubelet"
$CHART_URL="https://github.com/virtual-kubelet/azure-aci/raw/gh-pages/charts/virtual-kubelet-1.4.2.tgz"
kubectl cluster-info
$MASTER_URI="<the Kubernetes master URI from kubectl cluster-info above>"   # looks like "https://<cluster name lowercase>-dns-<some identifier>.hcp.westus.azmk8s.io:443"
helm install  $CHART_NAME  $CHART_URL  --set provider=azure  --set providers.azure.masterUri=$MASTER_URI  --set nodeName=$NODE_NAME  --set nodeOsType="Windows"
eugen-nw commented 1 year ago

It's not fair to close this issue because it is present in 1.4.7 and needs to be addressed.

helayoty commented 1 year ago

I "helm uninstall"-ed both 1.4.7 and 1.4.5, installed the 1.4.2 Windows VK. Using the VK version 1.4.2, kubectl get pods shows the correct state of the Pods that run on the virtual Node that the VK creates.

For anyone wanting to install 1.4.2, please see below the commands I'd used.

$CHART_NAME="virtual-kubelet-azure-aci"
$NODE_NAME="virtual-kubelet"
$CHART_URL="https://github.com/virtual-kubelet/azure-aci/raw/gh-pages/charts/virtual-kubelet-1.4.2.tgz"
kubectl cluster-info
$MASTER_URI="<the Kubernetes master URI from kubectl cluster-info above>"   # looks like "https://<cluster name lowercase>-dns-<some identifier>.hcp.westus.azmk8s.io:443"
helm install  $CHART_NAME  $CHART_URL  --set provider=azure  --set providers.azure.masterUri=$MASTER_URI  --set nodeName=$NODE_NAME  --set nodeOsType="Windows"

@eugen-nw Thanks for pointing that out. We figured out that the .tgz binaries were the issue and fixed it for all releases. Would you kindly try to run either 1.4.5 or 1.4.7 again?

helayoty commented 1 year ago

A new version 1.4.8 has been released today that will address this issue. You can use it by installing the helm chart in your cluster. This version will be available as a default addon for the Virtual Node by Jan 2023.

https://github.com/virtual-kubelet/azure-aci/releases/tag/1.4.8