projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request #7555

Open · lucasscheepers opened this issue 1 year ago

lucasscheepers commented 1 year ago

@caseydavenport You asked me to raise this as a separate issue.

I installed the latest version of calico using this helm chart. The kube-apiserver-kmaster1 returns the following error in the logs: v3.projectcalico.org failed with: failing or missing response from https://**:443/apis/projectcalico.org/v3.

Also, after practically every kubectl command it prints errors about these CRDs.

E0406 15:33:42.335160   55793 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
NAME       STATUS   ROLES           AGE   VERSION
kmaster1   Ready    control-plane   19d   v1.26.3
kworker1   Ready    <none>          18d   v1.26.3
kworker2   Ready    <none>          18d   v1.26.3

These CRDs are automatically installed by the helm chart mentioned above.

--> k api-resources | grep calico
E0406 15:34:26.465805   55853 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0406 15:34:26.481896   55853 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
bgpconfigurations                              crd.projectcalico.org/v1               false        BGPConfiguration
bgppeers                                       crd.projectcalico.org/v1               false        BGPPeer
blockaffinities                                crd.projectcalico.org/v1               false        BlockAffinity
caliconodestatuses                             crd.projectcalico.org/v1               false        CalicoNodeStatus
clusterinformations                            crd.projectcalico.org/v1               false        ClusterInformation
felixconfigurations                            crd.projectcalico.org/v1               false        FelixConfiguration
globalnetworkpolicies                          crd.projectcalico.org/v1               false        GlobalNetworkPolicy
globalnetworksets                              crd.projectcalico.org/v1               false        GlobalNetworkSet
hostendpoints                                  crd.projectcalico.org/v1               false        HostEndpoint
ipamblocks                                     crd.projectcalico.org/v1               false        IPAMBlock
ipamconfigs                                    crd.projectcalico.org/v1               false        IPAMConfig
ipamhandles                                    crd.projectcalico.org/v1               false        IPAMHandle
ippools                                        crd.projectcalico.org/v1               false        IPPool
ipreservations                                 crd.projectcalico.org/v1               false        IPReservation
kubecontrollersconfigurations                  crd.projectcalico.org/v1               false        KubeControllersConfiguration
networkpolicies                                crd.projectcalico.org/v1               true         NetworkPolicy
networksets                                    crd.projectcalico.org/v1               true         NetworkSet
error: unable to retrieve the complete list of server APIs: projectcalico.org/v3: the server is currently unable to handle the request

Do I understand it correctly that these crd.projectcalico.org/v1 CRDs are still needed - so I should not delete them - and that I need to manually install the v3 CRDs? If so, where can I download these v3 CRDs? I can't find them.

I believe chet-tuttle-3 is facing similar issues.

caseydavenport commented 1 year ago

Do I understand it correctly that these crd.projectcalico.org/v1 CRDs are still needed - so I should not delete them - and that I need to manually install the v3 CRDs? If so, where can I download these v3 CRDs? I can't find them.

The v1 resources are CRDs and should be present - definitely don't delete those.

The v3 resources are not CRDs - they are implemented by the calico-apiserver pod in the calico-apiserver namespace.

the server is currently unable to handle the request

This suggests a problem with the Calico API server, or a problem with the kube-apiserver being unable to communicate with the Calico API server. I'd start by checking the status of the v3.projectcalico.org APIService, the calico-apiserver pod logs, and the tigerastatus for the apiserver.
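For reference, the usual checks here are standard kubectl commands along these lines (the calico-apiserver namespace is the default used by the tigera-operator):

# Is the aggregated API registered and reported as Available?
kubectl get apiservice v3.projectcalico.org
kubectl describe apiservice v3.projectcalico.org

# Is the Calico API server pod healthy, and what do its logs say?
kubectl get pods -n calico-apiserver
kubectl logs -n calico-apiserver -l k8s-app=calico-apiserver

# What does the operator report for the apiserver component?
kubectl get tigerastatus apiserver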

lucasscheepers commented 1 year ago

The only apiservice that has a status of False is v3.projectcalico.org, which has the following error message: failing or missing response from https://***:443/apis/projectcalico.org/v3: Get "https://***:443/apis/projectcalico.org/v3": context deadline exceeded

➜ ~ kubectl describe apiservice v3.projectcalico.org

Name:         v3.projectcalico.org
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata: 
  Creation Timestamp:  2023-04-06T12:54:20Z
  Managed Fields:
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"***"}:
      f:spec:
        f:caBundle:
        f:group:
        f:groupPriorityMinimum:
        f:service:
          .:
          f:name:
          f:namespace:
          f:port:
        f:version:
        f:versionPriority:
    Manager:      operator
    Operation:    Update
    Time:         2023-04-06T12:54:20Z
    API Version:  apiregistration.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          .:
          k:{"type":"Available"}:
            .:
            f:lastTransitionTime:
            f:message:
            f:reason:
            f:status:
            f:type:
    Manager:      kube-apiserver
    Operation:    Update
    Subresource:  status
    Time:         2023-04-18T12:35:03Z
  Owner References:
    API Version:           operator.tigera.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  APIServer
    Name:                  default
    UID:                   ***
  Resource Version:       ***
  UID:                     ***
Spec:
  Ca Bundle:              ***
  Group:                   projectcalico.org
  Group Priority Minimum:  1500
  Service:
    Name:            calico-api
    Namespace:       calico-apiserver
    Port:            443
  Version:           v3
  Version Priority:  200
Status:
  Conditions:
    Last Transition Time:  2023-04-06T12:54:20Z
    Message:               failing or missing response from https://10.107.208.239:443/apis/projectcalico.org/v3: Get "https://10.107.208.239:443/apis/projectcalico.org/v3": dial tcp 10.107.208.239:443: i/o timeout
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>

The logs of the calico-apiserver look like this: ➜ ~ kubectl logs --tail=-1 -n calico-apiserver -l k8s-app=calico-apiserver

Version:      v3.25.1
Build date:   2023-03-30T23:52:23+0000
Git tag ref:  v3.25.1
Git commit:   82dadbce1
I0413 15:15:19.483989       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0413 15:15:19.484036       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0413 15:15:19.604542       1 run_server.go:69] Running the API server
I0413 15:15:19.604578       1 run_server.go:58] Starting watch extension
W0413 15:15:19.606431       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0413 15:15:19.630055       1 secure_serving.go:210] Serving securely on [::]:5443
I0413 15:15:19.630147       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0413 15:15:19.630257       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0413 15:15:19.630679       1 run_server.go:80] apiserver is ready.
I0413 15:15:19.631104       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0413 15:15:19.631114       1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0413 15:15:19.631204       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0413 15:15:19.631212       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:15:19.631282       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0413 15:15:19.631290       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:15:19.732007       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:15:19.732076       1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0413 15:15:19.732510       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
Version:      v3.25.1
Build date:   2023-03-30T23:52:23+0000
Git tag ref:  v3.25.1
Git commit:   82dadbce1
I0413 15:15:45.802642       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0413 15:15:45.802806       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0413 15:15:45.871553       1 run_server.go:58] Starting watch extension
I0413 15:15:45.871726       1 run_server.go:69] Running the API server
W0413 15:15:45.872885       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0413 15:15:45.885723       1 secure_serving.go:210] Serving securely on [::]:5443
I0413 15:15:45.886356       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0413 15:15:45.886370       1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0413 15:15:45.886523       1 run_server.go:80] apiserver is ready.
I0413 15:15:45.886549       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0413 15:15:45.886667       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0413 15:15:45.888123       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0413 15:15:45.888133       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:15:45.888363       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0413 15:15:45.888375       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:15:45.986627       1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0413 15:15:45.988477       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:15:45.988829       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file

And the tigerastatus apiserver looks like this: ➜ ~ kubectl describe tigerastatus apiserver

Name:         apiserver
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  operator.tigera.io/v1
Kind:         TigeraStatus
Metadata:
  Creation Timestamp:  2023-03-24T16:01:19Z
  Generation:          1
  Managed Fields:
    API Version:  operator.tigera.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
    Manager:      operator
    Operation:    Update
    Time:         2023-03-24T16:01:19Z
    API Version:  operator.tigera.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
    Manager:         operator
    Operation:       Update
    Subresource:     status
    Time:            2023-04-13T15:15:56Z
  Resource Version:  ***
  UID:               ***
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-04-06T12:54:24Z
    Message:               All Objects Available
    Observed Generation:   1
    Reason:                AllObjectsAvailable
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2023-04-13T15:15:56Z
    Message:               All objects available
    Observed Generation:   1
    Reason:                AllObjectsAvailable
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-04-13T15:15:56Z
    Message:               All Objects Available
    Observed Generation:   1
    Reason:                AllObjectsAvailable
    Status:                False
    Type:                  Progressing
Events:                    <none>

@caseydavenport Can you maybe point me in the correct direction with this information?

kinbod commented 1 year ago

I'm also hitting this issue; my cluster runs on OpenStack VMs.

0HFgX1pM8hUbvsCpAl3D commented 1 year ago

@lucasscheepers I was running into the same issue and was able to get around it by following the Manifest Install directions here: https://docs.tigera.io/calico/latest/operations/install-apiserver

Specifically, the patch command fixed the issue:

kubectl patch apiservice v3.projectcalico.org -p \
    "{\"spec\": {\"caBundle\": \"$(kubectl get secret -n calico-apiserver calico-apiserver-certs -o go-template='{{ index .data "apiserver.crt" }}')\"}}"
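After patching, the APIService should eventually report as available again; a quick way to confirm (a generic check, not from the linked doc):

kubectl get apiservice v3.projectcalico.org
# the AVAILABLE column should show True once the kube-apiserver can reach the Calico API server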

ndacic commented 1 year ago

Getting the same issue while trying to install the prometheus operator.

❯ k get apiservices.apiregistration.k8s.io -A
NAME                             SERVICE                       AVAILABLE                      AGE
v2.autoscaling                   Local                         True                           59d
v2beta1.helm.toolkit.fluxcd.io   Local                         True                           11d
v2beta2.autoscaling              Local                         True                           59d
v3.projectcalico.org             calico-apiserver/calico-api   False (FailedDiscoveryCheck)   6m38s

❯ k get apiservices.apiregistration.k8s.io  v3.projectcalico.org -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2023-05-08T00:17:15Z"
  name: v3.projectcalico.org
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: APIServer
    name: default
    uid: 34d2ccfa-07e2-4ec8-82b1-25a3e9e3be73
  resourceVersion: "25407370"
  uid: 830669ef-f81a-4e2d-9765-e3e2066f8f33
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR5akNDQXJLZ0F3SUJBZ0lJVjgwSlNBRkFKWkF3RFFZSktvWklodmNOQVFFTEJRQXdJVEVmTUIwR0ExVUUKQXhNV2RHbG5aWEpoTFc5d1pYSmhkRzl5TFhOcFoyNWxjakFlRncweU16QTFNRGN3TVRNMU5EVmF
  group: projectcalico.org
  groupPriorityMinimum: 1500
  service:
    name: calico-api
    namespace: calico-apiserver
    port: 443
  version: v3
  versionPriority: 200
status:
  conditions:
  - lastTransitionTime: "2023-05-08T00:17:15Z"
    message: 'failing or missing response from https://10.20.3.132:5443/apis/projectcalico.org/v3:
      Get "https://10.20.3.132:5443/apis/projectcalico.org/v3": dial tcp 10.20.3.132:5443:
      i/o timeout'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
ndacic commented 1 year ago

resolved by using new helm release version for prometheus operator

ndacic commented 1 year ago

but still getting

❯ k get apiservices.apiregistration.k8s.io v3.projectcalico.org -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2023-05-11T19:04:37Z"
  name: v3.projectcalico.org
  ownerReferences:
spec:
  group: projectcalico.org
  groupPriorityMinimum: 1500
  service:
    name: calico-api
    namespace: calico-apiserver
    port: 443
  version: v3
  versionPriority: 200
status:
  conditions:
  - lastTransitionTime: "2023-05-11T19:04:37Z"
    message: 'failing or missing response from https://10.10.101.253:5443/apis/projectcalico.org/v3:
      Get "https://10.10.101.253:5443/apis/projectcalico.org/v3": dial tcp 10.10.101.253:5443:
      i/o timeout'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
ndacic commented 1 year ago

@lucasscheepers did you manage to resolve it? Namespace deletions get stuck in terminating state due to this apiservice not being responsive. Any way of turning it off?

suciuandrei94 commented 1 year ago

We are facing the same problem when installing the tigera-operator helm chart with the APIServer enabled on an EKS cluster.

dhananjaipai commented 1 year ago

We are facing the same problem when installing the tigera-operator helm chart with the APIServer enabled on an EKS cluster.

I was also facing the same issue, and fixed it for the time being by running

kubectl delete apiserver default

Based on https://docs.tigera.io/calico/latest/operations/install-apiserver#uninstall-the-calico-api-server

Since we are using the default Calico helm chart based install, I think the apiserver was getting created but perhaps not configured properly. And since I doubt we need to update Calico settings from kubectl as part of our use case, I think it is best to delete it for now. I will also try to find a helm value in the tigera-operator chart to disable this from the start if possible.

PS: I am new to Calico, so please let me know if this is "unsafe" to remove, although the documentation above does not seem to suggest so.

EDIT: It is easy to disable the apiServer with the helm values:

apiServer:
  enabled: true # Change to false

Also, it seems it is not so important after all - https://docs.tigera.io/calico/latest/reference/architecture/overview#calico-api-server. The component architecture says it is only needed to manage Calico with kubectl, and I think that would logically mean it is not used from "within" the cluster.
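For example, assuming the chart was installed as release calico from the projectcalico repo into the tigera-operator namespace (adjust the names to your install), disabling it could look like:

helm upgrade calico projectcalico/tigera-operator \
  --namespace tigera-operator \
  --set apiServer.enabled=false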

suciuandrei94 commented 1 year ago

If you uninstall the apiserver, then you won't be able to install networkpolicies with a helm chart. Working around the problem doesn't solve the issue.

xpuska513 commented 1 year ago

For some reason, the calico-apiserver pod is failing its liveness probes because the apiserver is not starting correctly, or something is not working at all; because of that, the apiservice gets reported as FailedDiscoveryCheck. I tried to play around with the deployment and other things but wasn't able to achieve anything. Is there any way to enable debug logs for the apiserver?

I also saw that the CSI node driver for Calico was failing with the following error:

kubectl logs -f -n calico-system csi-node-driver-pzwsl -c csi-node-driver-registrar
/usr/local/bin/node-driver-registrar: error while loading shared libraries: libresolv.so.2: cannot open shared object file: No such file or directory
dhananjaipai commented 1 year ago

If you uninstall the apiserver, then you won't be able to install networkpolicies with a helm chart.

I am a bit confused and might be stupid here, but I think that, granted, without the Calico API server you will not be able to use the projectcalico.org/v3 apiVersion, but you should still be able to use networking.k8s.io/v1 for the NetworkPolicy resource? I don't know if there is a major difference between the two, and a quick search says I can still use the latter in a Kubernetes cluster with the Calico CNI installed.
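For what it's worth, a minimal upstream NetworkPolicy looks like this (a generic default-deny-ingress example, not taken from this thread), applied with a heredoc in the same style used later in this issue:

cat << EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF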

Working around the problem doesn't solve the issue.

Yeah, the issue definitely has to be addressed; I just wanted it to stop shouting the error a dozen times every time I try to deploy something into my EKS cluster with helm.

suciuandrei94 commented 1 year ago

The whole point of installing this operator is being able to use projectcalico.org/v3 resources.

Extends Kubernetes network policy: Calico network policy provides a richer set of policy capabilities than Kubernetes including: policy ordering/priority, deny rules, and more flexible match rules. While Kubernetes network policy applies only to pods, Calico network policy can be applied to multiple types of endpoints including pods, VMs, and host interfaces. Finally, when used with Istio service mesh, Calico network policy supports securing applications layers 5-7 match criteria, and cryptographic identity.

dhananjaipai commented 1 year ago

The whole point of installing this operator is being able to use projectcalico.org/v3 resources.

Ah, for us the whole point was to replace the default AWS EKS VPC CNI with Calico CNI, and be able to deploy more pods per node and save IPs from the VPC CIDR allocated to us - since the former gives all the pods IPs from this range and also limits the number of pods per node based on node size. For us the Calico installation using Helm from the official documentation (which installs the operator) introduced this apiServer and the related errors.

So, I guess the solution is valid if you just want the CNI and are fine with using only the K8s NetworkPolicy!

ndacic commented 1 year ago

Same here, running it with tigera on EKS. Uninstalled the API server and that resolved it. We will see what the consequences will be. According to the docs it should only block tigera CLI stuff...

coutinhop commented 1 year ago

@xpuska513 this seems like an issue with csi-node-driver-registrar, could you share more details of your setup (versions, etc)?

kubectl logs -f -n calico-system csi-node-driver-pzwsl -c csi-node-driver-registrar
/usr/local/bin/node-driver-registrar: error while loading shared libraries: libresolv.so.2: cannot open shared object file: No such file or directory

Everyone else (if you're still able to reproduce this issue), could you post kubectl logs for the Calico apiserver pod(s)?

xpuska513 commented 1 year ago

@coutinhop Deployed on EKS using the official helm chart - version 3.25, k8s version 3.22, vpc cni 1.12.6. Let me try to deploy it again on a fresh cluster tomorrow and I can gather more info on it. Are there any logs (from pods) that would be useful for me to share with you?

xpuska513 commented 1 year ago

Small observation from me: it works fine on older k8s (1.21.4) and an older vpc cni release (1.8.x).

vhelke commented 1 year ago

Everyone else (if you're still able to reproduce this issue), could you post kubectl logs for the Calico apiserver pod(s)?

$ kubectl logs calico-apiserver-7fb88d684f-fh5x7 -n calico-apiserver
E0517 13:12:47.384967   36471 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0517 13:12:47.387141   36471 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0517 13:12:47.389521   36471 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Version:      v3.25.1
Build date:   2023-03-30T23:52:23+0000
Git tag ref:  v3.25.1
Git commit:   82dadbce1
I0517 12:43:22.874404       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0517 12:43:22.874578       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0517 12:43:22.955555       1 run_server.go:69] Running the API server
I0517 12:43:22.970749       1 run_server.go:58] Starting watch extension
W0517 12:43:22.970895       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0517 12:43:22.976511       1 secure_serving.go:210] Serving securely on [::]:5443
I0517 12:43:22.976690       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0517 12:43:22.976781       1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0517 12:43:22.976884       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0517 12:43:22.977142       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0517 12:43:22.977618       1 run_server.go:80] apiserver is ready.
I0517 12:43:22.977748       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0517 12:43:22.977829       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0517 12:43:22.977901       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0517 12:43:22.977966       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0517 12:43:23.077860       1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0517 12:43:23.078006       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0517 12:43:23.078083       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
turletti commented 1 year ago

k -ncalico-apiserver logs calico-apiserver-8757dcdf8-4z79m
E0522 12:27:06.023957 1742797 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0522 12:27:06.024770 1742797 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0522 12:27:06.026818 1742797 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Error from server: Get "https://138.96.245.50:10250/containerLogs/calico-apiserver/calico-apiserver-8757dcdf8-4z79m/calico-apiserver": tunnel closed

caseydavenport commented 1 year ago

For those of you running on EKS, can you confirm that the calico-apiserver is running with hostNetwork: true set?

The Kubernetes API server needs to establish a connection with the Calico API server, and on EKS the Kubernetes API server runs in a separate Amazon managed VPC, meaning it doesn't have routing access to pod IPs (just host IPs). As such, the Calico API server needs to run with host networking. The tigera-operator should do this automatically for you, but I'd like to double check in case something isn't detecting this correctly.
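One way to check (a generic inspection command; the namespace and deployment name assume the default operator install):

kubectl get deployment calico-apiserver -n calico-apiserver \
  -o jsonpath='{.spec.template.spec.hostNetwork}'
# prints "true" when host networking is enabled; empty output means it is not set (i.e. false)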

vhelke commented 1 year ago

Everyone else (if you're still able to reproduce this issue), could you post kubectl logs for the Calico apiserver pod(s)?

$ kubectl logs calico-apiserver-7fb88d684f-fh5x7 -n calico-apiserver
E0517 13:12:47.384967   36471 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently 

For me this was some sort of firewall issue. I configured the firewall to be more permissive and I don't see this issue anymore.

xpuska513 commented 1 year ago

@caseydavenport for me on EKS it wasn't running on the host network for some reason.

caseydavenport commented 1 year ago

@xpuska513 interesting. Could you share how you installed Calico / the apiserver on this cluster?

xpuska513 commented 1 year ago

@caseydavenport I followed this guide https://docs.aws.amazon.com/eks/latest/userguide/calico.html, which referenced this doc for the tigera operator deployment: https://docs.tigera.io/calico/latest/getting-started/kubernetes/helm#install-calico. Basically I deployed the helm chart with kubernetesProvider set to EKS, and I also assume that it auto-deploys the apiserver out of the box when you deploy the operator using the helm chart.

caseydavenport commented 1 year ago

Yep, gotcha. That is the correct guide to follow. I realize now that if you are using the EKS VPC CNI plugin, then it is OK to have the apiserver running with hostNetwork: false, so that's unlikely to be the problem here.

junglie85 commented 1 year ago

Deploying today:

kubectl -n calico-apiserver logs deployment.apps/calico-apiserver
E0526 12:39:07.647072  164182 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0526 12:39:07.737485  164182 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0526 12:39:07.789410  164182 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Found 2 pods, using pod/calico-apiserver-59dcddc4d5-d4sfp
Version:      v3.25.1
Build date:   2023-03-30T23:52:23+0000
Git tag ref:  v3.25.1
Git commit:   82dadbce1
I0526 10:06:19.554880       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0526 10:06:19.554927       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0526 10:06:19.643486       1 run_server.go:69] Running the API server
I0526 10:06:19.643504       1 run_server.go:58] Starting watch extension
W0526 10:06:19.643526       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0526 10:06:19.660286       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0526 10:06:19.660394       1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0526 10:06:19.660287       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0526 10:06:19.660425       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0526 10:06:19.660318       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0526 10:06:19.660592       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0526 10:06:19.660670       1 secure_serving.go:210] Serving securely on [::]:5443
I0526 10:06:19.660521       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0526 10:06:19.660736       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0526 10:06:19.661221       1 run_server.go:80] apiserver is ready.
I0526 10:06:19.761160       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0526 10:06:19.761194       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0526 10:06:19.761256       1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController

I followed the guide linked above by @xpuska513 too, and hostNetwork: false (well, not set). I manually installed the CRDs because I was getting errors about the annotation length exceeding the maximum.

caseydavenport commented 1 year ago

because I was getting errors about the annotation length exceeding the maximum.

FWIW, this usually comes from using kubectl apply since kubectl adds the annotation. You should be able to do kubectl create and kubectl replace instead in order to avoid that, and it's what we currently recommend.
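For example (the manifest filename below is just a placeholder for whichever CRD manifest you are installing):

# first-time install - avoids the last-applied-configuration annotation that kubectl apply would add
kubectl create -f calico-crds.yaml

# subsequent updates to the same objects
kubectl replace -f calico-crds.yaml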

sig-piskule commented 1 year ago

I am having this issue as well.

Diagnostics:

I began by installing Calico on an AWS EKS cluster. I used the helm chart at v3.25.1. I skipped the Customize Helm Chart section of the documentation, because the [AWS documentation](https://docs.aws.amazon.com/eks/latest/userguide/calico.html#calico-install) brings you directly to this section, so I did not bother reading the previous sections. As a result, my initial installation was without installation.kubernetesProvider: EKS.

My initial installation succeeded, and I was about to proceed with the remainder of the EKS documentation. However, I noticed that I did not like my choice of name for the helm installation. I chose the namespace calico instead of tigera-operator, and I wanted to match the AWS documentation. As a result, I attempted to delete everything and reinstall the helm chart. Using tigera-operator for the namespace was not allowed, possibly due to a bug in the helm chart, and I gave up and went back to using the calico namespace.

I do not know where things went wrong after this point. I don't recall if I attempted to helm uninstall or what. But I do remember attempting to delete the namespaces and getting into conditions where a namespace got stuck deleting due to various finalizers. I attempted to search for the various stuck entities and delete them. I did so successfully, and I was ultimately able to get the namespaces to delete. I believe I ran kubectl get ns calico-system -o yaml and it warned me it was stuck deleting service accounts.

Current Symptoms

I can sort of delete and reinstall calico. If I delete calico, even using helm uninstall, things get strangely stuck. The calico and calico-apiserver namespace will delete, but calico-system remains. If I run kubectl delete ns calico-system, that gets stuck due to the finalizers. I can then describe the namespace, where it will warn me about the serviceaccounts. If I delete the finalizers for the serviceaccounts, the calico-system namespace will finally delete. I can then delete the calico namespace.
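For reference, removing a stuck finalizer by hand generally looks something like this (a generic kubectl pattern with placeholder names, not the exact commands I ran - make sure you understand why the finalizer is stuck before removing it):

# clear finalizers on a stuck service account in calico-system (replace <name>)
kubectl patch serviceaccount <name> -n calico-system \
  --type=merge -p '{"metadata":{"finalizers":null}}'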

There are definitely left-over resources on my K8S cluster. I found a tool, kubectl really get all, which helps show me the additional resources. I had a strong suspicion that my installation was corrupt and that the best way forward would be to literally rebuild my entire cluster, but that is really not a good idea operationally. When we have live customers, we cannot be expected to rebuild the entire cluster if Calico has an issue.

I tried to delete all leftover resources to see if I could get a reinstall working.

# List out all resources in your cluster
~/kubectl-really-get-all --all-namespaces > /tmp/tmp

# Identify Calico components
cat /tmp/tmp | grep -iE "(tigera|calico)"

Once I have the list of calico components, I can pipe the output into xargs to delete everything:

# DANGEROUS! DO NOT DO UNLESS YOU KNOW WHAT YOU ARE DOING!
 ... | awk '{print $1}' | xargs -L1 -I input bash -c "kubectl delete input"

This initially kept getting stuck, so I would need to manually modify the items that got stuck and remove the finalizer, such as kubectl edit clusterrole.rbac.authorization.k8s.io/calico-node.

I then verified that everything was completely deleted using kubectl really get all. However, after reinstallation, calico came online but the calico-system namespace, immediately after being created, was stuck in a terminating state.

status:
  conditions:
  - lastTransitionTime: "2023-06-23T16:29:14Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
      complete list of server APIs: projectcalico.org/v3: the server is currently
      unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2023-06-23T16:29:14Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2023-06-23T16:29:14Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2023-06-23T16:29:14Z"
    message: All content successfully removed
    reason: ContentRemoved
    status: "False"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2023-06-23T16:29:14Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

At this point, I'm not sure how to clean my system. I may try it 1 more time, but I may be stuck with VPC + Cluster rebuild.

sig-piskule commented 1 year ago

Updates

I was able to get it to install properly and fix the bug. I attribute it to some sort of race condition between calico-system trying to delete and calico trying to install. I can't say exactly what my order of steps was, other than running helm install and helm uninstall very quickly.

With everything working, I wanted to make sure I got a 'clean' install of calico. So I attempted helm uninstall. This however, results in calico-system being left-over, and the helm uninstall failing due to timeout.

With the partially uninstalled helm chart, I am back to my original problem. I attempted to reinstall the calico helm chart, and the left-over calico-system namespace is now stuck terminating.

I am now at the conclusion that calico can't really be uninstalled properly - or that my initial attempt at uninstalling corrupted the entire system and there is no way back. I will likely try to get it working again, simply by getting everything installed again.

Final Update

I was able to get everything 'working' again by deleting the stuck service account (due to the finalizer) in the calico-system. It was some combination of installing & uninstalling calico w/ helm, and it eventually came up clean.

I was able to upgrade calico from terraform, so I think this hacky way of getting it to work will be OK temporarily.

caseydavenport commented 1 year ago

@sig-piskule thanks for the updates - sounds like you're running up against the known helm uninstall race condition problems, for which I have a PR in progress: https://github.com/tigera/operator/pull/2662

sig-piskule commented 1 year ago

@caseydavenport

Thank you for the update. That's helpful. Since AWS is directly linking to Calico, you might want to update your document with a big red warning that says "Prior to installing Calico, make sure you have correctly configured your YAML". However, my 2 cents is that the EKS configuration doesn't seem to be actually required, since Calico worked the first time.

Unfortunately, I am hitting the issue again. I don't know if something mysteriously got redeployed, or if it just suddenly started happening, or if I was too tired on a Friday to wait for the problem to start happening again. Regardless, here is what I see:

More Diagnostics

The problem E0627 11:56:49.015185 15798 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request mysteriously started happening again, and unfortunately I need to spend more time debugging this. This is critical for me, since I need to be able to install Network Policies using Infrastructure as Code, similar to this blog.

I came across this link, which provided some hints: https://github.com/helm/helm/issues/6361

This provided some interesting details:

$ kubectl get apiservice | grep calico
v1.crd.projectcalico.org               Local                         True                           3d23h
v3.projectcalico.org                   calico-apiserver/calico-api   False (FailedDiscoveryCheck)   3d22h

We can see that the API Service has failed its discovery check. Digging in more:

$ kubectl get apiservice v3.projectcalico.org -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2023-06-23T17:37:10Z"
    message: 'failing or missing response from https://10.2.192.137:5443/apis/projectcalico.org/v3:
      Get "https://10.2.192.137:5443/apis/projectcalico.org/v3": dial tcp 10.2.192.137:5443:
      i/o timeout'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

So now that I know where the issue is occurring, I can begin to actually diagnose this problem. We should check what is going on from the pod end that should be serving the requests:

$ kubectl get pods -n calico-apiserver -o wide
NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
calico-apiserver-86cbf6c7fc-5cj2x   1/1     Running   0          9m24s   10.2.192.250   ip-10-2-192-242.ec2.internal   <none>           <none>
calico-apiserver-86cbf6c7fc-zvl85   1/1     Running   0          9m25s   10.2.192.137   ip-10-2-192-156.ec2.internal   <none>           <none>

I ran an ubuntu bastion pod, and from the pod, I curled the API server:

$ curl -k https://10.2.192.137:5443/apis/projectcalico.org/v3
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/apis/projectcalico.org/v3\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403

This is a different error message than the timeout. This indicates that the server is online and responding, but that we are unauthenticated. As a result, this seems like a control plane issue, and as @caseydavenport mentioned before, I could try with hostNetwork: true.

I attempted to do this with kubectl edit, but the deployment will not update the pods. I cannot edit the pods directly either.

$ cat << EOF > /tmp/patch
spec:
  template:
    spec:
      hostNetwork: true
EOF

$ kubectl patch deployment calico-apiserver -n calico-apiserver --patch-file /tmp/patch 
deployment.apps/calico-apiserver patched

$ kubectl get deployment -n calico-apiserver -o yaml | grep host
                topologyKey: kubernetes.io/hostname

I then considered that perhaps the Tigera Operator was controlling this value. I investigated whether it was possible to modify this from the helm chart, but it does not seem to be possible, since it is not mentioned in the documentation.

We are planning to use the VPC CNI plugin soon; however, it isn't installed yet. Therefore, setting hostNetwork: true does seem related to the problem as indicated here. I am not sure how it might be possible to set this. It is however suggested this is possible here.

At this point I'm a little lost as to how this can be fixed. I am still digging though, so I may post another update. I am posting as much as I am so that this is perhaps helpful to someone else who stumbles upon this.

EDIT I'm pretty sure the Operator controls hostNetwork, and it is impossible to configure this. This code suggests that hostNetwork is only set to true if you are configured to run EKS and Calico CNI. And This code suggests that hostNetwork is false by default and not configurable.

caseydavenport commented 1 year ago

Good sleuthing! Agree with what you've said above.

I'm pretty sure the Operator controls hostNetwork, and it is impossible to configure this. This code suggests that hostNetwork is only set to true if you are configured to run EKS and Calico CNI. And This code suggests that hostNetwork is false by default and not configurable.

This is correct - the tigera operator generally owns that setting and there's no user knob for it at the moment. Generally, this is just set to the right thing based on other config and you shouldn't need to worry about it...... That said, to confirm my understanding - you're seeing that running on EKS with Calico CNI, the operator isn't setting hostNetwork: true on your apiserver pods?

If so, could you share your Installation config so I can confirm all looks OK? (kubectl get installation -o yaml)

Generally the way to make sure that is set correctly is to ensure that the CNI type is Calico and that the kubernetesProvider field is set to EKS.
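In helm values terms, that combination would look roughly like this (a sketch of just the relevant keys; the installation block in the chart values maps onto the Installation resource spec):

cat << EOF > values.yaml
installation:
  kubernetesProvider: EKS
  cni:
    type: Calico
EOF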

sig-piskule commented 1 year ago

That said, to confirm my understanding - you're seeing that running on EKS with Calico CNI, the operator isn't setting hostNetwork: true on your apiserver pods?

No, that's not what I am doing.

Additional Diagnostics

I have tried a few setups:

USECASE 1: EKS, no CNI whatsoever w/ Service Secondary Subnet (during development while preparing usecase 2)
USECASE 2: EKS, with AWS VPC CNI w/ Service Secondary Subnet

My gut is telling me that in order for the API Server to work with this setup, the API Server must be on the hostNetwork, but that there is no actual way to configure it.

In both cases, I get

  - lastTransitionTime: "2023-06-28T19:35:09Z"
    message: 'failing or missing response from https://100.112.25.27:5443/apis/projectcalico.org/v3:
      Get "https://100.112.25.27:5443/apis/projectcalico.org/v3": dial tcp 100.112.25.27:5443:
      i/o timeout'

In both cases, $ kubectl get deployment -n calico-apiserver -o yaml | grep host shows that hostNetwork is not set. I furthermore cannot configure it for additional debugging. I can confirm that the VPC CNI is installed by running:

$ aws eks describe-addon --cluster-name $CLUSTER_NAME --addon-name vpc-cni --query addon.addonVersion --output text $PROFILE
v1.13.2-eksbuild.1

So CNI is definitely installed. I then ran a few commands:

echo '{ installation: {kubernetesProvider: EKS }}' > values.yaml
kubectl create namespace calico
helm install calico projectcalico/tigera-operator --version v3.26.1 -f values.yaml --namespace calico

And checked everything:

calico-apiserver   calico-apiserver-68647c5f95-cfwm6               1/1     Running   0          43m
calico-apiserver   calico-apiserver-68647c5f95-dz4bt               1/1     Running   0          43m
calico-system      calico-kube-controllers-5977f687c9-lm5zc        1/1     Running   0          44m
calico-system      calico-node-56jzf                               1/1     Running   0          44m
calico-system      calico-node-m8wzb                               1/1     Running   0          44m
calico-system      calico-typha-6758886c9-hvnfs                    1/1     Running   0          44m
calico-system      csi-node-driver-9p78n                           2/2     Running   0          44m
calico-system      csi-node-driver-b9xcf                           2/2     Running   0          44m
calico             tigera-operator-959786749-w7w76                 1/1     Running   0          44m

I did notice that I missed step 5 on this new setup, but this did not resolve anything. This may be helpful for others though, so I am reporting it here:

If you're using version 1.9.3 or later of the Amazon VPC CNI plugin for Kubernetes, then enable the plugin to add the Pod IP address to an annotation in the calico-kube-controllers-55c98678-gh6cc Pod spec. For more information about this setting, see ANNOTATE_POD_IP on GitHub.
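For reference, that setting is an environment variable on the aws-node daemonset, so enabling it typically looks like this (a sketch based on the AWS guidance; double-check against the linked docs):

kubectl set env daemonset aws-node -n kube-system ANNOTATE_POD_IP=true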

After all this, the Calico Stars demo only shows a connection from B to F and back, and C is entirely missing.

Important Partial Resolution

I have discovered that AWS is doing some "bad" things (with partial blame on Calico for the Stars demo). In particular, your documentation states that we can manage Calico Resources using kubectl here.

If I try the following test (similar to this document):

$ cat << EOF > sample.yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-tcp-6379
  namespace: production
EOF

I get the following error:

$ kubectl apply -f sample.yaml
error: resource mapping not found for name: "allow-access" namespace: "test" from "sample.yaml": no matches for kind "NetworkPolicy" in version "projectcalico.org/v3"
ensure CRDs are installed first

I have discovered that AWS has circumvented the Calico documentation by using the following API version.

This is actually a Calico file that AWS uses, found here.

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  namespace: client 
  name: allow-ui 
spec:
  podSelector:
    matchLabels: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: management-ui 

As a result, you can have a completely working AWS Demo by following the AWS Guide, but when you try to do more by following the Calico Documentation, you will get stuck due to the API Server. This still doesn't explain why kubectl get pods was failing on my corrupted cluster or how to fix it.

As a result, the AWS Demo doesn't even use the V3 API Server, so it cannot be conclusively determined if it ever worked in the first place.

I might be able to continue with this information, though I am still concerned that my test doesn't show all the nodes connected. That is a possible networking error on my end somewhere. If I figure it out I will post more here.

caseydavenport commented 1 year ago

My gut is telling me that in order for the API Server to work with this setup, the API Server Must be on the hostNetwork, but that there is no actual way to configure it.

USECASE 1: EKS, No CNI whatsoever w/ Service Secondary Subnet (during development while preparing usecase 2)

What do you mean by no CNI whatsoever? A CNI plugin is needed in order for pod networking to function.

USECASE 2: EKS, with AWS VPC CNI w/ Service Secondary Subnet

With the AWS VPC CNI, you do not need hostNetwork: true. So this is expected.

error: resource mapping not found for name: "allow-access" namespace: "test" from "sample.yaml": no matches for kind "NetworkPolicy" in version "projectcalico.org/v3"

This suggests a problem with the Calico API server, which is consistent with the rest of this issue.

apiVersion: networking.k8s.io/v1

This is the upstream Kubernetes NetworkPolicy API.

In general, I wouldn't think of the stars demo as a comprehensive cluster health check - it's only testing a very specific subset of functionality and it's not intended to certify that a cluster is 100% functional.

sig-piskule commented 1 year ago

Hi @caseydavenport , thanks for getting back to me.

What do you mean by no CNI whatsoever? A CNI plugin is needed in order for pod networking to function.

By default, when you create an EKS cluster, you don't have the VPC CNI installed. By default, Pods get IP addresses from the same range as the nodes. Through an additional configuration you can give Services a different CIDR block from the Nodes, but the Pods cannot get a different CIDR block. This leads to IP space exhaustion. Perhaps there is a CNI installed, but it comes by default, and I did nothing to get it there.

So what I mean by USECASE1, is that an EKS cluster without VPC-CNI installed. Whatever is there by default.

With the AWS VPC CNI, you do not need hostNetwork: true. So this is expected.

I might disagree with you on this. I do not have time to diagnose more thoroughly, as our base requirements are satisfied. My point is that it is currently not possible to get the Calico API Server to be functional (to serve projectcalico.org/v3). If you try, you will notice that the service to contact the API server is not accessible. I surmise (I don't know) that hostNetwork: true is needed even with the VPC CNI - only for the purposes of accessing the API server. The other functionality exercised by the Stars demo is there, but it is not possible to get the API Server working.

Regardless of my hypotheses, the API Server is not available after following AWS's installation documentation.

In general, I wouldn't think of the stars demo as a comprehensive cluster health check - it's only testing a very specific subset of functionality and it's not intended to certify that a cluster is 100% functional.

Agreed, and that is somewhat my complaint. I can argue that AWS's documentation results in a cluster that is not 100% functional. AWS furthermore argues that it is functional, via the stars demo.

The entire setup was using the upstream Kubernetes NetworkPolicy API, yet nowhere in the documentation was that made clear. Someone should have said somewhere "By the way, although you installed Calico CNI, it is not 100% functional, and you can't use the Calico API Server, and you can only use Upstream Kubernetes NetworkPolicy API because Calico API does not work".

I wish that point had been made more clearly somewhere. Someone could perhaps argue that "Well, you should have read the YAML they gave you", but I don't think that's fair given how much Calico documentation I had read. I think it is fair to say that if I install Calico, the Calico documentation should work.

I think this is a genuine issue I found-- and explains some of the additional comments above.

mathe-matician commented 1 year ago

Hitting this as well when trying to use apiVersion: projectcalico.org/v3 for my NetworkPolicy objects. Regular apiVersion: networking.k8s.io/v1 NetworkPolicies still work, though. In my existing EKS cluster:

Steps

cat << EOF > append.yaml
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - patch
EOF

I try to see if I can use the calico-api with the Helm chart's installed apiserver, but continue to get:

E0803 15:04:09.442090   50067 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 15:04:09.486033   50067 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 15:04:09.531374   50067 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request

I delete the default apiserver as others have mentioned here and try to recreate it with the Operator Install or the Manifest Install methods mentioned here: https://docs.tigera.io/calico/3.25/operations/install-apiserver.
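For the operator-based method, recreating it essentially comes down to creating the APIServer custom resource named default again (a sketch matching the linked doc's operator install; verify against that page):

kubectl create -f - <<EOF
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
EOF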

Once the installation is complete via either method, I then try kubectl api-resources | grep '\sprojectcalico.org', but still get:

E0803 14:59:07.147463   48902 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 14:59:07.222069   48902 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
error: unable to retrieve the complete list of server APIs: projectcalico.org/v3: the server is currently unable to handle the request

Logs / Other

Here are some possibly relevant logs from the tigera-operator after using the Operator Install method:

{"level":"info","ts":1691095543.5946114,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"tigera-operator","Request.Name":"calico-apiserver-certs"}
{"level":"error","ts":1691095543.798494,"msg":"Reconciler error","controller":"apiserver-controller","object":{"name":"calico-apiserver-certs","namespace":"tigera-operator"},"namespace":"tigera-operator","name":"calico-apiserver-certs","reconcileID":"a1f3fc0f-ac79-44ef-8da9-aa34b9b4b91a","error":"Operation cannot be fulfilled on apiservers.operator.tigera.io \"default\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234"}
{"level":"info","ts":1691095543.7985697,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1691095544.0377092,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"tigera-operator","Request.Name":"calico-apiserver-certs"}
{"level":"info","ts":1691095546.0342875,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095546.050731,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095546.3223627,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095551.033454,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095556.035781,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095556.0412443,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095556.0576544,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095556.27758,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095576.3226435,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095591.0325649,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095591.0371556,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095591.2886333,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095611.03601,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095611.3475456,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095621.289486,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}

Calico Installation object output:

E0803 16:14:27.786427   63214 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:14:27.838435   63214 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:14:27.897089   63214 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  annotations:
    meta.helm.sh/release-name: calico
    meta.helm.sh/release-namespace: tigera-operator
  creationTimestamp: "2023-08-03T19:52:47Z"
  finalizers:
  - tigera.io/operator-cleanup
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  name: default
  resourceVersion: "16370097"
  uid: 4b7ceed0-61a0-4451-af48-5bf3fff7a98b
spec:
  calicoNetwork:
    bgp: Disabled
    linuxDataplane: Iptables
  cni:
    ipam:
      type: AmazonVPC
    type: AmazonVPC
  controlPlaneReplicas: 2
  flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
  imagePullSecrets: []
  kubeletVolumePluginPath: /var/lib/kubelet
  kubernetesProvider: EKS
  nodeUpdateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  nonPrivileged: Disabled
  variant: Calico
status:
  calicoVersion: v3.25.1
  computed:
    calicoNetwork:
      bgp: Disabled
      linuxDataplane: Iptables
    cni:
      ipam:
        type: AmazonVPC
      type: AmazonVPC
    controlPlaneReplicas: 2
    flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
    kubeletVolumePluginPath: /var/lib/kubelet
    kubernetesProvider: EKS
    nodeUpdateStrategy:
      rollingUpdate:
        maxUnavailable: 1
      type: RollingUpdate
    nonPrivileged: Disabled
    variant: Calico
  conditions:
  - lastTransitionTime: "2023-08-03T21:13:51Z"
    message: All Objects Available
    observedGeneration: 2
    reason: AllObjectsAvailable
    status: "False"
    type: Progressing
  - lastTransitionTime: "2023-08-03T21:13:51Z"
    message: All Objects Available
    observedGeneration: 2
    reason: AllObjectsAvailable
    status: "False"
    type: Degraded
  - lastTransitionTime: "2023-08-03T21:13:51Z"
    message: All objects available
    observedGeneration: 2
    reason: AllObjectsAvailable
    status: "True"
    type: Ready
  mtu: 9001
  variant: Calico

As @sig-piskule mentions above, I see the failed discovery check in kubectl describe apiservice/v3.projectcalico.org:

E0803 16:08:17.297531   62928 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:08:17.351823   62928 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:08:17.403059   62928 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Name:         v3.projectcalico.org
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-08-03T21:03:10Z
  Owner References:
    API Version:           operator.tigera.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  APIServer
    Name:                  default
    UID:                   9fe456a6-e9ec-4d2c-a97f-16193da57fe2
  Resource Version:        16368044
  UID:                     5a2d4b60-8814-465f-bb45-3dcbf0973b33
Spec:
  Ca Bundle:               <redacted>
  Group:                   projectcalico.org
  Group Priority Minimum:  1500
  Service:
    Name:            calico-api
    Namespace:       calico-apiserver
    Port:            443
  Version:           v3
  Version Priority:  200
Status:
  Conditions:
    Last Transition Time:  2023-08-03T21:03:10Z
    Message:               failing or missing response from https://10.1.150.94:5443/apis/projectcalico.org/v3: Get "https://10.1.150.94:5443/apis/projectcalico.org/v3": dial tcp 10.1.150.94:5443: i/o timeout
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>
quulah commented 1 year ago

Also hitting this on EKS, without the VPC CNI addon. We're using Calico for the CNI.

This showed up in a Calico upgrade, where we also jumped from manifests to the Tigera Operator Helm Chart.

hostNetwork: true is being set accordingly.

Earlier I deleted the API server, let it be installed again, and the problem was gone for a while. Now it seems to have come back.

I'm not entirely sure yet, but it also seems to be blocking the deletion of namespaces with ArgoCD. Or at least we get a NamespaceDeletionDiscoveryFailure in the conditions, with the error in the message field.

Other than that, it's fairly annoying as it spams many lines whenever you run kubectl.

kenwjiang commented 1 year ago

Still getting this same issue with VPC CNI + tigera-operator helm chart installation. Assuming the fix is just "not to use v3.projectcalico.org API objects"?

cucker0 commented 1 year ago

Rebuild the calico-apiserver and calico-kube-controllers pods:

kubectl -n calico-apiserver delete pod/calico-apiserver-xx

kubectl -n calico-system delete pod calico-kube-controllers-xx
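
If you don't want to look up the generated pod names, deleting by label also works (the selectors below are assumptions based on a default operator install; verify with kubectl get pods --show-labels first):

kubectl -n calico-apiserver delete pod -l k8s-app=calico-apiserver
kubectl -n calico-system delete pod -l k8s-app=calico-kube-controllers
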
kenwjiang commented 1 year ago

Those of you who are still running into this issue & using VPC: check your routing tables / security groups and see if the apiserver ports are allowed. We got this running by allowing the connection ports.

headyj commented 1 year ago

@kenwjiang could you please elaborate on that? Because I'm having this issue using EKS with VPC. Did you enable the port on a specific security group?

kenwjiang commented 1 year ago

@headyj I added this security rule to the EKS cluster:

node_security_group_additional_rules = {
  # calico-apiserver
  ingress_cluster_5443_webhook = {
    description                   = "Cluster API to node 5443/tcp webhook"
    protocol                      = "tcp"
    from_port                     = 5443
    to_port                       = 5443
    type                          = "ingress"
    source_cluster_security_group = true
  }
}
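
(Port 5443/tcp here matches the calico-apiserver target shown in the FailedDiscoveryCheck earlier in this thread, e.g. https://10.1.150.94:5443/apis/projectcalico.org/v3, so the EKS control-plane security group needs to be allowed to reach the nodes on that port.)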
diranged commented 11 months ago

I'm not entirely sure yet, but it also seems to be blocking the deletion of namespaces with ArgoCD. Or at least we get a NamespaceDeletionDiscoveryFailure in the conditions, with the error in the message field.

We're seeing this problem suddenly when we upgrade Tigera Operator from v3.26.4 -> v3.27.0. When we delete the operator and then try to delete the Namespace, we get stuck on the Kubernetes finalizer throwing this error:

  - lastTransitionTime: "2023-12-19T04:08:27Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
      complete list of server APIs: projectcalico.org/v3: stale GroupVersion discovery:
      projectcalico.org/v3'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure

This is reproducible ... every time we install into an integration test cluster, we cannot purge the namespace.

caseydavenport commented 11 months ago

That error sounds like something is attempting to look up projectcalico.org/v3 resources in order to GC them, but the apiserver is not responding. Can you check whether the Calico apiserver is in fact running and healthy on this cluster?
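
For reference, a quick way to check that (assuming the default operator install, which runs the deployment in the calico-apiserver namespace):

kubectl get tigerastatus apiserver
kubectl -n calico-apiserver get pods -o wide
kubectl get apiservice v3.projectcalico.org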

diranged commented 11 months ago

That error sounds like something is attempting to look up projectcalico.org/v3 resources in order to GC them, but the apiserver is not responding. Can you check whether the Calico apiserver is in fact running and healthy on this cluster?

The thing is - this is happening when we're deleting the Calico resources. This is a new behavior too - it did not happen in 3.26.

caseydavenport commented 11 months ago

When we delete the operator and then try to delete the Namespace, we get stuck on the Kubernetes finalizer throwing this error:

Could you provide more concretely the steps you're taking here? What steps do you take to delete the operator? Are you deleting the CRDs within tigera-operator.yaml as well?

this is happening when we're deleting the Calico resources

Which Calico resources?

I'd be curious about the output from the following commands as well, captured while encountering the error:
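
(The exact command list from the original comment isn't reproduced here. The usual diagnostics for this kind of aggregated-API failure, offered as suggestions rather than the intended list, would be something like:)

kubectl get apiservice v3.projectcalico.org -o yaml
kubectl -n tigera-operator logs deployment/tigera-operator --tail=100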

Donkey1022 commented 9 months ago

Reason: the public cloud underlay does not support Calico's unencapsulated routing mode, so cross-host traffic to pod addresses is unreachable (the underlying network is implemented with flow tables).

Method: kubectl edit ippools.crd.projectcalico.org default-ipv4-ippool and set ipipMode: Always or vxlanMode: Always.
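
An equivalent one-liner (a sketch; it assumes the default-ipv4-ippool name used above and that VXLAN is the encapsulation you want):

kubectl patch ippools.crd.projectcalico.org default-ipv4-ippool --type merge \
  -p '{"spec":{"vxlanMode":"Always","ipipMode":"Never"}}'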

greghall76 commented 6 months ago

I am seeing this as well running a kubeadm cluster (1.26.15) on AWS w/ NO VPC CNI.

I just upgraded operator to 1.32.7 and calico cni to 3.27.3.

My Installation is running on subnets w/ security groups that pass all traffic originating on the SG to support VXLAN.

calicoNetwork:
  bgp: Enabled
  hostPorts: Enabled
  ipPools:
    - blockSize: 26
      cidr: 10.200.0.0/24
      disableBGPExport: false
      encapsulation: VXLAN
      natOutgoing: Enabled
      nodeSelector: all()

... This much is good, I believe:

kubectl api-resources | grep calico

bgpconfigurations                                                                 crd.projectcalico.org/v1                    false        BGPConfiguration
bgpfilters                                                                        crd.projectcalico.org/v1                    false        BGPFilter
bgppeers                                                                          crd.projectcalico.org/v1                    false        BGPPeer
blockaffinities                                                                   crd.projectcalico.org/v1                    false        BlockAffinity
caliconodestatuses                                                                crd.projectcalico.org/v1                    false        CalicoNodeStatus
clusterinformations                                                               crd.projectcalico.org/v1                    false        ClusterInformation
felixconfigurations                                                               crd.projectcalico.org/v1                    false        FelixConfiguration
globalnetworkpolicies                                                             crd.projectcalico.org/v1                    false        GlobalNetworkPolicy
globalnetworksets                                                                 crd.projectcalico.org/v1                    false        GlobalNetworkSet
hostendpoints                                                                     crd.projectcalico.org/v1                    false        HostEndpoint
ipamblocks                                                                        crd.projectcalico.org/v1                    false        IPAMBlock
ipamconfigs                                                                       crd.projectcalico.org/v1                    false        IPAMConfig
ipamhandles                                                                       crd.projectcalico.org/v1                    false        IPAMHandle
ippools                                                                           crd.projectcalico.org/v1                    false        IPPool
ipreservations                                                                    crd.projectcalico.org/v1                    false        IPReservation
kubecontrollersconfigurations                                                     crd.projectcalico.org/v1                    false        KubeControllersConfiguration
networkpolicies                                                                   crd.projectcalico.org/v1                    true         NetworkPolicy
networksets                                                                       crd.projectcalico.org/v1                    true         NetworkSet
bgpconfigurations                 bgpconfig,bgpconfigs                            projectcalico.org/v3                        false        BGPConfiguration
bgpfilters                                                                        projectcalico.org/v3                        false        BGPFilter
bgppeers                                                                          projectcalico.org/v3                        false        BGPPeer
blockaffinities                   blockaffinity,affinity,affinities               projectcalico.org/v3                        false        BlockAffinity
caliconodestatuses                caliconodestatus                                projectcalico.org/v3                        false        CalicoNodeStatus
clusterinformations               clusterinfo                                     projectcalico.org/v3                        false        ClusterInformation
felixconfigurations               felixconfig,felixconfigs                        projectcalico.org/v3                        false        FelixConfiguration
globalnetworkpolicies             gnp,cgnp,calicoglobalnetworkpolicies            projectcalico.org/v3                        false        GlobalNetworkPolicy
globalnetworksets                                                                 projectcalico.org/v3                        false        GlobalNetworkSet
hostendpoints                     hep,heps                                        projectcalico.org/v3                        false        HostEndpoint
ipamconfigurations                ipamconfig                                      projectcalico.org/v3                        false        IPAMConfiguration
ippools                                                                           projectcalico.org/v3                        false        IPPool
ipreservations                                                                    projectcalico.org/v3                        false        IPReservation
kubecontrollersconfigurations                                                     projectcalico.org/v3                        false        KubeControllersConfiguration
networkpolicies                   cnp,caliconetworkpolicy,caliconetworkpolicies   projectcalico.org/v3                        true         NetworkPolicy
networksets                       netsets                                         projectcalico.org/v3                        true         NetworkSet
profiles                                                                          projectcalico.org/v3                        false        Profile
ghall.fc2dev@U-18QJ8WMEL0ADJ:~/prj/forgec2/omni/blueprint/ansible$ kubectl get apiservices | grep calico
v1.crd.projectcalico.org                  Local                                   True        40h
v3.projectcalico.org                        calico-apiserver/calico-api   True        233d

I believe this log line from the kube-apiserver is relevant. The IP address of the called service is my calico-apiserver, so there is a timeout when the Kubernetes API server tries to reach the Calico API service.

1 available_controller.go:456] v3.projectcalico.org failed with: failing or missing response from https://10.96.17.213:443/apis/projectcalico.org/v3: Get "https://10.96.17.213:443/apis/projectcalico.org/v3": dial tcp 10.96.17.213:443: i/o timeout
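
A quick way to confirm the timeout from a control-plane node (a sketch; 10.96.17.213 is the calico-api ClusterIP from the log line above, and it assumes curl is available on the host):

curl -vk --connect-timeout 5 https://10.96.17.213:443/apis/projectcalico.org/v3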