lucasscheepers opened this issue 1 year ago
Do I understand it correctly that these crd.projectcalico.org/v1 CRDs are still needed - so I should not delete them - and that I need to manually install the v3 CRDs? If so, where can I download these v3 CRDs? I can't find them anywhere.
The v1 resources are CRDs and should be present - definitely don't delete those.
The v3 resources are not CRDs - they are implemented by the calico-apiserver pod in the calico-apiserver namespace.
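A quick way to see the difference on a working cluster (a rough sketch; output will vary):
# CRD-backed v1 resources, stored directly in etcd:
kubectl get crds | grep projectcalico.org
# Aggregated v3 API, served by the calico-apiserver pods:
kubectl get apiservice v3.projectcalico.org
kubectl api-resources --api-group=projectcalico.org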
the server is currently unable to handle the request
This suggests a problem with the Calico API server, or a problem with the kube-apiserver being unable to communicate with the Calico API server. I'd:
kubectl describe apiservice
to see if there's any breadcrumbs to follow there.
kubectl logs --tail=-1 -n calico-apiserver -l k8s-app=calico-apiserver
to get the full API server logs.
kubectl describe tigerastatus apiserver
for potentially further breadcrumbs.
The only apiservice that has a status of False is v3.projectcalico.org, which has the following error message: failing or missing response from https://***:443/apis/projectcalico.org/v3: Get "https://***:443/apis/projectcalico.org/v3": context deadline exceeded
➜ ~ kubectl describe apiservice v3.projectcalico.org
Name: v3.projectcalico.org
Namespace:
Labels: <none>
Annotations: <none>
API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2023-04-06T12:54:20Z
Managed Fields:
API Version: apiregistration.k8s.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:ownerReferences:
.:
k:{"uid":"***"}:
f:spec:
f:caBundle:
f:group:
f:groupPriorityMinimum:
f:service:
.:
f:name:
f:namespace:
f:port:
f:version:
f:versionPriority:
Manager: operator
Operation: Update
Time: 2023-04-06T12:54:20Z
API Version: apiregistration.k8s.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
f:conditions:
.:
k:{"type":"Available"}:
.:
f:lastTransitionTime:
f:message:
f:reason:
f:status:
f:type:
Manager: kube-apiserver
Operation: Update
Subresource: status
Time: 2023-04-18T12:35:03Z
Owner References:
API Version: operator.tigera.io/v1
Block Owner Deletion: true
Controller: true
Kind: APIServer
Name: default
UID: ***
Resource Version: ***
UID: ***
Spec:
Ca Bundle: ***
Group: projectcalico.org
Group Priority Minimum: 1500
Service:
Name: calico-api
Namespace: calico-apiserver
Port: 443
Version: v3
Version Priority: 200
Status:
Conditions:
Last Transition Time: 2023-04-06T12:54:20Z
Message: failing or missing response from https://10.107.208.239:443/apis/projectcalico.org/v3: Get "https://10.107.208.239:443/apis/projectcalico.org/v3": dial tcp 10.107.208.239:443: i/o timeout
Reason: FailedDiscoveryCheck
Status: False
Type: Available
Events: <none>
The logs of the calico-apiserver look like this:
➜ ~ kubectl logs --tail=-1 -n calico-apiserver -l k8s-app=calico-apiserver
Version: v3.25.1
Build date: 2023-03-30T23:52:23+0000
Git tag ref: v3.25.1
Git commit: 82dadbce1
I0413 15:15:19.483989 1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0413 15:15:19.484036 1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0413 15:15:19.604542 1 run_server.go:69] Running the API server
I0413 15:15:19.604578 1 run_server.go:58] Starting watch extension
W0413 15:15:19.606431 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0413 15:15:19.630055 1 secure_serving.go:210] Serving securely on [::]:5443
I0413 15:15:19.630147 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0413 15:15:19.630257 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0413 15:15:19.630679 1 run_server.go:80] apiserver is ready.
I0413 15:15:19.631104 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0413 15:15:19.631114 1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0413 15:15:19.631204 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0413 15:15:19.631212 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:15:19.631282 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0413 15:15:19.631290 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:15:19.732007 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:15:19.732076 1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0413 15:15:19.732510 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
Version: v3.25.1
Build date: 2023-03-30T23:52:23+0000
Git tag ref: v3.25.1
Git commit: 82dadbce1
I0413 15:15:45.802642 1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0413 15:15:45.802806 1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0413 15:15:45.871553 1 run_server.go:58] Starting watch extension
I0413 15:15:45.871726 1 run_server.go:69] Running the API server
W0413 15:15:45.872885 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0413 15:15:45.885723 1 secure_serving.go:210] Serving securely on [::]:5443
I0413 15:15:45.886356 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0413 15:15:45.886370 1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0413 15:15:45.886523 1 run_server.go:80] apiserver is ready.
I0413 15:15:45.886549 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0413 15:15:45.886667 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0413 15:15:45.888123 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0413 15:15:45.888133 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0413 15:15:45.888363 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0413 15:15:45.888375 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:15:45.986627 1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0413 15:15:45.988477 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0413 15:15:45.988829 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
And the tigerastatus apiserver looks like this:
➜ ~ kubectl describe tigerastatus apiserver
Name: apiserver
Namespace:
Labels: <none>
Annotations: <none>
API Version: operator.tigera.io/v1
Kind: TigeraStatus
Metadata:
Creation Timestamp: 2023-03-24T16:01:19Z
Generation: 1
Managed Fields:
API Version: operator.tigera.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
Manager: operator
Operation: Update
Time: 2023-03-24T16:01:19Z
API Version: operator.tigera.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
Manager: operator
Operation: Update
Subresource: status
Time: 2023-04-13T15:15:56Z
Resource Version: ***
UID: ***
Spec:
Status:
Conditions:
Last Transition Time: 2023-04-06T12:54:24Z
Message: All Objects Available
Observed Generation: 1
Reason: AllObjectsAvailable
Status: False
Type: Degraded
Last Transition Time: 2023-04-13T15:15:56Z
Message: All objects available
Observed Generation: 1
Reason: AllObjectsAvailable
Status: True
Type: Available
Last Transition Time: 2023-04-13T15:15:56Z
Message: All Objects Available
Observed Generation: 1
Reason: AllObjectsAvailable
Status: False
Type: Progressing
Events: <none>
@caseydavenport Can you maybe point me in the correct direction with this information?
I am also hitting this issue; my cluster is based on OpenStack VMs.
@lucasscheepers I was running into the same issue and was able to get around it by following the Manifest Install directions here: https://docs.tigera.io/calico/latest/operations/install-apiserver
Specifically the patch command fixed the issue:
kubectl patch apiservice v3.projectcalico.org -p "{\"spec\": {\"caBundle\": \"$(kubectl get secret -n calico-apiserver calico-apiserver-certs -o go-template='{{ index .data "apiserver.crt" }}')\"}}"
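If the patch takes effect, the aggregated API should become available shortly afterwards; a rough way to verify (exact output will vary):
kubectl get apiservice v3.projectcalico.org
# AVAILABLE should flip to True; once it does, the v3 resources should list cleanly:
kubectl api-resources --api-group=projectcalico.org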
Getting the same issue while trying to install the Prometheus operator.
❯ k get apiservices.apiregistration.k8s.io -A
NAME SERVICE AVAILABLE AGE
v2.autoscaling Local True 59d
v2beta1.helm.toolkit.fluxcd.io Local True 11d
v2beta2.autoscaling Local True 59d
v3.projectcalico.org calico-apiserver/calico-api False (FailedDiscoveryCheck) 6m38s
❯ k get apiservices.apiregistration.k8s.io v3.projectcalico.org -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
creationTimestamp: "2023-05-08T00:17:15Z"
name: v3.projectcalico.org
ownerReferences:
- apiVersion: operator.tigera.io/v1
blockOwnerDeletion: true
controller: true
kind: APIServer
name: default
uid: 34d2ccfa-07e2-4ec8-82b1-25a3e9e3be73
resourceVersion: "25407370"
uid: 830669ef-f81a-4e2d-9765-e3e2066f8f33
spec:
caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR5akNDQXJLZ0F3SUJBZ0lJVjgwSlNBRkFKWkF3RFFZSktvWklodmNOQVFFTEJRQXdJVEVmTUIwR0ExVUUKQXhNV2RHbG5aWEpoTFc5d1pYSmhkRzl5TFhOcFoyNWxjakFlRncweU16QTFNRGN3TVRNMU5EVmF
group: projectcalico.org
groupPriorityMinimum: 1500
service:
name: calico-api
namespace: calico-apiserver
port: 443
version: v3
versionPriority: 200
status:
conditions:
- lastTransitionTime: "2023-05-08T00:17:15Z"
message: 'failing or missing response from https://10.20.3.132:5443/apis/projectcalico.org/v3:
Get "https://10.20.3.132:5443/apis/projectcalico.org/v3": dial tcp 10.20.3.132:5443:
i/o timeout'
reason: FailedDiscoveryCheck
status: "False"
type: Available
Resolved the Prometheus operator issue by using a newer Helm release version, but I am still getting:
❯ k get apiservices.apiregistration.k8s.io v3.projectcalico.org -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
creationTimestamp: "2023-05-11T19:04:37Z"
name: v3.projectcalico.org
ownerReferences:
group: projectcalico.org
groupPriorityMinimum: 1500
service:
name: calico-api
namespace: calico-apiserver
port: 443
version: v3
versionPriority: 200
status:
conditions:
- lastTransitionTime: "2023-05-11T19:04:37Z"
message: 'failing or missing response from https://10.10.101.253:5443/apis/projectcalico.org/v3:
Get "https://10.10.101.253:5443/apis/projectcalico.org/v3": dial tcp 10.10.101.253:5443:
i/o timeout'
reason: FailedDiscoveryCheck
status: "False"
type: Available
@lucasscheepers did you manage to resolve it? Namespace deletions get stuck in terminating state due to this apiservice not being responsive. Any way of turning it off?
We are faced with same problem when installing tigera-operator helm chart with APIServer enabled on EKS cluster.
We are faced with same problem when installing tigera-operator helm chart with APIServer enabled on EKS cluster.
I was also facing the same issue, and fixed it for the time being by running
kubectl delete apiserver default
Based on https://docs.tigera.io/calico/latest/operations/install-apiserver#uninstall-the-calico-api-server
Since we are using the default Calico Helm chart based install, I think the apiserver was getting created but perhaps not configured properly. And since I doubt we have a need to update the Calico settings from kubectl as part of our use-case, I think it is best to delete it for now. I will also try to find some Helm value in the tigera-operator to disable this from the start if possible.
PS: I am new to Calico, and please let me know if this is "unsafe" to remove, although the documentation above does not seem to suggest so.
EDIT: It is easy to disable the apiServer with the Helm values:
apiServer:
enabled: true # Change to false
Also, it seems it is not so important after all (https://docs.tigera.io/calico/latest/reference/architecture/overview#calico-api-server). The component architecture says it is only needed to manage Calico with kubectl, and I think that would logically mean it is not used from "within" the cluster.
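For reference, and assuming the chart was installed as a release named calico in the tigera-operator namespace (adjust both to your setup), turning it off looks roughly like:
cat << EOF > values.yaml
apiServer:
  enabled: false
EOF
helm upgrade calico projectcalico/tigera-operator --namespace tigera-operator -f values.yaml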
If you uninstall the apiserver, then you won't be able to install Calico network policies with a Helm chart. Working around the problem doesn't solve the issue.
For some reason, the calico-apiserver pod is failing its liveness probes because the apiserver is not starting correctly or something else is broken, and because of that the apiservice is getting reported as FailedDiscoveryCheck. I tried to play around with the deployment and other things but wasn't able to achieve anything. Is there any way to enable debug logs for the apiserver?
I also saw that the CSI node driver for Calico was failing with the following error:
kubectl logs -f -n calico-system csi-node-driver-pzwsl -c csi-node-driver-registrar
/usr/local/bin/node-driver-registrar: error while loading shared libraries: libresolv.so.2: cannot open shared object file: No such file or directory
If you are uninstalling apiserver, then you won't be able to install networkpolicies with helm chart.
I am a bit confused and might be missing something here, but I think that, granted, without the Calico API server you will not be able to use the projectcalico.org/v3 apiVersion, yet you should still be able to use networking.k8s.io/v1 for the NetworkPolicy resource? I don't know if there is a major difference between the two, and a quick search says I can still use the latter in a Kubernetes cluster with the Calico CNI installed.
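For example, a minimal policy using the upstream API, which does not need the Calico API server at all (the name and namespace here are illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: test
spec:
  podSelector: {}
  policyTypes:
    - Ingress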
Working around the problem doesn't solve the issue.
Yeah, definitely the issue has to be addressed, I just wanted it to stop shouting the error a dozen times every time I try to deploy something into my EKS with helm.
The whole point of installing this operator is being able to use projectcalico.org/v3 resources.
Extends Kubernetes network policy
Calico network policy provides a richer set of policy capabilities than Kubernetes including: policy ordering/priority, deny rules, and more flexible match rules. While Kubernetes network policy applies only to pods, Calico network policy can be applied to multiple types of endpoints including pods, VMs, and host interfaces. Finally, when used with Istio service mesh, Calico network policy supports securing applications layers 5-7 match criteria, and cryptographic identity.
The whole point of installing this operator is being able to use projectcalico.org/v3 resources.
Ah, for us the whole point was to replace the default AWS EKS VPC CNI with the Calico CNI, and be able to deploy more pods per node and save IPs from the VPC CIDR allocated to us - since the former gives all the pods IPs from this range and also limits the number of pods per node based on node size. For us the Calico installation using Helm from the official documentation (which installs the operator) introduced this apiServer and the related errors.
So I guess the solution is valid if you just want the CNI and are fine being limited to the K8s NetworkPolicy!
Same here, running it with Tigera on EKS. Uninstalled the API server and that resolved it. We will see what the consequences are. According to the docs it should only block Tigera CLI stuff...
@xpuska513 this seems like an issue with csi-node-driver-registrar, could you share more details of your setup (versions, etc)?
kubectl logs -f -n calico-system csi-node-driver-pzwsl -c csi-node-driver-registrar
/usr/local/bin/node-driver-registrar: error while loading shared libraries: libresolv.so.2: cannot open shared object file: No such file or directory
Everyone else (if you're still able to reproduce this issue), could you post kubectl logs for the Calico apiserver pod(s)?
@coutinhop Deployed on EKS using the official Helm chart - Calico version 3.25, k8s version 1.22, VPC CNI 1.12.6. Let me try to deploy it again on a fresh cluster tomorrow and I can gather more info. Are there any logs (from pods) that would be useful for me to share with you?
Small observation from me: it works fine on older k8s (1.21.4) and an older VPC CNI release (1.8.x).
Everyone else (if you're still able to reproduce this issue), could you post kubectl logs for the Calico apiserver pod(s)?
$ kubectl logs calico-apiserver-7fb88d684f-fh5x7 -n calico-apiserver
E0517 13:12:47.384967 36471 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0517 13:12:47.387141 36471 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0517 13:12:47.389521 36471 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Version: v3.25.1
Build date: 2023-03-30T23:52:23+0000
Git tag ref: v3.25.1
Git commit: 82dadbce1
I0517 12:43:22.874404 1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0517 12:43:22.874578 1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0517 12:43:22.955555 1 run_server.go:69] Running the API server
I0517 12:43:22.970749 1 run_server.go:58] Starting watch extension
W0517 12:43:22.970895 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0517 12:43:22.976511 1 secure_serving.go:210] Serving securely on [::]:5443
I0517 12:43:22.976690 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0517 12:43:22.976781 1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0517 12:43:22.976884 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0517 12:43:22.977142 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0517 12:43:22.977618 1 run_server.go:80] apiserver is ready.
I0517 12:43:22.977748 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0517 12:43:22.977829 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0517 12:43:22.977901 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0517 12:43:22.977966 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0517 12:43:23.077860 1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0517 12:43:23.078006 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0517 12:43:23.078083 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
k -ncalico-apiserver logs calico-apiserver-8757dcdf8-4z79m
E0522 12:27:06.023957 1742797 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0522 12:27:06.024770 1742797 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0522 12:27:06.026818 1742797 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Error from server: Get "https://138.96.245.50:10250/containerLogs/calico-apiserver/calico-apiserver-8757dcdf8-4z79m/calico-apiserver": tunnel closed
For those of you running on EKS, can you confirm that the calico-apiserver is running with hostNetwork: true set?
The Kubernetes API server needs to establish connection with the Calico API server, and on EKS the Kubernetes API server runs in a separate Amazon managed VPC, meaning it doesn't have routing access to pod IPs (just host IPs). As such, the Calico API server needs to run with host networking. The tigera-operator should do this automatically for you, but I'd like to double check in case something isn't detecting this correctly.
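A quick way to check (the field is simply absent from the pod spec when host networking is off):
kubectl get deployment calico-apiserver -n calico-apiserver -o jsonpath='{.spec.template.spec.hostNetwork}'
# prints "true" when host networking is enabled, and nothing otherwise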
Everyone else (if you're still able to reproduce this issue), could you post kubectl logs for the Calico apiserver pod(s)?
$ kubectl logs calico-apiserver-7fb88d684f-fh5x7 -n calico-apiserver
E0517 13:12:47.384967 36471 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
For me this was some sort of firewall issue. I configured the firewall to be more permissive and I don't see this issue anymore.
@caseydavenport for me on EKS it wasn't running on the host network for some reason.
@xpuska513 interesting. Could you share how you installed Calico / the apiserver on this cluster?
@caseydavenport I followed this guide https://docs.aws.amazon.com/eks/latest/userguide/calico.html which referenced this doc for the tigera-operator deployment https://docs.tigera.io/calico/latest/getting-started/kubernetes/helm#install-calico - basically deployed the helm chart with kubernetesProvider set to EKS, and I also assume that it auto-deploys the apiserver out of the box when you deploy the operator using the helm chart.
Yep, gotcha. That is the correct guide to follow. I realize now that if you are using the EKS VPC CNI plugin, then it is OK to have the apiserver running with hostNetwork: false, so that's unlikely to be the problem here.
Deploying today:
kubectl -n calico-apiserver logs deployment.apps/calico-apiserver
E0526 12:39:07.647072 164182 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0526 12:39:07.737485 164182 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0526 12:39:07.789410 164182 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Found 2 pods, using pod/calico-apiserver-59dcddc4d5-d4sfp
Version: v3.25.1
Build date: 2023-03-30T23:52:23+0000
Git tag ref: v3.25.1
Git commit: 82dadbce1
I0526 10:06:19.554880 1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0526 10:06:19.554927 1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0526 10:06:19.643486 1 run_server.go:69] Running the API server
I0526 10:06:19.643504 1 run_server.go:58] Starting watch extension
W0526 10:06:19.643526 1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0526 10:06:19.660286 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0526 10:06:19.660394 1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0526 10:06:19.660287 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0526 10:06:19.660425 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0526 10:06:19.660318 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0526 10:06:19.660592 1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0526 10:06:19.660670 1 secure_serving.go:210] Serving securely on [::]:5443
I0526 10:06:19.660521 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0526 10:06:19.660736 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0526 10:06:19.661221 1 run_server.go:80] apiserver is ready.
I0526 10:06:19.761160 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0526 10:06:19.761194 1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0526 10:06:19.761256 1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I followed the guide linked above by @xpuska513 too, and hostNetwork is false (well, not set). I manually installed the CRDs because I was getting errors about the annotation length exceeding the maximum.
because I was getting errors about the annotation length exceeding the maximum.
FWIW, this usually comes from using kubectl apply, since kubectl adds the annotation. You should be able to do kubectl create and kubectl replace instead in order to avoid that, and that's what we currently recommend.
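In other words, something along these lines, where calico-crds.yaml stands in for wherever your CRD manifest lives:
kubectl create -f calico-crds.yaml
# for later updates, replace rather than apply, so kubectl doesn't add the
# last-applied-configuration annotation that exceeds the size limit
kubectl replace -f calico-crds.yaml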
I am having this issue as well.
I began by installing Calico on an AWS EKS cluster. I used the helm chart with v3.25.1. I skipped the Customize Helm Chart section of the documentation, because the AWS documentation (https://docs.aws.amazon.com/eks/latest/userguide/calico.html#calico-install) brings you directly to this section, so I did not bother reading the previous sections. As a result, my initial installation was without installation.kubernetesProvider: EKS.
My initial installation succeeded, and I was about to proceed with the remainder of the EKS documentation. However, I noticed that I did not like my choice of name for the helm installation. I chose the namespace calico instead of tigera-operator, and I wanted to match the AWS documentation. As a result, I attempted to delete everything and reinstall the helm chart. Using tigera-operator for the namespace was not allowed, possibly due to a bug in the helm chart, and I gave up and went back to using the calico namespace.
I do not know where things went foul after this section. I don't recall if I attempted to helm uninstall or what. But I do remember attempting to delete the namespaces, and getting into conditions where the namespace got stuck deleting due to various finalizers. I attempted to search for the various stuck entities and delete them. I did so successfully, and I was ultimately able to get the namespaces to delete. I believe I ran kubectl get ns calico-system -o yaml and it warned me it was stuck deleting service accounts.
I can sort of delete and reinstall calico. If I delete calico, even using helm uninstall, things get strangely stuck. The calico and calico-apiserver namespaces will delete, but calico-system remains. If I run kubectl delete ns calico-system, that gets stuck due to the finalizers. I can then describe the namespace, where it will warn me about the serviceaccounts. If I delete the finalizers for the serviceaccounts, the calico-system namespace will finally delete. I can then delete the calico namespace.
There are definitely left-over resources on my K8S cluster. I found a tool, kubectl really get all, which helps show me the additional resources. I had a strong suspicion that my installation was corrupt and that the best way forward would be to literally rebuild my entire cluster, but that is really not a good idea operationally. When I have live customers, we cannot be expected to rebuild the entire cluster if Calico has an issue.
I tried to delete all leftover resources to see if I could get a reinstall working.
# List out all resources in your cluster
~/kubectl-really-get-all --all-namespaces > /tmp/tmp
# Identify Calico components
cat /tmp/tmp | grep -iE "(tigera|calico)"
Once I have the list of Calico components, I can pipe the output into xargs to delete everything:
# DANGEROUS! DO NOT DO UNLESS YOU KNOW WHAT YOU ARE DOING!
... | awk '{print $1}' | xargs -L1 -I input bash -c "kubectl delete input"
This initially kept getting stuck, so I would need to manually modify the items that got stuck and remove the finalizer, such as with kubectl edit clusterrole.rbac.authorization.k8s.io/calico-node.
I then verified that everything was completely deleted using kubectl really get all. However, after reinstallation, calico came online, but the calico-system namespace, immediately after getting created, was stuck in a Terminating state.
status:
conditions:
- lastTransitionTime: "2023-06-23T16:29:14Z"
message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
complete list of server APIs: projectcalico.org/v3: the server is currently
unable to handle the request'
reason: DiscoveryFailed
status: "True"
type: NamespaceDeletionDiscoveryFailure
- lastTransitionTime: "2023-06-23T16:29:14Z"
message: All legacy kube types successfully parsed
reason: ParsedGroupVersions
status: "False"
type: NamespaceDeletionGroupVersionParsingFailure
- lastTransitionTime: "2023-06-23T16:29:14Z"
message: All content successfully deleted, may be waiting on finalization
reason: ContentDeleted
status: "False"
type: NamespaceDeletionContentFailure
- lastTransitionTime: "2023-06-23T16:29:14Z"
message: All content successfully removed
reason: ContentRemoved
status: "False"
type: NamespaceContentRemaining
- lastTransitionTime: "2023-06-23T16:29:14Z"
message: All content-preserving finalizers finished
reason: ContentHasNoFinalizers
status: "False"
type: NamespaceFinalizersRemaining
phase: Terminating
At this point, I'm not sure how to clean my system. I may try it 1 more time, but I may be stuck with VPC + Cluster rebuild.
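If it comes to that, one blunt workaround for a namespace stuck in Terminating (assuming jq is available, and with the caveat that it skips the garbage collection the finalizer exists for) is to clear its finalizers via the finalize subresource:
kubectl get namespace calico-system -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw /api/v1/namespaces/calico-system/finalize -f -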
I was able to get it to install properly and fix the bug. I attribute it to some sort of race condition between calico-system trying to delete and calico trying to install. I can't say exactly what my order of steps was, other than running helm install and helm uninstall very quickly.
With everything working, I wanted to make sure I got a 'clean' install of calico, so I attempted helm uninstall. This, however, results in calico-system being left over and the helm uninstall failing due to a timeout.
With the partially uninstalled helm chart, I am back to my original problem. I attempted to reinstall the calico helm chart, and the left-over calico-system namespace is now stuck terminating.
I am now at the conclusion that calico can't really be uninstalled properly -- or that my initial attempt at uninstalling corrupted the entire system and there is no way back. I will likely try to get it working again, simply by getting everything installed again.
I was able to get everything 'working' again by deleting the stuck service account (due to the finalizer) in calico-system. It was some combination of installing & uninstalling calico w/ helm, and it eventually came up clean.
I was able to upgrade calico from terraform, so I think this hacky way of getting it to work will be OK temporarily.
@sig-piskule thanks for the updates - sounds like you're running up against the known helm uninstall race condition problems, for which I have a PR in progress: https://github.com/tigera/operator/pull/2662
@caseydavenport
Thank you for the update. That's helpful. Since AWS is directly linking to Calico, you might want to update your document with a big red warning that says "Prior to installing Calico, make sure you have correctly configured your YAML". However, my 2 cents is that the EKS configuration doesn't seem to be actually required, since Calico worked the first time.
Unfortunately, I am hitting the issue again. I don't know if something mysteriously got redeployed, or if it just suddenly started happening, or if I was too tired on a Friday to wait for the problem to start happening again. Regardless, here is what I see:
The problem
E0627 11:56:49.015185 15798 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
mysteriously started happening again, and I unfortunately need to spend more time debugging this. This is critical to me, since I need to be able to install Network Policies using Infrastructure as Code, similar to this blog.
I came across this link, which provided some hints: https://github.com/helm/helm/issues/6361
This provides some interesting details:
$ kubectl get apiservice | grep calico
v1.crd.projectcalico.org Local True 3d23h
v3.projectcalico.org calico-apiserver/calico-api False (FailedDiscoveryCheck) 3d22h
We can see that the API Service has failed its discovery check. Digging in more:
$ kubectl get apiservice v3.projectcalico.org -o yaml
...
status:
conditions:
- lastTransitionTime: "2023-06-23T17:37:10Z"
message: 'failing or missing response from https://10.2.192.137:5443/apis/projectcalico.org/v3:
Get "https://10.2.192.137:5443/apis/projectcalico.org/v3": dial tcp 10.2.192.137:5443:
i/o timeout'
reason: FailedDiscoveryCheck
status: "False"
type: Available
So now that I know where the issue is occurring, I can begin to actually diagnose this problem. We should check what is going on from the pod end that should be serving the requests:
$ kubectl get pods -n calico-apiserver -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-apiserver-86cbf6c7fc-5cj2x 1/1 Running 0 9m24s 10.2.192.250 ip-10-2-192-242.ec2.internal <none> <none>
calico-apiserver-86cbf6c7fc-zvl85 1/1 Running 0 9m25s 10.2.192.137 ip-10-2-192-156.ec2.internal <none> <none>
I ran an ubuntu bastion pod, and from the pod, I curled the API server:
$ curl -k https://10.2.192.137:5443/apis/projectcalico.org/v3
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/apis/projectcalico.org/v3\"",
"reason": "Forbidden",
"details": {},
"code": 403
This is a different error message than the timeout. It indicates that the server is online and responding, but that we are unauthenticated. As a result, this seems like a control plane issue, and as @caseydavenport mentioned before, I could try with hostNetwork: true.
I attempted to do this with kubectl edit, but the deployment will not update the pods. I cannot edit the pods directly either.
$ cat << EOF > /tmp/patch
spec:
template:
spec:
hostNetwork: true
EOF
$ kubectl patch deployment calico-apiserver -n calico-apiserver --patch-file /tmp/patch
deployment.apps/calico-apiserver patched
$ kubectl get deployment -n calico-apiserver -o yaml | grep host
topologyKey: kubernetes.io/hostname
I then considered that perhaps the Tigera Operator was controlling this value. I investigated whether it was possible to modify this from the helm chart, but it does not seem to be possible, since it is not mentioned in the documentation.
We are planning to use the VPC CNI plugin soon, however it isn't installed yet. Therefore, setting hostNetwork: true does seem related to the problem, as indicated here. I am not sure how it might be possible to set this. It is, however, suggested that this is possible here.
At this point I'm a little lost as to how this can be fixed. I am still digging though, so I may post another update. I am posting as much as I am so that this is perhaps helpful to someone else who stumbles upon this.
EDIT: I'm pretty sure the Operator controls hostNetwork, and it is impossible to configure this. This code suggests that hostNetwork is only set to true if you are configured to run EKS and the Calico CNI. And this code suggests that hostNetwork is false by default and not configurable.
Good sleuthing! Agree with what you've said above.
I'm pretty sure the Operator controls hostNetwork, and it is impossible to configure this. This code suggests that hostNetwork is only set to true if you are configured to run EKS and Calico CNI. And This code suggests that hostNetwork is false by default and not configurable.
This is correct - the tigera operator generally owns that setting and there's no user knob for it at the moment. Generally, this is just set to the right thing based on other config and you shouldn't need to worry about it... That said, to confirm my understanding - you're seeing that, running on EKS with Calico CNI, the operator isn't setting hostNetwork: true on your apiserver pods?
If so, could you share your Installation config so I can confirm all looks OK? (kubectl get installation -o yaml)
Generally the way to make sure that is set correctly is to ensure that the CNI type is Calico and that the kubernetesProvider field is set to EKS.
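i.e. something like the following in the Installation resource (a sketch; other fields omitted):
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  kubernetesProvider: EKS
  cni:
    type: Calico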
That said, to confirm my understanding - you're seeing that running on EKS with Calico CNI, the operator isn't setting hostNetwork: true on your apiserver pods?
No, that's not what I am doing.
I have a few setups I have tried:
USECASE 1: EKS, no CNI whatsoever, w/ Service Secondary Subnet (during development while preparing usecase 2)
USECASE 2: EKS, with AWS VPC CNI, w/ Service Secondary Subnet
My gut is telling me that in order for the API Server to work with this setup, the API Server must be on the hostNetwork, but that there is no actual way to configure it.
In both cases, I get
- lastTransitionTime: "2023-06-28T19:35:09Z"
message: 'failing or missing response from https://100.112.25.27:5443/apis/projectcalico.org/v3:
Get "https://100.112.25.27:5443/apis/projectcalico.org/v3": dial tcp 100.112.25.27:5443:
i/o timeout'
In both cases, $ kubectl get deployment -n calico-apiserver -o yaml | grep host shows that hostNetwork is not set. I furthermore cannot configure it for additional debugging. I can confirm that the CNI is installed by running:
$ aws eks describe-addon --cluster-name $CLUSTER_NAME --addon-name vpc-cni --query addon.addonVersion --output text $PROFILE
v1.13.2-eksbuild.1
So CNI is definitely installed. I then ran a few commands:
echo '{ installation: {kubernetesProvider: EKS }}' > values.yaml
kubectl create namespace calico
helm install calico projectcalico/tigera-operator --version v3.26.1 -f values.yaml --namespace calico
And checked everything:
calico-apiserver calico-apiserver-68647c5f95-cfwm6 1/1 Running 0 43m
calico-apiserver calico-apiserver-68647c5f95-dz4bt 1/1 Running 0 43m
calico-system calico-kube-controllers-5977f687c9-lm5zc 1/1 Running 0 44m
calico-system calico-node-56jzf 1/1 Running 0 44m
calico-system calico-node-m8wzb 1/1 Running 0 44m
calico-system calico-typha-6758886c9-hvnfs 1/1 Running 0 44m
calico-system csi-node-driver-9p78n 2/2 Running 0 44m
calico-system csi-node-driver-b9xcf 2/2 Running 0 44m
calico tigera-operator-959786749-w7w76 1/1 Running 0 44m
I did notice that I missed step 5 on this new setup, but this did not resolve anything. This may be helpful for others though, so I am reporting it here:
If you're using version 1.9.3 or later of the Amazon VPC CNI plugin for Kubernetes, then enable the plugin to add the Pod IP address to an annotation in the calico-kube-controllers-55c98678-gh6cc Pod spec. For more information about this setting, see ANNOTATE_POD_IP on GitHub.
After all this, the Calico Stars demo only shows a connection from B to F back, and C is entirely missing.
I have discovered that AWS is doing some "bad" things (with partial blame on Calico for the Stars demo). In particular, your documentation states that we can manage Calico resources using kubectl here.
If I try the following test (similar to this document):
$ cat << EOF > sample.yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
name: allow-tcp-6379
namespace: production
EOF
I get the following error:
$ kubectl apply -f sample.yaml
error: resource mapping not found for name: "allow-access" namespace: "test" from "sample.yaml": no matches for kind "NetworkPolicy" in version "projectcalico.org/v3"
ensure CRDs are installed first
I have discovered that AWS has circumvented the Calico documentation by using the following API version.
This is actually a Calico file that AWS uses, found here.
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
namespace: client
name: allow-ui
spec:
podSelector:
matchLabels: {}
ingress:
- from:
- namespaceSelector:
matchLabels:
role: management-ui
As a result, you can have a completely working AWS demo by following the AWS guide, but when you try to do more by following the Calico documentation, you will get stuck due to the API server. This still doesn't explain why kubectl get pods was failing on my corrupted cluster or how to fix it.
So the AWS demo doesn't even use the v3 API server, and it cannot be conclusively determined whether it ever worked in the first place.
I might be able to continue with this information, though I am still concerned that my test doesn't show all the nodes connected. That is a possible networking error on my end somewhere. If I figure it out I will post more here.
My gut is telling me that in order for the API Server to work with this setup, the API Server Must be on the hostNetwork, but that there is no actual way to configure it.
USECASE 1: EKS, No CNI whatsoever w/ Service Secondary Subnet (during development while preparing usecase 2)
What do you mean by no CNI whatsoever? A CNI plugin is needed in order for pod networking to function.
USECASE 2: EKS, with AWS VPC CNI w/ Service Secondary Subnet
With the AWS VPC CNI, you do not need hostNetwork: true. So this is expected.
error: resource mapping not found for name: "allow-access" namespace: "test" from "sample.yaml": no matches for kind "NetworkPolicy" in version "projectcalico.org/v3"
This suggests a problem with the Calico API server, which is consistent with the rest of this issue.
apiVersion: networking.k8s.io/v1
This is the upstream Kubernetes NetworkPolicy API.
In general, I wouldn't think of the stars demo as a comprehensive cluster health check - it's only testing a very specific subset of functionality and it's not intended to certify that a cluster is 100% functional.
Hi @caseydavenport, thanks for getting back to me.
What do you mean by no CNI whatsoever? A CNI plugin is needed in order for pod networking to function.
By default, when you create an EKS cluster, you don't have the VPC CNI installed. By default, Pods get the same IP addresses as nodes. You can, through additional configuration, configure Services to get a different CIDR block from the nodes, but the Pods cannot get a different CIDR block. This leads to IP space exhaustion. Perhaps there is a CNI installed, but it comes by default, and I did nothing to get it there.
So what I mean by USECASE1, is that an EKS cluster without VPC-CNI installed. Whatever is there by default.
With the AWS VPC CNI, you do not need hostNetwork: true. So this is expected.
I might disagree with you on this. I do not have time to diagnose more thoroughly, as our base requirements are satisfied. My point is that it is currently not possible to get the Calico API Server to be functional (to serve projectcalico.org/v3). If you try, you will notice that the service to contact the API server is not accessible. I surmise (I don't know) that hostNetwork: true is needed even with the VPC CNI, if only for the purposes of accessing the API server. The other functionality exercised by the Stars demo is there, but it is not possible to get the API Server working.
Regardless of my hypotheses, the API Server is not available as the end result of following AWS's installation documentation.
In general, I wouldn't think of the stars demo as a comprehensive cluster health check - it's only testing a very specific subset of functionality and it's not intended to certify that a cluster is 100% functional.
Agreed, and that is somewhat my complaint. I can argue that AWS's documentation results in a cluster that is not 100% functional. AWS furthermore argues that it is functional, via the Stars demo.
The entire setup was using the upstream Kubernetes NetworkPolicy API, yet nowhere in the documentation was that made clear. Someone should have said somewhere "By the way, although you installed Calico CNI, it is not 100% functional, and you can't use the Calico API Server, and you can only use Upstream Kubernetes NetworkPolicy API because Calico API does not work".
I wish that point had been made more clear somewhere. Someone could perhaps argue that "Well, you should have read the YAML they gave you", but I don't think that's fair given how much Calico documentation I had read. I think it is fair to say that if I install Calico, the Calico documentation should work.
I think this is a genuine issue I found-- and explains some of the additional comments above.
Hitting this as well when trying to use apiVersion: projectcalico.org/v3 for my NetworkPolicy objects. Regular apiVersion: networking.k8s.io/v1 NetworkPolicies still work, though. In my existing EKS cluster:
helm repo add projectcalico https://docs.tigera.io/calico/charts
echo '{ installation: {kubernetesProvider: EKS }}' > values.yaml
kubectl create namespace tigera-operator
helm install calico projectcalico/tigera-operator --version v3.25.1 -f values.yaml --namespace tigera-operator
kubectl describe daemonset aws-node -n kube-system | grep amazon-k8s-cni: | cut -d ":" -f 3
-> v1.11.4-eksbuild.1
append.yaml
cat << EOF > append.yaml
- apiGroups:
- ""
resources:
- pods
verbs:
- patch
EOF
kubectl apply -f <(cat <(kubectl get clusterrole aws-node -o yaml) append.yaml)
kubectl set env daemonset aws-node -n kube-system ANNOTATE_POD_IP=true
kubectl delete pod <calico-kube-controllers-pod> -n calico-system
kubectl describe pod calico-kube-controllers-5cd7d477df-2xqpd -n calico-system | grep vpc.amazonaws.com/pod-ips
(it does exist)
I try to see if I can use the calico-api with the Helm chart's installed apiserver, but continue to get:
E0803 15:04:09.442090 50067 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 15:04:09.486033 50067 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 15:04:09.531374 50067 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
I delete the default apiserver as others have mentioned here and try to recreate it with the Operator Install or the Manifest Install methods mentioned here: https://docs.tigera.io/calico/3.25/operations/install-apiserver.
Once the installation is complete via either method, I then try kubectl api-resources | grep '\sprojectcalico.org', but still get:
E0803 14:59:07.147463 48902 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 14:59:07.222069 48902 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
error: unable to retrieve the complete list of server APIs: projectcalico.org/v3: the server is currently unable to handle the request
Here are some possibly relevant logs from the tigera-operator after using the Operator Install method:
{"level":"info","ts":1691095543.5946114,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"tigera-operator","Request.Name":"calico-apiserver-certs"}
{"level":"error","ts":1691095543.798494,"msg":"Reconciler error","controller":"apiserver-controller","object":{"name":"calico-apiserver-certs","namespace":"tigera-operator"},"namespace":"tigera-operator","name":"calico-apiserver-certs","reconcileID":"a1f3fc0f-ac79-44ef-8da9-aa34b9b4b91a","error":"Operation cannot be fulfilled on apiservers.operator.tigera.io \"default\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234"}
{"level":"info","ts":1691095543.7985697,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":1691095544.0377092,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"tigera-operator","Request.Name":"calico-apiserver-certs"}
{"level":"info","ts":1691095546.0342875,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095546.050731,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095546.3223627,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095551.033454,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095556.035781,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095556.0412443,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095556.0576544,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095556.27758,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095576.3226435,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095591.0325649,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095591.0371556,"logger":"status_manager","msg":"update to tigera status conflicted, retrying","reason":"Operation cannot be fulfilled on tigerastatuses.operator.tigera.io \"apiserver\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":1691095591.2886333,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095611.03601,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095611.3475456,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
{"level":"info","ts":1691095621.289486,"logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"apiserver"}
Calico Installation object output:
E0803 16:14:27.786427 63214 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:14:27.838435 63214 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:14:27.897089 63214 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
annotations:
meta.helm.sh/release-name: calico
meta.helm.sh/release-namespace: tigera-operator
creationTimestamp: "2023-08-03T19:52:47Z"
finalizers:
- tigera.io/operator-cleanup
generation: 2
labels:
app.kubernetes.io/managed-by: Helm
name: default
resourceVersion: "16370097"
uid: 4b7ceed0-61a0-4451-af48-5bf3fff7a98b
spec:
calicoNetwork:
bgp: Disabled
linuxDataplane: Iptables
cni:
ipam:
type: AmazonVPC
type: AmazonVPC
controlPlaneReplicas: 2
flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
imagePullSecrets: []
kubeletVolumePluginPath: /var/lib/kubelet
kubernetesProvider: EKS
nodeUpdateStrategy:
rollingUpdate:
maxUnavailable: 1
type: RollingUpdate
nonPrivileged: Disabled
variant: Calico
status:
calicoVersion: v3.25.1
computed:
calicoNetwork:
bgp: Disabled
linuxDataplane: Iptables
cni:
ipam:
type: AmazonVPC
type: AmazonVPC
controlPlaneReplicas: 2
flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
kubeletVolumePluginPath: /var/lib/kubelet
kubernetesProvider: EKS
nodeUpdateStrategy:
rollingUpdate:
maxUnavailable: 1
type: RollingUpdate
nonPrivileged: Disabled
variant: Calico
conditions:
- lastTransitionTime: "2023-08-03T21:13:51Z"
message: All Objects Available
observedGeneration: 2
reason: AllObjectsAvailable
status: "False"
type: Progressing
- lastTransitionTime: "2023-08-03T21:13:51Z"
message: All Objects Available
observedGeneration: 2
reason: AllObjectsAvailable
status: "False"
type: Degraded
- lastTransitionTime: "2023-08-03T21:13:51Z"
message: All objects available
observedGeneration: 2
reason: AllObjectsAvailable
status: "True"
type: Ready
mtu: 9001
variant: Calico
As @sig-piskule mentions above, I see the failed discovery check in kubectl describe apiservice/v3.projectcalico.org:
E0803 16:08:17.297531 62928 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:08:17.351823 62928 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0803 16:08:17.403059 62928 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
Name: v3.projectcalico.org
Namespace:
Labels: <none>
Annotations: <none>
API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2023-08-03T21:03:10Z
Owner References:
API Version: operator.tigera.io/v1
Block Owner Deletion: true
Controller: true
Kind: APIServer
Name: default
UID: 9fe456a6-e9ec-4d2c-a97f-16193da57fe2
Resource Version: 16368044
UID: 5a2d4b60-8814-465f-bb45-3dcbf0973b33
Spec:
Ca Bundle: <redacted>
Group: projectcalico.org
Group Priority Minimum: 1500
Service:
Name: calico-api
Namespace: calico-apiserver
Port: 443
Version: v3
Version Priority: 200
Status:
Conditions:
Last Transition Time: 2023-08-03T21:03:10Z
Message: failing or missing response from https://10.1.150.94:5443/apis/projectcalico.org/v3: Get "https://10.1.150.94:5443/apis/projectcalico.org/v3": dial tcp 10.1.150.94:5443: i/o timeout
Reason: FailedDiscoveryCheck
Status: False
Type: Available
Events: <none>
Also hitting this on EKS, without the VPC CNI addon. We're using Calico for the CNI.
This showed up in a Calico upgrade, where we also jumped from manifests to the Tigera Operator Helm Chart.
hostNetwork: true is being set accordingly.
Earlier I deleted the API server, let it be installed again, and the problem was gone for a while. Now it seems to have come back.
I'm not entirely sure yet, but it also seems to be blocking the deletion of namespaces with ArgoCD. Or at least we get a NamespaceDeletionDiscoveryFailure in the conditions, with the error in the message field.
Other than that, it's fairly annoying as it spams many lines whenever you run kubectl.
Still getting this same issue with VPC CNI + tigera-operator helm chart installation. Assuming the fix is just "not to use v3.projectcalico.org API objects"?
Rebuild the calico-apiserver and calico-kube-controllers pods:
kubectl -n calico-apiserver delete pod/calico-apiserver-xx
kubectl -n calico-apiserver delete pod/calico-apiserver-xx
kubectl -n calico-system delete pod calico-kube-controllers-xx
Those of you who are still running into this issue and using VPC: check your routing tables and see whether the apiserver ports are allowed. We got this working by allowing the connection ports.
@kenwjiang could you please elaborate on that? Because I'm having this issue using EKS with VPC. Did you enable the port on a specific security group?
@headyj I added this security rule in the EKS cluster:
node_security_group_additional_rules = {
  # calico-apiserver
  ingress_cluster_5443_webhook = {
    description                   = "Cluster API to node 5443/tcp webhook"
    protocol                      = "tcp"
    from_port                     = 5443
    to_port                       = 5443
    type                          = "ingress"
    source_cluster_security_group = true
  }
}
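For clusters not built with the terraform-aws-eks module, a roughly equivalent one-off change can be made with the AWS CLI (a sketch; the security group IDs are placeholders for your node and cluster security groups):
aws ec2 authorize-security-group-ingress --group-id <node-sg-id> --protocol tcp --port 5443 --source-group <cluster-sg-id>
Either way, the point is to allow TCP 5443 from the EKS control plane's security group to the nodes, since that is the port the Calico API server pod serves on.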
I'm not entirely sure yet, but it also seems to be blocking the deletion of namespaces with ArgoCD. Or at least we get a NamespaceDeletionDiscoveryFailure in the conditions, with the error in the message field.
We're seeing this problem suddenly when we upgrade the Tigera Operator from v3.26.4 -> v3.27.0. When we delete the operator and then try to delete the Namespace, we get stuck on the Kubernetes finalizer throwing this error:
- lastTransitionTime: "2023-12-19T04:08:27Z"
message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
complete list of server APIs: projectcalico.org/v3: stale GroupVersion discovery:
projectcalico.org/v3'
reason: DiscoveryFailed
status: "True"
type: NamespaceDeletionDiscoveryFailure
This is reproducible ... every time we install into an integration test cluster, we cannot purge the namespace.
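If the Calico apiserver has already been removed and is not coming back, the namespace finalizer usually unblocks once the orphaned APIService registration is deleted as well. A sketch of that workaround (only appropriate when you are intentionally uninstalling the v3 API):
kubectl delete apiservice v3.projectcalico.org
Once the stale registration is gone, aggregated discovery stops trying to reach the missing service and namespace deletion can complete.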
That error sounds like something is attempting to look up projectcalico.org/v3 resources in order to GC them, but the apiserver isn't responding. Can you check whether the Calico apiserver is in fact running and healthy on this cluster?
The thing is - this is happening when we're deleting the Calico resources. This is a new behavior too - it did not happen in 3.26.
When we delete the operator and then try to delete the Namespace, we get stuck on the Kubernetes finalizer throwing this error:
Could you describe the steps you're taking here more concretely? What steps do you take to delete the operator? Are you deleting the CRDs within tigera-operator.yaml as well?
this is happening when we're deleting the Calico resources
Which Calico resources?
I'd also be curious about the output from the following commands, captured while encountering the error:
kubectl get pods -n calico-apiserver
kubectl get apiservers -o yaml
kubectl get apiservice v3.projectcalico.org -o yaml
Reason: the public cloud does not support routed (BGP) mode, so cross-host access to pod addresses is unreachable (the underlying network is implemented through flow tables).
Method: kubectl edit ippools.crd.projectcalico.org default-ipv4-ippool and set ipipMode: Always or vxlanMode: Always.
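If you prefer a non-interactive version of that edit, the same field can be set with a patch against the crd.projectcalico.org/v1 resource (a sketch; pick the encapsulation that matches your environment, and note this bypasses the v3 API, which is the part that is unreachable here):
kubectl patch ippools.crd.projectcalico.org default-ipv4-ippool --type merge -p '{"spec":{"vxlanMode":"Always"}}'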
I am seeing this as well running a kubeadm cluster (1.26.15) on AWS w/ NO VPC CNI.
I just upgraded operator to 1.32.7 and calico cni to 3.27.3.
My Installation is running on subnets w/ security groups that pass all traffic originating on the SG to support VXLAN.
calicoNetwork:
  bgp: Enabled
  hostPorts: Enabled
  ipPools:
  - blockSize: 26
    cidr: 10.200.0.0/24
    disableBGPExport: false
    encapsulation: VXLAN
    natOutgoing: Enabled
    nodeSelector: all()
...this much is good, I believe?
kubectl api-resources | grep calico
bgpconfigurations crd.projectcalico.org/v1 false BGPConfiguration
bgpfilters crd.projectcalico.org/v1 false BGPFilter
bgppeers crd.projectcalico.org/v1 false BGPPeer
blockaffinities crd.projectcalico.org/v1 false BlockAffinity
caliconodestatuses crd.projectcalico.org/v1 false CalicoNodeStatus
clusterinformations crd.projectcalico.org/v1 false ClusterInformation
felixconfigurations crd.projectcalico.org/v1 false FelixConfiguration
globalnetworkpolicies crd.projectcalico.org/v1 false GlobalNetworkPolicy
globalnetworksets crd.projectcalico.org/v1 false GlobalNetworkSet
hostendpoints crd.projectcalico.org/v1 false HostEndpoint
ipamblocks crd.projectcalico.org/v1 false IPAMBlock
ipamconfigs crd.projectcalico.org/v1 false IPAMConfig
ipamhandles crd.projectcalico.org/v1 false IPAMHandle
ippools crd.projectcalico.org/v1 false IPPool
ipreservations crd.projectcalico.org/v1 false IPReservation
kubecontrollersconfigurations crd.projectcalico.org/v1 false KubeControllersConfiguration
networkpolicies crd.projectcalico.org/v1 true NetworkPolicy
networksets crd.projectcalico.org/v1 true NetworkSet
bgpconfigurations bgpconfig,bgpconfigs projectcalico.org/v3 false BGPConfiguration
bgpfilters projectcalico.org/v3 false BGPFilter
bgppeers projectcalico.org/v3 false BGPPeer
blockaffinities blockaffinity,affinity,affinities projectcalico.org/v3 false BlockAffinity
caliconodestatuses caliconodestatus projectcalico.org/v3 false CalicoNodeStatus
clusterinformations clusterinfo projectcalico.org/v3 false ClusterInformation
felixconfigurations felixconfig,felixconfigs projectcalico.org/v3 false FelixConfiguration
globalnetworkpolicies gnp,cgnp,calicoglobalnetworkpolicies projectcalico.org/v3 false GlobalNetworkPolicy
globalnetworksets projectcalico.org/v3 false GlobalNetworkSet
hostendpoints hep,heps projectcalico.org/v3 false HostEndpoint
ipamconfigurations ipamconfig projectcalico.org/v3 false IPAMConfiguration
ippools projectcalico.org/v3 false IPPool
ipreservations projectcalico.org/v3 false IPReservation
kubecontrollersconfigurations projectcalico.org/v3 false KubeControllersConfiguration
networkpolicies cnp,caliconetworkpolicy,caliconetworkpolicies projectcalico.org/v3 true NetworkPolicy
networksets netsets projectcalico.org/v3 true NetworkSet
profiles projectcalico.org/v3 false Profile
ghall.fc2dev@U-18QJ8WMEL0ADJ:~/prj/forgec2/omni/blueprint/ansible$ kubectl get apiservices | grep calico
v1.crd.projectcalico.org Local True 40h
v3.projectcalico.org calico-apiserver/calico-api True 233d
I believe this log line from the kube-apiserver is relevant. The IP address being called is my Calico API server's service, so there's a timeout when the kube-apiserver tries to reach the Calico API service.
1 available_controller.go:456] v3.projectcalico.org failed with: failing or missing response from https://10.96.17.213:443/apis/projectcalico.org/v3: Get "https://10.96.17.213:443/apis/projectcalico.org/v3": dial tcp 10.96.17.213:443: i/o timeout
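To separate a broken apiserver pod from a broken network path, one quick test (a sketch; substitute the pod IP from the first command, and run the curl from a control-plane node) is to hit the discovery endpoint directly:
kubectl get pods -n calico-apiserver -o wide
curl -k https://<calico-apiserver-pod-ip>:5443/apis/projectcalico.org/v3
If the curl also times out from the control-plane host, the problem is connectivity (security groups, encapsulation, firewall) rather than the Calico API server process itself.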
@caseydavenport Your request was to raise a separate issue.
I installed the latest version of Calico using this Helm chart. The kube-apiserver-kmaster1 logs return the following error: v3.projectcalico.org failed with: failing or missing response from https://**:443/apis/projectcalico.org/v3. Also, after each random kubectl command it returns errors about these CRDs. These CRDs are automatically installed using the Helm chart stated above.
Do I understand it correctly that these crd.projectcalico.org/v1 CRDs are still needed - so not deleting them - and that I need to manually install the v3 CRDs? If so, where can I download these v3 CRDs, as I can't find them?
I believe chet-tuttle-3 is facing some similar issues.