projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Discussion: crd.projectcalico.org/v1 vs projectcalico.org/v3 #6412

Open caseydavenport opened 2 years ago

caseydavenport commented 2 years ago

This issue comes up frequently enough that I think it warrants its own parent issue to explain and discuss. I'll try to keep this up-to-date with the latest thinking and status should it change.

The problem generally manifests itself as one of the following:

  1. no matches for kind "X" in version "projectcalico.org/v3" when attempting to apply a resource.
  2. Applying a resource with apiVersion: crd.projectcalico.org/v1 and Calico not behaving as expected.

TL;DR

Don't touch crd.projectcalico.org/v1 resources. They are not currently supported for end users, and the entire API group is only used internally within Calico. Using any API within that group means you bypass API validation and defaulting, which is bad and can result in symptoms like #2 above. You should use projectcalico.org/v3 instead. Note that projectcalico.org/v3 requires that you install the Calico API server in your cluster, and will result in errors similar to #1 above if the Calico API server is not running.
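
For example, a resource written against the supported group looks like this (a minimal sketch; the pool name and CIDR below are placeholders, not values from any particular cluster):

```yaml
# Served by the Calico API server, so it gets full validation and defaulting.
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: example-pool        # placeholder name
spec:
  cidr: 192.168.0.0/16      # placeholder CIDR
  natOutgoing: true
```

The same manifest with apiVersion: crd.projectcalico.org/v1 would still be accepted by the Kubernetes API server, but it would bypass Calico's validation and defaulting entirely.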

Ok, but why do it that way?

Well, it's partly because of limitations in CRDs, and partly due to historical reasons. CRDs provide some validation out of the box on their own, but can't do some of the more complex cross-field and cross-object API validation that the Calico API server can perform. For example, making sure that IP pools are consistent with the IPAM block resources within the cluster is a complex validation process that just can't be expressed in an OpenAPI schema. Same goes for some of the defaulting operations (e.g., conditional defaulting based on other fields).

As a result, Calico uses an aggregation API server to perform these complex tasks against projectcalico.org/v3 APIs, and stores the resulting validated and defaulted resources in the "backend" as CRDs within the crd.projectcalico.org/v1 group. Prior to the introduction of said API server, all of that validation and defaulting had to be performed client-side via calicoctl, but data was still stored in the "backend" as CRDs, for Calico itself to consume.
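
Concretely, the v3 group is wired into the Kubernetes aggregation layer with an APIService registration along these lines (a sketch of the standard aggregation mechanism; the backing Service name, namespace, and priority values vary by install method and are assumptions here):

```yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v3.projectcalico.org
spec:
  group: projectcalico.org
  version: v3
  groupPriorityMinimum: 1500      # illustrative values
  versionPriority: 200
  service:
    name: calico-api              # assumed backing Service name
    namespace: calico-apiserver   # namespace seen later in this thread
    port: 443
  # caBundle: <base64-encoded CA for the serving certificate>
```

Requests to /apis/projectcalico.org/v3 are proxied by kube-apiserver to that Service, validated and defaulted there, and then persisted as crd.projectcalico.org/v1 objects.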

CRD validation has come a long way since Calico initially started using them way back in beta when they were actually called ThirdPartyResources. However, they still don't (and probably won't ever) support the types of validation that Calico currently enforces via its API server.

Pain points

Yes, this model is not perfect and has a few known (non-trivial) pain points that I would love to resolve.

Can we make it better?

Maybe. I hope so! But the solutions are not simple. We'd need to do at least some combination of the following, based on my current best guesses.

caseydavenport commented 2 years ago

Cross-referencing an older, tangential discussion: https://github.com/projectcalico/calico/issues/2923

muff1nman commented 2 years ago

> We can't do this without introducing a webhook, which is not really desirable

Why is a webhook less desirable than an apiserver? Putting the validation logic in a webhook would remove the need for the APIService (assuming defaulting could be done in the CRD).

caseydavenport commented 2 years ago

> Why is a webhook less desirable than an apiserver?

It's not that it's less desirable per se; it's mostly that it suffers from many of the same problems as a separate apiserver does - i.e., running another pod on the cluster that needs its own networking, etc., in order to provide defaulting and validation, rather than performing that within the Kubernetes API server natively.

muff1nman commented 2 years ago

The nice thing about a validating webhook is that it has a built-in toggle for skirting around it when one is in the early stages of installing Calico (assuming the webhook ran in the pod network). However, my bet is that a webhook with hostNetwork would be pretty reasonable: it shouldn't ever need failurePolicy=Ignore, and it should have very few downsides compared to the other routes considered, with the big benefit of having one API group. It also provides a fairly straight path to the last option:

> making the syntax and semantics 100% compatible with what CRD validation and defaulting provides
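
For anyone following along, the toggle referred to above is the failurePolicy field on the webhook registration. A hypothetical sketch (Calico does not ship such a webhook today, and all names below are made up) would look like:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: calico-crd-validation              # hypothetical name
webhooks:
  - name: validate.crd.projectcalico.org   # hypothetical webhook name
    failurePolicy: Fail                    # set to Ignore to skirt the webhook during bootstrap
    sideEffects: None
    admissionReviewVersions: ["v1"]
    rules:
      - apiGroups: ["crd.projectcalico.org"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["ippools", "felixconfigurations"]   # etc.
    clientConfig:
      service:
        name: calico-webhook               # hypothetical Service (hostNetwork per the suggestion above)
        namespace: calico-system
        path: /validate
```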

simplysoft commented 1 year ago

We'd like to expand on an already mentioned pain point: the requirement for Calico networking to work before the k8s API server can serve projectcalico.org/v3.

Because the aggregated v3 API is registered via a k8s service IP, this essentially also requires that routing for the k8s service CIDR works properly on the k8s controller nodes. This is particularly difficult if you are running isolated k8s controller nodes and rely on BGP and Calico to announce the k8s service CIDR routes to those controller nodes.

During our experiments we ended up in a situation where no k8s service routes were present (anymore); the k8s controller node / API server component could not properly serve projectcalico.org/v3 because the Calico API server's service IP was not reachable, and the k8s API server then failed because of that.
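
A quick way to see this failure mode from the kube-apiserver's point of view is to check the registered APIService; if the backing service IP is unreachable, its Available condition goes False (standard kubectl, nothing assumed beyond the APIService name used for the v3 group):

```bash
# Shows Available=True/False plus a reason, e.g. FailedDiscoveryCheck when the
# calico-apiserver service IP cannot be reached from the control plane.
kubectl get apiservice v3.projectcalico.org
kubectl get apiservice v3.projectcalico.org -o yaml
```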

ValdasK commented 1 year ago

It should be documented loud and clear that if the API server is not installed, the /v3 API will NOT work. I spent over half a day struggling to figure this out, as it's not noted anywhere in the install documentation.

lucasscheepers commented 1 year ago

@caseydavenport I installed the latest version of calico using this helm chart. The kube-apiserver-kmaster1 returns the following error in the logs: v3.projectcalico.org failed with: failing or missing response from https://**:443/apis/projectcalico.org/v3.

Also, after every kubectl command it returns errors about these APIs.

```
E0406 15:33:42.335160   55793 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
NAME       STATUS   ROLES           AGE   VERSION
kmaster1   Ready    control-plane   19d   v1.26.3
kworker1   Ready    <none>          18d   v1.26.3
kworker2   Ready    <none>          18d   v1.26.3
```

These CRDs are automatically installed by the helm chart mentioned above.

```
--> k api-resources | grep calico
E0406 15:34:26.465805   55853 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0406 15:34:26.481896   55853 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
bgpconfigurations                              crd.projectcalico.org/v1               false        BGPConfiguration
bgppeers                                       crd.projectcalico.org/v1               false        BGPPeer
blockaffinities                                crd.projectcalico.org/v1               false        BlockAffinity
caliconodestatuses                             crd.projectcalico.org/v1               false        CalicoNodeStatus
clusterinformations                            crd.projectcalico.org/v1               false        ClusterInformation
felixconfigurations                            crd.projectcalico.org/v1               false        FelixConfiguration
globalnetworkpolicies                          crd.projectcalico.org/v1               false        GlobalNetworkPolicy
globalnetworksets                              crd.projectcalico.org/v1               false        GlobalNetworkSet
hostendpoints                                  crd.projectcalico.org/v1               false        HostEndpoint
ipamblocks                                     crd.projectcalico.org/v1               false        IPAMBlock
ipamconfigs                                    crd.projectcalico.org/v1               false        IPAMConfig
ipamhandles                                    crd.projectcalico.org/v1               false        IPAMHandle
ippools                                        crd.projectcalico.org/v1               false        IPPool
ipreservations                                 crd.projectcalico.org/v1               false        IPReservation
kubecontrollersconfigurations                  crd.projectcalico.org/v1               false        KubeControllersConfiguration
networkpolicies                                crd.projectcalico.org/v1               true         NetworkPolicy
networksets                                    crd.projectcalico.org/v1               true         NetworkSet
error: unable to retrieve the complete list of server APIs: projectcalico.org/v3: the server is currently unable to handle the request
```

Do I understand it correctly that these crd.projectcalico.org/v1 CRDs are still needed - so I should not delete them - and that I need to manually install the v3 CRDs? If so, where can I download these v3 CRDs? I can't find them anywhere.

lucasscheepers commented 1 year ago

Or maybe someone else that can help me with this issue?

chet-tuttle-3 commented 1 year ago

I'm hitting the exact same issue:

```
kubectl get pods --all-namespaces
E0412 11:03:20.223597   36334 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0412 11:03:20.233631   36334 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0412 11:03:20.238044   36334 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0412 11:03:20.240342   36334 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
NAMESPACE          NAME                                       READY   STATUS    RESTARTS   AGE
```

Looking into the issue I've hit, there appears to be a failure in the apiserver. Look at the logs there: `kubectl logs calico-apiserver-... -n calico-apiserver`

I'm seeing the following error(s):

```
I0412 11:22:19.867232       1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
I0412 11:22:19.867272       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0412 11:22:19.867281       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
E0412 11:22:19.867362       1 configmap_cafile_content.go:243] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:22:19.867395       1 configmap_cafile_content.go:243] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:22:19.872652       1 configmap_cafile_content.go:243] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:22:19.873766       1 configmap_cafile_content.go:243] key failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0412 11:22:19.882920       1 configmap_cafile_content.go:243] kube-system/extension-apiserver-authentication failed with : missing content for CA bundle "client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
E0
```

Any chance you are seeing the same errors? When I look for the configmap in kube-system:

```
kubectl get configmaps -n kube-system
E0412 11:25:40.892890   36960 memcache.go:255] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0412 11:25:40.895456   36960 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0412 11:25:40.898183   36960 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0412 11:25:40.902856   36960 memcache.go:106] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
NAME                                 DATA   AGE
coredns-coredns                      1      22h
extension-apiserver-authentication   1      7d21h
kube-root-ca.crt                     1      7d21h
```

As you can see it is there and there is content. I took a look at each of the ClusterRoles and RoleBindings as they are laid down in my cluster and it looks like the default service account calico-apiserver has been granted access to the resource above:

All of the calico-apiserver ClusterRoleBindings point to the same subject:

```yaml
subjects:
- kind: ServiceAccount
  name: calico-apiserver
  namespace: calico-apiserver
```

That all looks fine so far and the pod is running with the same service account:

```yaml
serviceAccount: calico-apiserver
serviceAccountName: calico-apiserver
```

There is one secret:

```yaml
name: calico-apiserver-certs
namespace: calico-apiserver
```

I'm still trying to fumble through this but perhaps you see the same messages in the apiserver? I'm thinking that is the root cause but not quite sure how to address it yet.
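
If it helps anyone comparing notes, the keys that error refers to live in that same ConfigMap; checking which keys are actually present is just standard kubectl (nothing Calico-specific assumed here):

```bash
# The DATA column above shows 1, so it's worth confirming whether
# requestheader-client-ca-file is actually one of the keys.
kubectl describe configmap extension-apiserver-authentication -n kube-system
```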

cheers 🍻

lucasscheepers commented 1 year ago

Hmm I don't have any errors in the calico-apiserver container. These are my logs in the container:

```
I0411 15:52:44.507291       1 plugins.go:158] Loaded 2 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,MutatingAdmissionWebhook.
I0411 15:52:44.507487       1 plugins.go:161] Loaded 1 validating admission controller(s) successfully in the following order: ValidatingAdmissionWebhook.
I0411 15:52:44.607507       1 run_server.go:69] Running the API server
I0411 15:52:44.607522       1 run_server.go:58] Starting watch extension
W0411 15:52:44.608892       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0411 15:52:44.618983       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0411 15:52:44.618999       1 shared_informer.go:255] Waiting for caches to sync for RequestHeaderAuthRequestController
I0411 15:52:44.619024       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0411 15:52:44.619030       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0411 15:52:44.619128       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0411 15:52:44.619139       1 shared_informer.go:255] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0411 15:52:44.619808       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/calico-apiserver-certs/tls.crt::/calico-apiserver-certs/tls.key"
I0411 15:52:44.620777       1 secure_serving.go:210] Serving securely on [::]:5443
I0411 15:52:44.620810       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0411 15:52:44.621897       1 run_server.go:80] apiserver is ready.
I0411 15:52:44.719822       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0411 15:52:44.719830       1 shared_informer.go:262] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0411 15:52:44.719972       1 shared_informer.go:262] Caches are synced for RequestHeaderAuthRequestController
```

caseydavenport commented 1 year ago

@lucasscheepers / @chet-tuttle-3 please raise a separate issue for that and ping me on it - this is a high-level tracking issue for discussing general strategy, not for individual diagnosis.

lucasscheepers commented 1 year ago

@caseydavenport @chet-tuttle-3 I've created a separate issue

cucker0 commented 1 year ago

How to enable the projectcalico.org/v3 API

  1. Install the Calico API server (ref https://docs.tigera.io/calico/latest/operations/install-apiserver). `cat ./apiserver.yaml`:

     ```yaml
     apiVersion: operator.tigera.io/v1
     kind: APIServer
     metadata:
       name: default
     spec: {}
     ---
     apiVersion: operator.tigera.io/v1
     kind: Installation
     metadata:
       name: default
     spec: {}
     ```

     ```bash
     kubectl apply -f ./apiserver.yaml
     ```

     Trigger the operator to start a migration by creating an Installation resource. The operator will auto-detect your existing Calico settings and fill out the spec section. It will create an IPPool named default-ipv4-ippool.

  2. Delete the default ippool:

     ```bash
     kubectl delete ippools.projectcalico.org default-ipv4-ippool
     ```
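
Once the API server is up, a quick sanity check that the v3 group is actually being served (plain kubectl; nothing here is specific to this install method):

```bash
# Should list BGPConfiguration, IPPool, NetworkPolicy, etc. under projectcalico.org/v3
kubectl api-resources --api-group=projectcalico.org
kubectl get ippools.projectcalico.org
```
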
MarkTopping commented 9 months ago

Please could I get some further clarification.

I'm using Azure AKS and I use their turnkey enablement of Calico Network Policy. This results in a number of 'crd.projectcalico.org' CRDs being installed into my clusters which are v1.

I've subsequently been deploying/using Calico [Global]NetworkPolicies and [Global]NetworkSets for quite some time. If I alter my manifests to target apiVersion /v3 then I get the error cited in this issue: "no matches for kind "X" in version "projectcalico.org/v3" when attempting to apply a resource"

I was about to raise a ticket with Microsoft to question why they are deploying v1 and not v3 while pointing them to this particular GitHub issue since it states in here that v1 is not supported - obviously concerning.

However before doing so, I also came across these manifests this morning which are provided for the current latest version of Calico (3.26.4): https://raw.githubusercontent.com/projectcalico/calico/v3.26.4/manifests/calico.yaml

I assume these manifests are provided by the Tigera/Calico developers, and I note that they too are versioned as v1.

I'm now confused. The guidance at the top of this issue states that I should not be using v1, yet these manifests only make v1 available. And an AKS cluster with Calico Network Policy enabled also deploys v1 definitions.

caseydavenport commented 9 months ago

@MarkTopping the v1 CRDs are installed on each cluster, but are just used internally by Calico. The reason v3 is recommended over v1 is that the v3 APIs are implemented using an extension API server (as described in this document: https://docs.tigera.io/calico/latest/operations/install-apiserver) which provides defaulting and validation as a layer on top of the crd.projectcalico.org resources.

The summary is:

  • crd.projectcalico.org/v1 CRDs are installed on every cluster but are an internal API used by Calico components and so don't provide the safeguards that projectcalico.org/v3 does.
  • To interact with the Calico projectcalico.org/v3 API, you can use either the extension API server described in the document above, or you can use the calicoctl CLI tool which performs that defaulting and validation client-side.

100% agree this is confusing, and it's why I raised this issue. The fact that calico.yaml doesn't include the Calico API server by default makes this even more confusing, but it's worth noting that the primary install method via tigera-operator.yaml does install this API server by default which would make the v3 API available.

MarkTopping commented 9 months ago

> @MarkTopping the v1 CRDs are installed on each cluster, but are just used internally by Calico. The reason v3 is recommended over v1 is that the v3 APIs are implemented using an extension API server (as described in this document: https://docs.tigera.io/calico/latest/operations/install-apiserver) which provides defaulting and validation as a layer on top of the crd.projectcalico.org resources.
>
> The summary is:
>
> • crd.projectcalico.org/v1 CRDs are installed on every cluster but are an internal API used by Calico components and so don't provide the safeguards that projectcalico.org/v3 does.
> • To interact with the Calico projectcalico.org/v3 API, you can use either the extension API server described in the document above, or you can use the calicoctl CLI tool which performs that defaulting and validation client-side.
>
> 100% agree this is confusing, and it's why I raised this issue. The fact that calico.yaml doesn't include the Calico API server by default makes this even more confusing, but it's worth noting that the primary install method via tigera-operator.yaml does install this API server by default which would make the v3 API available.

Thanks for replying. Appreciated!

To be clear then: does the recommendation of installing the Calico API Server also apply to users of Azure AKS who offload the Calico installation to Microsoft when they enable the Calico NetPol feature?

I wonder, could there be any implications to this? For example, when Microsoft pushes down a Calico upgrade to customers, the API server would not get updated at the same time and would thus fall out of sync... is that safe for production clusters?

If the recommendation is for all open-source Calico users to deploy the API Server, then I believe this detail is missing from both the Calico and Microsoft AKS documentation :(

caseydavenport commented 9 months ago

> For example, when Microsoft pushes down a Calico upgrade to customers, the API server would not get updated at the same time and would thus fall out of sync... is that safe for production clusters?

There is potential here that if a new API field is introduced, the old apiserver would not be aware of it. But the risk is low so long as you're not running in that state for an extended period of time.

> does the recommendation of installing the Calico API Server also apply to users of Azure AKS who offload the Calico installation to Microsoft when they enable the Calico NetPol feature?

Really, AKS should be including the Calico API server as part of its offering and managing it as it does for other components. To be honest, I wasn't aware they weren't already. I think the "best" way in your scenario would be to use calicoctl, but obviously that's not ideal compared to being able to use kubectl, etc. I haven't tried installing the Calico API server on AKS, but I don't see why it wouldn't work (with the caveat mentioned above about keeping it in-sync with the version AKS installs).
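
For reference, pointing calicoctl at the Kubernetes datastore looks roughly like this (a sketch; the kubeconfig path and manifest name are placeholders):

```bash
# calicoctl performs the v3 validation and defaulting client-side, so no Calico
# API server needs to be running in the cluster.
export DATASTORE_TYPE=kubernetes
export KUBECONFIG=~/.kube/config                 # placeholder path
calicoctl get ippools -o wide
calicoctl apply -f my-globalnetworkpolicy.yaml   # placeholder manifest
```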