tigera / operator

Kubernetes operator for installing Calico and Calico Enterprise
Apache License 2.0

Error running cluster on M1 / ARM Mac OS for local development #2929

Open MuhtasimTanmoy opened 1 year ago

MuhtasimTanmoy commented 1 year ago

I am currently following the steps outlined in this guide to create a local cluster for development purposes. However, I'm encountering errors when executing kind and kubectl, which are essential for creating and managing the cluster.

The errors I'm facing include:

Expected Behavior

The cluster should be created when I run make cluster-create, and I should then be able to interact with it through kubectl.

Current Behavior

The first step, make cluster-create, fails with the aforementioned errors.

Possible Solution

I suspect the issue is related to compatibility with Apple Silicon (M1). In an attempt to resolve it, I made changes to the Makefile. That mostly got the cluster up and running, though a later step then failed with an error (shown below).

Here is the rationale behind my changes:

kubectl

kind

The KUBECONFIG=./kubeconfig.yaml go run ./ --enable-leader-election=false command is not working as expected and gives the following error:

level":"error","ts":"2023-10-15T00:46:28+06:00","logger":"setup","msg":"problem running manager","error":"failed to wait for policy-recommendation-controller caches to sync: timed out waiting for cache to be synced"

Am I heading in the right direction? What should I do to run the cluster on Apple Silicon (M1)?

Steps to Reproduce (for bugs)

  1. Clone the repo
  2. Make sure docker is running
  3. Follow this guide and try to run it on Apple Silicon (M1)

Context

I am trying to make a local cluster for development purposes.

Your Environment

tmjd commented 1 year ago

If you're getting what looks like a functional cluster from make cluster-create, then I think you're on the right path. After I run that command I get:

> export KUBECONFIG=kubeconfig.yaml 
> kubectl get pods -A
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
kube-system          coredns-558bd4d5db-4s6jh                     0/1     Pending   0          54s
kube-system          coredns-558bd4d5db-j26ws                     0/1     Pending   0          54s
kube-system          etcd-kind-control-plane                      1/1     Running   0          68s
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          68s
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          68s
kube-system          kube-proxy-67d9l                             1/1     Running   0          35s
kube-system          kube-proxy-8h8v4                             1/1     Running   0          35s
kube-system          kube-proxy-rvw7f                             1/1     Running   0          54s
kube-system          kube-proxy-z8b9p                             1/1     Running   0          35s
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          68s
local-path-storage   local-path-provisioner-5545dd49d7-wvj9w      0/1     Pending   0          54s

Also, if I look at the CRDs in the cluster I see the following. I'm including this because, based on the error you've received, I'm wondering whether they were created as they should have been.

> kubectl get crds | grep operator
amazoncloudintegrations.operator.tigera.io              2023-10-17T13:28:27Z
apiservers.operator.tigera.io                           2023-10-17T13:28:27Z
applicationlayers.operator.tigera.io                    2023-10-17T13:28:27Z
authentications.operator.tigera.io                      2023-10-17T13:28:27Z
compliances.operator.tigera.io                          2023-10-17T13:28:27Z
egressgateways.operator.tigera.io                       2023-10-17T13:28:27Z
imagesets.operator.tigera.io                            2023-10-17T13:28:27Z
installations.operator.tigera.io                        2023-10-17T13:28:27Z
intrusiondetections.operator.tigera.io                  2023-10-17T13:28:27Z
logcollectors.operator.tigera.io                        2023-10-17T13:28:27Z
logstorages.operator.tigera.io                          2023-10-17T13:28:27Z
managementclusterconnections.operator.tigera.io         2023-10-17T13:28:27Z
managementclusters.operator.tigera.io                   2023-10-17T13:28:27Z
managers.operator.tigera.io                             2023-10-17T13:28:27Z
monitors.operator.tigera.io                             2023-10-17T13:28:27Z
policyrecommendations.operator.tigera.io                2023-10-17T13:28:27Z
tenants.operator.tigera.io                              2023-10-17T13:28:27Z
tigerastatuses.operator.tigera.io                       2023-10-17T13:28:27Z
MuhtasimTanmoy commented 1 year ago

Hello @tmjd. I was able to fix the previous error and get to the exact state that you are in.

At that point, the cluster nodes were in a NotReady state, and the coredns pods were stuck in Pending because they could not get an IP address from the pod network, as no CNI was installed yet.
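
For reference, that state can be confirmed with the standard kubectl checks below (illustrative commands, not output captured at the time):

# Without a CNI the nodes report NotReady, and coredns stays Pending
# because no pod IP can be assigned.
kubectl get nodes
kubectl get pods -n kube-system -o wide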

So, after installing the default custom resource with kubectl create -f ./config/samples/operator_v1_installation.yaml, I have the following state:

NAMESPACE            NAME                                         READY   STATUS              RESTARTS   AGE
calico-system        calico-kube-controllers-6c6d97c87b-4bcdx     0/1     ContainerCreating   0          22m
calico-system        calico-node-2dtfx                            0/1     ImagePullBackOff    0          22m
calico-system        calico-node-j24j9                            0/1     ImagePullBackOff    0          22m
calico-system        calico-node-qz6xg                            0/1     ImagePullBackOff    0          22m
calico-system        calico-node-smnjn                            0/1     ImagePullBackOff    0          23m
calico-system        calico-typha-66cdfb85cf-qgw79                1/1     Running             0          23m
calico-system        calico-typha-66cdfb85cf-qks8w                1/1     Running             0          22m
calico-system        csi-node-driver-d2zsn                        0/2     ContainerCreating   0          22m
calico-system        csi-node-driver-lzqkb                        0/2     ContainerCreating   0          22m
calico-system        csi-node-driver-tcnlm                        0/2     ContainerCreating   0          22m
calico-system        csi-node-driver-z5ml9                        0/2     ContainerCreating   0          22m
kube-system          coredns-558bd4d5db-bpt8f                     0/1     ContainerCreating   0          27m
kube-system          coredns-558bd4d5db-cqvjj                     0/1     ContainerCreating   0          27m
kube-system          etcd-kind-control-plane                      1/1     Running             0          27m
kube-system          kube-apiserver-kind-control-plane            1/1     Running             0          27m
kube-system          kube-controller-manager-kind-control-plane   1/1     Running             0          27m
kube-system          kube-proxy-952lc                             1/1     Running             0          27m
kube-system          kube-proxy-jsphk                             1/1     Running             0          27m
kube-system          kube-proxy-nldmc                             1/1     Running             0          27m
kube-system          kube-proxy-vq5vm                             1/1     Running             0          27m
kube-system          kube-scheduler-kind-control-plane            1/1     Running             0          27m
local-path-storage   local-path-provisioner-778f7d66bf-dmknx      0/1     ContainerCreating   0          27m

Here the csi-node-driver, calico-kube-controllers, and local-path-provisioner pods are waiting for calico-node to be up and running. However, calico-node is hitting an ImagePullBackOff error.

Events from the pods show

>  kubectl get events -A  | grep -i calico-node-qz6xg

calico-system        27m         Normal    Scheduled                 pod/calico-node-qz6xg                           Successfully assigned calico-system/calico-node-qz6xg to kind-worker3
calico-system        27m         Normal    Pulling                   pod/calico-node-qz6xg                           Pulling image "docker.io/calico/pod2daemon-flexvol:master"
calico-system        26m         Normal    Pulled                    pod/calico-node-qz6xg                           Successfully pulled image "docker.io/calico/pod2daemon-flexvol:master" in 17.0363043s
calico-system        26m         Normal    Created                   pod/calico-node-qz6xg                           Created container flexvol-driver
calico-system        26m         Normal    Started                   pod/calico-node-qz6xg                           Started container flexvol-driver
calico-system        26m         Normal    Pulling                   pod/calico-node-qz6xg                           Pulling image "docker.io/calico/cni:master"
calico-system        24m         Normal    Pulled                    pod/calico-node-qz6xg                           Successfully pulled image "docker.io/calico/cni:master" in 2m26.246357483s
calico-system        24m         Normal    Created                   pod/calico-node-qz6xg                           Created container install-cni
calico-system        24m         Normal    Started                   pod/calico-node-qz6xg                           Started container install-cni
calico-system        22m         Normal    Pulling                   pod/calico-node-qz6xg                           Pulling image "docker.io/calico/node:master"
calico-system        22m         Warning   Failed                    pod/calico-node-qz6xg                           Failed to pull image "docker.io/calico/node:master": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/calico/node:master": no match for platform in manifest: not found
calico-system        23m         Warning   Failed                    pod/calico-node-qz6xg                           Error: ErrImagePull
calico-system        3m37s       Normal    BackOff                   pod/calico-node-qz6xg                           Back-off pulling image "docker.io/calico/node:master"
calico-system        22m         Warning   Failed                    pod/calico-node-qz6xg                           Error: ImagePullBackOff
calico-system        27m         Normal    SuccessfulCreate          daemonset/calico-node                           Created pod: calico-node-qz6xg

Specifically, this error:

Failed to pull image "docker.io/calico/node:master": rpc error: code = NotFound desc = failed to pull 
and unpack image "docker.io/calico/node:master": no match for platform in manifest: not found

So, what needs to be done to fix this, given that it is trying to fetch docker.io/calico/node:master?

Note that docker pull docker.io/calico/node:latest works, whereas node:master does not.
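
One way to confirm which platforms each tag actually publishes is to inspect the image manifests (standard Docker commands, shown here as a suggestion rather than output from my setup):

# Each entry in the returned manifest list carries a platform field;
# an arm64 entry must be present for the image to run on Apple Silicon.
docker manifest inspect docker.io/calico/node:latest
docker manifest inspect docker.io/calico/node:master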

MuhtasimTanmoy commented 1 year ago

I was able to resolve the ImagePullBackOff error by making a slight change to pkg/components/calico.go, replacing the version with "latest".

https://github.com/tigera/operator/blob/077f7483633898b43050b7cfcca9935eea34ebf0/pkg/components/calico.go#L56

So currently everything is up and running:

>  kubectl get pods -A

NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
calico-system        calico-kube-controllers-5c6d8778f5-btvp6     1/1     Running   0          17m
calico-system        calico-node-22lc7                            1/1     Running   0          17m
calico-system        calico-node-jvn5n                            1/1     Running   0          17m
calico-system        calico-node-lvm22                            1/1     Running   0          17m
calico-system        calico-node-qffqg                            1/1     Running   0          17m
calico-system        calico-typha-745f498dff-dn26b                1/1     Running   0          17m
calico-system        calico-typha-745f498dff-flbw6                1/1     Running   0          17m
calico-system        csi-node-driver-4h6w5                        2/2     Running   0          60s
calico-system        csi-node-driver-59gb8                        2/2     Running   0          17m
calico-system        csi-node-driver-b5c29                        2/2     Running   0          17m
calico-system        csi-node-driver-gzvtg                        2/2     Running   0          17m
kube-system          coredns-558bd4d5db-pz5fc                     1/1     Running   0          21m
kube-system          coredns-558bd4d5db-wrc7c                     1/1     Running   0          21m
kube-system          etcd-kind-control-plane                      1/1     Running   0          21m
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          21m
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          21m
kube-system          kube-proxy-mkdbs                             1/1     Running   0          21m
kube-system          kube-proxy-pd7b2                             1/1     Running   0          20m
kube-system          kube-proxy-q4hq8                             1/1     Running   0          20m
kube-system          kube-proxy-sgx4l                             1/1     Running   0          20m
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          21m
local-path-storage   local-path-provisioner-778f7d66bf-44cw5      1/1     Running   0          21m

So, in summary, to set up the cluster on Apple Silicon (M1) I needed to make the following three changes.

  1. In the Makefile, change the following line to include $(BUILDOS)/$(ARCH)/kubectl so that it is OS-independent: https://github.com/tigera/operator/blob/077f7483633898b43050b7cfcca9935eea34ebf0/Makefile#L271
  2. In the Makefile, change the installation of the kind binary to just sh -c "GOBIN=$(CURDIR)/$(BINDIR) go install sigs.k8s.io/kind" (a sketch of changes 1 and 2 follows this list): https://github.com/tigera/operator/blob/077f7483633898b43050b7cfcca9935eea34ebf0/Makefile#L277
  3. In pkg/components/calico.go, change the version as described above.
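
For illustration, a rough, untested sketch of what changes 1 and 2 could look like in the Makefile (the kubectl download URL and variable names are assumptions, not the exact upstream lines):

# Change 1 (sketch): download kubectl for the host OS and architecture
# instead of a hard-coded linux/amd64 path.
$(BINDIR)/kubectl:
    mkdir -p $(BINDIR)
    curl -sSfL https://dl.k8s.io/release/$(K8S_VERSION)/bin/$(BUILDOS)/$(ARCH)/kubectl -o $@
    chmod +x $@

# Change 2 (sketch): install kind on the host rather than inside the build container.
$(BINDIR)/kind:
    sh -c "GOBIN=$(CURDIR)/$(BINDIR) go install sigs.k8s.io/kind"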

Could you briefly advise whether these changes should be reflected in the source via a pull request to support local development on M1, or whether this issue needs a different approach because of side effects?

tmjd commented 1 year ago

I expect 1 and 2 would be fine. 3 wouldn't be ideal, because I think latest probably points at the most recently released images, which would not be the same as using the master images. I'm guessing the issue is that only the amd64 images are built and pushed for master builds, so the arm images are not available.

MuhtasimTanmoy commented 1 year ago

For 3, yes, using the 'latest' released image may cause issues compared to the 'master' images. But since the docker.io/calico/node:master image is unavailable for arm, what should the workaround be? This is a blocker for creating a cluster.

Additionally, is a pull request wanted for the changes made in 1 and 2?

tmjd commented 7 months ago

Sorry there hasn't been any response here for a while.

For 1: If you want to make the suggested change that would be good.

For 2: I think you are suggesting switching to

sh -c "GOBIN=$(CURDIR)/$(BINDIR) go install sigs.k8s.io/kind"

I don't think that is something we would want in general, since it would no longer be containerized, which is something we want to maintain. I'd be ok with a conditional based on BUILDOS; perhaps if BUILDOS != linux then instruct the user to copy a functional kind binary to $(BINDIR)/kind.
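
One hypothetical reading of that, keeping the existing containerized command for linux (the error message wording is only illustrative):

$(BINDIR)/kind:
ifeq ($(BUILDOS), linux)
    $(CONTAINERIZED) $(CALICO_BUILD) sh -c "GOBIN=/go/src/$(PACKAGE_NAME)/$(BINDIR) go install sigs.k8s.io/kind"
else
# Non-linux hosts: stop and ask the developer to supply a working kind binary.
    $(error Please copy a kind binary built for $(BUILDOS) to $(BINDIR)/kind)
endif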

For 3: You could ask projectcalico/calico to push the node image for arm on master builds. Another option would be to have a make target that switches the versions, to ease creating a build with latest (or some other tag). Maybe something like the following:

set-calico-version:
  sed -i -e "s/version: .*$$/version: $(VERSION)/" config/calico_versions.yml
  make gen-versions-calico
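
Assuming a target along those lines lands in the Makefile, usage could look like this (the VERSION value is illustrative):

# Point config/calico_versions.yml at the "latest" images, then regenerate
# the pinned component versions from it.
make set-calico-version VERSION=latest
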
MuhtasimTanmoy commented 7 months ago

For 1: ok

For 2:

I don't think that is something we would want in general, since it would no longer be containerized, which is something we want to maintain. I'd be ok with a conditional based on BUILDOS; perhaps if BUILDOS != linux then instruct the user to copy a functional kind binary to $(BINDIR)/kind.

Since the kind binary is used to create the local cluster on the host machine rather than in a container, as shown below, should the binary be containerized at all?

Though I might be missing some corner cases.

## Create a local kind dual stack cluster.
KIND_KUBECONFIG?=./kubeconfig.yaml
K8S_VERSION?=v1.21.14
cluster-create: $(BINDIR)/kubectl $(BINDIR)/kind
    # First make sure any previous cluster is deleted
    make cluster-destroy

    # Create a kind cluster.
    $(BINDIR)/kind create cluster \
            --config ./deploy/kind-config.yaml \
            --kubeconfig $(KIND_KUBECONFIG) \
            --image kindest/node:$(K8S_VERSION)

Does this look OK as a conditional based on BUILDOS? (Tested on darwin.)

$(BINDIR)/kind:
ifeq ($(BUILDOS), darwin)
    sh -c "GOBIN=/go/src/$(PACKAGE_NAME)/$(BINDIR) go install sigs.k8s.io/kind"
else
    $(CONTAINERIZED) $(CALICO_BUILD) sh -c "GOBIN=/go/src/$(PACKAGE_NAME)/$(BINDIR) go install sigs.k8s.io/kind"
endif

For 3: I added the following because of this issue with sed -i on macOS.

# https://stackoverflow.com/questions/4247068/sed-command-with-i-option-failing-on-mac-but-works-on-linux/4247319#4247319
set-calico-version:
ifeq ($(BUILDOS), darwin)
    sed -i '' -e 's/version: .*/version: $(VERSION)/' config/calico_versions.yml
else
    sed -i -e 's/version: .*/version: $(VERSION)/' config/calico_versions.yml
endif
    make gen-versions-calico

Should I go ahead with these changes?

tmjd commented 6 months ago

I'm good with what you're suggesting for 2, though I'll point out that I don't think you should include GOBIN in the command.

Seems reasonable for 3 also.

MuhtasimTanmoy commented 6 months ago

On another note, wouldn't adopting Nix solve the compatibility issues altogether? Reference: Using Nix with Dockerfiles

MuhtasimTanmoy commented 6 months ago

I'll point out that I don't think you should include GOBIN in the command

$(BINDIR)/kind:
ifeq ($(BUILDOS), darwin)
    sh -c go install sigs.k8s.io/kind"
else
    $(CONTAINERIZED) $(CALICO_BUILD) sh -c go install sigs.k8s.io/kind"
endif

Like this?

I will open a PR with these fixes then.

tmjd commented 6 months ago

I'd guess there is probably no need for the sh -c either. (You've also got a trailing " that you'll need to get rid of.)

On another thought, shouldn't adopting nix would solve compatibility issues altogether?

I'm not sure; we still need to build a kind binary that works on darwin or linux. Does Nix help with that?

MuhtasimTanmoy commented 6 months ago

Does this look ok?

$(BINDIR)/kind:
ifeq ($(BUILDOS), darwin)
   go install sigs.k8s.io/kind
else
    $(CONTAINERIZED) $(CALICO_BUILD) go install sigs.k8s.io/kind
endif

I'm not sure; we still need to build a kind binary that works on darwin or linux. Does Nix help with that?

Being universal, it should. I have used it to provide a consistent environment for building Docker images.

tmjd commented 6 months ago

That does not look OK. I didn't notice you were modifying the "non-darwin" command; it should remain what it has been.

Have you tried what you're suggesting for the "darwin" option? It doesn't look like it would work to me. The commands should result in a kind binary (one that works on the host system) at $(BINDIR)/kind. You probably do need a GOBIN, but it would be different from the one in the "non-darwin" command.
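
For illustration only, an untested sketch of a darwin branch with a repo-local GOBIN, keeping the existing non-darwin command unchanged:

$(BINDIR)/kind:
ifeq ($(BUILDOS), darwin)
    # Install kind on the host, dropping the binary into the repo-local bin dir.
    GOBIN=$(CURDIR)/$(BINDIR) go install sigs.k8s.io/kind
else
    $(CONTAINERIZED) $(CALICO_BUILD) sh -c "GOBIN=/go/src/$(PACKAGE_NAME)/$(BINDIR) go install sigs.k8s.io/kind"
endif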

Please put up a PR that you've tested, and ensure you run make clean before testing, so that no leftover binaries make it look like everything is working.

Being universal, it should. I have used it to provide a consistent environment for building Docker images.

But this is not building a Docker image; we're installing a binary that is used directly. So I don't understand how using Nix would help us fetch a darwin binary on darwin and a linux binary on linux.