hyperspike / valkey-operator

A Kubernetes Operator to deploy and manage Valkey Clusters
https://hyperspike.io
Apache License 2.0

Unable to start valkey cluster on minikube #9

Closed · arpan57 closed 1 month ago

arpan57 commented 1 month ago

First of all, thank you for the initiative.

Here are the steps I have followed.

1. git clone https://github.com/hyperspike/valkey-operator.git
2. cd valkey-operator
3. make docker-build
4. make install (customresourcedefinition.apiextensions.k8s.io/valkeys.hyperspike.io created)
5. Stopped my existing minikube session
6. make minikube (it created a new kubectx, north)
7. kubectx north

It did create the valkey-operator-system namespace and three pods. However, all three pods fail to run and are stuck in FailedScheduling.

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  2m7s (x2 over 2m8s)  default-scheduler  0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  118s                 default-scheduler  0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  1s (x2 over 97s)     default-scheduler  0/6 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
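
For completeness, the unbound claims can be inspected directly with plain kubectl (nothing repo-specific):

kubectl get pvc -A
kubectl get storageclass
kubectl describe pvc <pvc-name>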
dmolik commented 1 month ago

Given that you're on a Mac, I'd try a default minikube install. And when you're installing the operator, try using the dist file:

kubectl apply -f https://raw.githubusercontent.com/hyperspike/valkey-operator/main/dist/install.yaml
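
Once that's applied, the controller should come up in the valkey-operator-system namespace; you can verify with:

kubectl -n valkey-operator-system get pods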

arpan57 commented 1 month ago

I deleted all the minikube profiles and started from scratch to avoid any leftovers.

I ran kubectl apply -f https://raw.githubusercontent.com/hyperspike/valkey-operator/main/dist/install.yaml. I can see the valkey-operator-system namespace and its pod running successfully, but the valkey-sample-n pods are stuck in a crash loop (CrashLoopBackOff).

 k describe pod valkey-sample-0
...
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  13m                     default-scheduler  0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Normal   Scheduled         13m                     default-scheduler  Successfully assigned default/valkey-sample-0 to north-m02
  Normal   Pulling           13m                     kubelet            Pulling image "docker.io/bitnami/valkey-cluster:7.2.5-debian-12-r4"
  Normal   Pulled            13m                     kubelet            Successfully pulled image "docker.io/bitnami/valkey-cluster:7.2.5-debian-12-r4" in 35.808s (35.808s including waiting). Image size: 172964760 bytes.
  Warning  Unhealthy         12m (x5 over 12m)       kubelet            Liveness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
  Normal   Killing           12m                     kubelet            Container valkey failed liveness probe, will be restarted
  Normal   Created           12m (x2 over 13m)       kubelet            Created container valkey
  Normal   Started           12m (x2 over 13m)       kubelet            Started container valkey
  Warning  Unhealthy         12m                     kubelet            Readiness probe failed:
  Normal   Pulled            12m                     kubelet            Container image "docker.io/bitnami/valkey-cluster:7.2.5-debian-12-r4" already present on machine
  Warning  Unhealthy         8m37s (x57 over 12m)    kubelet            Readiness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
  Warning  BackOff           3m27s (x17 over 7m32s)  kubelet            Back-off restarting failed container valkey in pod valkey-sample-0_default(a9b33087-3f1f-4dda-88c8-005bc236d001)
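
To capture what the container printed before its last restart (rather than the current attempt), plain kubectl can fetch the previous instance's logs:

k logs valkey-sample-0 --previous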
dmolik commented 1 month ago

This is most likely due to the minikube storage provisioner not supporting non-root access: https://github.com/kubernetes/minikube/issues/1990

You can try applying the storage hack in scripts/:

kubectl apply -f scripts/minikube-pvc-hack.yaml
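
For context, hacks of this kind usually run a privileged pod that loosens permissions on the provisioner's backing directory. A hypothetical sketch (the actual scripts/minikube-pvc-hack.yaml may differ; /tmp/hostpath-provisioner is minikube's default hostPath provisioner location):

apiVersion: v1
kind: Pod
metadata:
  name: pvc-perms-hack
spec:
  restartPolicy: Never
  containers:
    - name: chmod
      image: busybox:1.36
      # Assumed approach: make the provisioner's backing directory
      # writable for the non-root container user.
      command: ["sh", "-c", "chmod -R 0777 /tmp/hostpath-provisioner"]
      securityContext:
        privileged: true
      volumeMounts:
        - name: hostpath
          mountPath: /tmp/hostpath-provisioner
  volumes:
    - name: hostpath
      hostPath:
        path: /tmp/hostpath-provisioner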

dmolik commented 1 month ago

make minikube should now work much better on macOS.

Container images should now properly download (they're public now)

And the storage hack is part of the startup script

make minikube
kubectl apply -f https://github.com/hyperspike/valkey-operator/dist/install.yaml
arpan57 commented 1 month ago

I have to say - this time it was much smoother.

By https://github.com/hyperspike/valkey-operator/dist/install.yaml did you mean https://github.com/hyperspike/valkey-operator/blob/main/dist/install.yaml? (The resource at the mentioned link couldn't be found.) Instead I executed kubectl apply -f /path/to/valkey-operator/dist/install.yaml, which seemed to work.

After that, I applied the samples: kubectl apply -k config/samples/

However, the containers never become fully ready.

❯ k get pods
NAME                                   READY   STATUS    RESTARTS      AGE
prometheus-operator-7b87d59796-f95zc   1/1     Running   0             18m
prometheus-prometheus-0                2/2     Running   0             17m
valkey-sample-0                        0/1     Running   4 (50s ago)   5m24s
valkey-sample-1                        0/1     Running   5 (5s ago)    5m24s
valkey-sample-2                        0/1     Running   4 (35s ago)   5m24s

The logs from an example pod:

❯ k logs valkey-sample-0 -f
valkey-cluster 12:37:01.53 INFO  ==>
valkey-cluster 12:37:01.53 INFO  ==> Welcome to the Bitnami valkey-cluster container
valkey-cluster 12:37:01.53 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
valkey-cluster 12:37:01.53 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
valkey-cluster 12:37:01.53 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
valkey-cluster 12:37:01.53 INFO  ==>
valkey-cluster 12:37:01.54 INFO  ==> ** Starting Valkey setup **
valkey-cluster 12:37:01.55 INFO  ==> Initializing Valkey
valkey-cluster 12:37:01.55 INFO  ==> Setting Valkey config file

The events part:

Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  6m50s                 default-scheduler  0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
  Normal   Scheduled         6m48s                 default-scheduler  Successfully assigned default/valkey-sample-1 to north
  Warning  FailedMount       6m47s                 kubelet            MountVolume.SetUp failed for volume "scripts" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount       6m47s                 kubelet            MountVolume.SetUp failed for volume "valkey-conf" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling           6m46s                 kubelet            Pulling image "docker.io/bitnami/valkey-cluster:7.2.6-debian-12-r0"
  Normal   Pulled            6m8s                  kubelet            Successfully pulled image "docker.io/bitnami/valkey-cluster:7.2.6-debian-12-r0" in 37.931s (37.931s including waiting). Image size: 172961168 bytes.
  Normal   Created           6m8s                  kubelet            Created container valkey
  Normal   Started           6m8s                  kubelet            Started container valkey
  Warning  Unhealthy         5m41s (x5 over 6m1s)  kubelet            Liveness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
  Normal   Killing           5m41s                 kubelet            Container valkey failed liveness probe, will be restarted
  Normal   Pulled            5m11s                 kubelet            Container image "docker.io/bitnami/valkey-cluster:7.2.6-debian-12-r0" already present on machine
  Warning  Unhealthy         106s (x57 over 6m1s)  kubelet            Readiness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
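
Given the storage theory from earlier, one check that might narrow it down is the ownership of the data directory from inside a pod (the /bitnami path is an assumption based on Bitnami image conventions):

kubectl exec valkey-sample-1 -- ls -ln /bitnami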
dmolik commented 1 month ago

Interesting. What's the output of k logs valkey-sample-1 -f?

arpan57 commented 1 month ago

It's:

❯ k logs valkey-sample-1 -f
valkey-cluster 21:33:08.52 INFO  ==>
valkey-cluster 21:33:08.52 INFO  ==> Welcome to the Bitnami valkey-cluster container
valkey-cluster 21:33:08.52 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
valkey-cluster 21:33:08.52 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
valkey-cluster 21:33:08.52 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
valkey-cluster 21:33:08.52 INFO  ==>
valkey-cluster 21:33:08.52 INFO  ==> ** Starting Valkey setup **
valkey-cluster 21:33:08.54 INFO  ==> Initializing Valkey
valkey-cluster 21:33:08.57 INFO  ==> Setting Valkey config file
dmolik commented 1 month ago

Hmmm, the PV hack may not work on Macs. It might be a good time to try and build out a root-ful mode.

arpan57 commented 1 month ago

It's not clear to me how I would build in root-ful mode.

dmolik commented 1 month ago

I did a little research and went with an initContainer to set the PVC file permissions. The changes have been released in v0.0.8 and can be used like so:

apiVersion: hyperspike.io/v1
kind: Valkey
metadata:
  name: keyval
spec:
  volumePermissions: true
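
Under the hood this should add an initContainer to the StatefulSet pod template, roughly along these lines (a sketch; the actual image, UID, and mount path in v0.0.8 may differ, 1001 being the usual Bitnami non-root UID):

initContainers:
  - name: volume-permissions
    image: busybox:1.36   # assumed; the operator may use a different image
    command: ["sh", "-c", "chown -R 1001:1001 /bitnami"]
    securityContext:
      runAsUser: 0        # run as root so chown can succeed
    volumeMounts:
      - name: valkey-data # assumed volume name
        mountPath: /bitnami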

To apply this to an existing deployment, you will need to delete all Valkey deployments and upgrade the controller:

kubectl apply -f https://raw.githubusercontent.com/hyperspike/valkey-operator/main/dist/install.yaml
arpan57 commented 1 month ago

I pulled the latest from git.

Deleted the north minikube profile: minikube delete -p north

Noticed that valkey.yaml in the repo root looks similar to what you mentioned:

❯ cat valkey.yaml
apiVersion: hyperspike.io/v1
kind: Valkey
metadata:
  labels:
    app.kubernetes.io/name: valkey-operator
    app.kubernetes.io/managed-by: kustomize
  name: keyval
spec:
  volumePermissions: true

Also applied kubectl apply -f https://raw.githubusercontent.com/hyperspike/valkey-operator/main/dist/install.yaml

I am still facing the same issue on this machine. The logs are the same.

From events:

  Normal   Pulling           68m                   kubelet            Pulling image "docker.io/bitnami/valkey-cluster:7.2.6-debian-12-r0"
  Normal   Pulled            67m                   kubelet            Successfully pulled image "docker.io/bitnami/valkey-cluster:7.2.6-debian-12-r0" in 11.405s (59.853s including waiting). Image size: 172961168 bytes.
  Normal   Created           67m                   kubelet            Created container valkey
  Normal   Started           67m                   kubelet            Started container valkey
  Normal   Killing           66m                   kubelet            Container valkey failed liveness probe, will be restarted
  Warning  Unhealthy         7m21s (x17 over 67m)  kubelet            Liveness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
  Warning  Unhealthy         2m20s (x79 over 67m)  kubelet            Readiness probe failed: Could not connect to Valkey at localhost:6379: Connection refused

From the controller logs, if it helps:

2024-08-07T17:09:23Z    ERROR   failed to create valkey client  {"controller": "valkey", "controllerGroup": "hyperspike.io", "controllerKind": "Valkey", "Valkey": {"name":"keyval","namespace":"default"}, "namespace": "default", "name": "keyval", "reconcileID": "bda25c86-fadd-4b6b-ad94-e649576edfc5", "valkey": "keyval", "namespace": "default", "error": "dial tcp: lookup keyval-0.keyval-headless.default.svc: i/o timeout"}
hyperspike.io/valkey-operator/internal/controller.(*ValkeyReconciler).balanceNodes
    internal/controller/valkey_controller.go:702
hyperspike.io/valkey-operator/internal/controller.(*ValkeyReconciler).Reconcile
    internal/controller/valkey_controller.go:160
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.4/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
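
That failing lookup can be reproduced from inside the cluster with a one-off debug pod (names taken from the error message above):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup keyval-0.keyval-headless.default.svc

Note that a headless service only publishes DNS records for pods that are Ready by default, so if the pods never pass their readiness probe, this lookup failure may be a symptom rather than a separate problem.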
dmolik commented 1 month ago

Hmmmm, I wonder if the liveness and readiness probes are simply expiring before the daemon comes up. Can you try bumping the failure threshold from 5 to 25?
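
In pod-spec terms that would look roughly like this (the probe command shown is an assumption; the operator generates its own):

livenessProbe:
  exec:
    command: ["sh", "-c", "valkey-cli -h localhost -p 6379 ping"]  # assumed probe command
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 25   # bumped from 5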

arpan57 commented 1 month ago

Tried the following for both the liveness and readiness probes:

InitialDelaySeconds: 30,
FailureThreshold:    25,

No changes in the results.

dmolik commented 1 month ago

@arpan57 are we good to close?

arpan57 commented 1 month ago

I am going to try it on a k8s cluster instead of minikube. TBH it has worked on one of my personal MacBooks, but not the other. I will shelve the issue for now. Thank you for all the follow-up.