vertica / vertica-kubernetes

Operator, container and Helm chart to deploy Vertica in Kubernetes
Apache License 2.0

CreateContainerError occurs when trying to use VerticaAutoscaler #908

Closed. cyun79 closed this issue 2 months ago.

cyun79 commented 2 months ago

I'm trying to implement VerticaAutoscaler, but it doesn't work. Could anyone give me some advice?

Before generating load

[mini@vmhost ~]$ k get all

NAME                           READY   STATUS    RESTARTS   AGE
pod/vertica-eon-k8s-pri-01-0   3/3     Running   0          13m
pod/vertica-eon-k8s-pri-01-1   3/3     Running   0          13m
pod/vertica-eon-k8s-pri-01-2   3/3     Running   0          13m

NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                               AGE
service/kubernetes                        ClusterIP   10.96.0.1       <none>        443/TCP                               5h27m
service/vertica-eon-k8s                   ClusterIP   None            <none>        5434/TCP,4803/TCP,8443/TCP,5554/TCP   13m
service/vertica-eon-k8s-vdb-connections   ClusterIP   10.108.196.30   <none>        5433/TCP,8443/TCP                     13m

NAME                                      READY   AGE
statefulset.apps/vertica-eon-k8s-pri-01   3/3     13m

NAME                                         REFERENCE                  TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/vas-01   VerticaAutoscaler/vas-01   cpu: 0%/10%   3         12        3          17s

NAME                                    SUBCLUSTERS   VERSION     READY   AGE
verticadb.vertica.com/vertica-eon-k8s   1             v24.2.0-1   3/3     13m

NAME                                   GRANULARITY   CURRENT SIZE   TARGET SIZE   SCALING COUNT   AGE
verticaautoscaler.vertica.com/vas-01   Pod           3              3             0               21s

[mini@vmhost ~]$ k top pods

NAME                       CPU(cores)   MEMORY(bytes)   
vertica-eon-k8s-pri-01-0   12m          804Mi           
vertica-eon-k8s-pri-01-1   12m          713Mi           
vertica-eon-k8s-pri-01-2   12m          717Mi   

[mini@vmhost ~]$ kd hpa

Name:                                                  vas-01
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Thu, 29 Aug 2024 18:41:16 +0900
Reference:                                             VerticaAutoscaler/vas-01
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  0% (14m) / 10%
Min replicas:                                          3
Max replicas:                                          12
VerticaAutoscaler pods:                                3 current / 3 desired
Conditions:
  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
  ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
Events:           <none>
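
For reference, the HPA above targets the VerticaAutoscaler through its scale subresource; it was created along these lines (a sketch reconstructed from the kd hpa output, assuming the autoscaling/v2 API):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vas-01
spec:
  scaleTargetRef:
    apiVersion: vertica.com/v1beta1
    kind: VerticaAutoscaler
    name: vas-01
  minReplicas: 3
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 10   # matches the 10% target shown in the describe output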

After generating load

[mini@vmhost ~]$ k top pods

NAME                       CPU(cores)   MEMORY(bytes)   
vertica-eon-k8s-pri-01-0   983m         807Mi           
vertica-eon-k8s-pri-01-1   17m          709Mi           
vertica-eon-k8s-pri-01-2   18m          713Mi   

[mini@vmhost ~]$ kd hpa

Name:                                                  vas-01
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Thu, 29 Aug 2024 18:41:16 +0900
Reference:                                             VerticaAutoscaler/vas-01
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  16% (340m) / 10%
Min replicas:                                          3
Max replicas:                                          12
VerticaAutoscaler pods:                                3 current / 5 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 5
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  5s    horizontal-pod-autoscaler  New size: 5; reason: cpu resource utilization (percentage of request) above target

[mini@vmhost ~]$ k get pods

NAME                       READY   STATUS              RESTARTS   AGE
vertica-eon-k8s-pri-01-0   0/2     ContainerCreating   0          2s
vertica-eon-k8s-pri-01-1   0/2     ContainerCreating   0          2s
vertica-eon-k8s-pri-01-2   0/2     ContainerCreating   0          2s

[mini@vmhost ~]$ k get pods

NAME                       READY   STATUS                 RESTARTS   AGE
vertica-eon-k8s-pri-01-0   1/2     CreateContainerError   0          19s
vertica-eon-k8s-pri-01-1   1/2     CreateContainerError   0          19s
vertica-eon-k8s-pri-01-2   1/2     CreateContainerError   0          19s
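
The reason behind the CreateContainerError shows up in the pod events; a quick check looks roughly like this:

kd pod vertica-eon-k8s-pri-01-0                 # the Events section carries the CreateContainerError message
k get events --sort-by=.lastTimestamp | tail -20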

The operator shows the error below:

{"log":"2024-08-29T09:46:42.606Z\u0009ERROR\u0009Reconciler error\u0009{\"controller\": \"verticadb\", \"controllerGroup\": \"vertica.com\", \"controllerKind\": \"VerticaDB\", \"VerticaDB\": {\"name\":\"vertica-eon-k8s\",\"namespace\":\"default\"}, \"namespace\": \"default\", \"name\": \"vertica-eon-k8s\", \"reconcileID\": \"e385718a-7945-431d-8b99-90178d645e75\", \"error\": \"failed to copy and execute the gather script: could not execute: unable to upgrade connection: pod does not exist\", \"errorVerbose\": \"could not execute: unable to upgrade connection: pod does not exist\\nfailed to copy and execute the gather script\\ngithub.com/vertica/vertica-kubernetes/pkg/controllers/vdb.(*PodFacts).runGather\\n\\t/workspace/pkg/controllers/vdb/podfacts.go:457\\ngithub.com/vertica/vertica-kubernetes/pkg/controllers/vdb.(*PodFacts).collectPodByStsIndex\\n\\t/workspace/pkg/controllers/vdb/podfacts.go:420\\ngithub.com/vertica/vertica-kubernetes/pkg/controllers/vdb.(*PodFacts).collectSubcluster\\n\\t/workspace/pkg/controllers/vdb/podfacts.go:339\\ngithub.com/vertica/vertica-kubernetes/pkg/controllers/vdb.(*PodFacts).Collect\\n\\t/workspace/pkg/controllers/vdb/podfacts.go:282\\ngithub.com/vertica/vertica-kubernetes/pkg/controllers/vdb.(*AnnotateAndLabelPodReconciler).Reconcile\\n\\t/workspace/pkg/controllers/vdb/annotateandlabelpod_reconciler.go:56\\ngithub.com/vertica/vertica-kubernetes/pkg/controllers/vdb.(*VerticaDBReconciler).Reconcile\\n\\t/workspace/pkg/controllers/vdb/verticadb_controller.go:135\\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\\n\\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122\\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\\n\\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323\\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\\n\\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274\\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\\n\\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235\\nruntime.goexit\\n\\t/usr/local/go/src/runtime/asm_amd64.s:1695\"}\n","stream":"stdout","time":"2024-08-29T09:46:42.693813185Z"}

My environment

[mini@vmhost ~]$ kubectl version

Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.30.0

[mini@vmhost ~]$ k api-resources | grep -i vertica

eventtriggers                       et           vertica.com/v1beta1                true         EventTrigger
verticaautoscalers                  vas          vertica.com/v1beta1                true         VerticaAutoscaler
verticadbs                          vdb          vertica.com/v1                     true         VerticaDB
verticareplicators                  vrep         vertica.com/v1beta1                true         VerticaReplicator
verticarestorepointsqueries         vrpq         vertica.com/v1beta1                true         VerticaRestorePointsQuery
verticascrutinizers                 vscr         vertica.com/v1beta1                true         VerticaScrutinize

# cat vertica.yml

apiVersion: vertica.com/v1
kind: VerticaDB
metadata:
  name: "vertica-eon-k8s"
  annotations:
    vertica.com/include-uid-in-path: "true"
    vertica.com/superuser-name: vertica
spec:
  sidecars:
    - name: vlogger
      image: opentext/vertica-logger:1.0.1
      resources:
        requests:
          memory: "100Mi"
          cpu: "100m"
        limits:
          memory: "100Mi"
          cpu: "100m"
  communal:
    path: "s3://vertica-data-k8s"
    endpoint: http://192.168.0.26:9000
    credentialSecret: s3-creds
    region: "us-east-1"
  image: opentext/vertica-k8s:24.2.0-1-minimal
  imagePullPolicy: Always
  imagePullSecrets:
  - name: regcreds  
  dbName: eon_k8s
  local:
    requestSize: 10Gi
  subclusters:
  - name: pri_01
    serviceName: vdb-connections
    resources:
      requests:
        cpu: 1
        memory: 2G
      limits:
        cpu: 1
        memory: 2G
    size: 3
  shardCount: 3
  licenseSecret: vertica-license
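
The secrets referenced above (s3-creds, regcreds, vertica-license) were created beforehand. For the communal credentials, the secret looks roughly like this (a sketch; accesskey/secretkey are the key names the operator expects for S3-style storage):

kubectl create secret generic s3-creds \
  --from-literal=accesskey=<minio-access-key> \
  --from-literal=secretkey=<minio-secret-key>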

# cat vas.yml

apiVersion: vertica.com/v1beta1
kind: VerticaAutoscaler
metadata:
  name: vas-01
  namespace: default
spec:
  scalingGranularity: Pod
  #scalingGranularity: Subcluster
  serviceName: vdb-connections
  verticaDBName: vertica-eon-k8s
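
For context on how the pieces fit together: the HPA scales the VerticaAutoscaler through its scale subresource, and the operator then resizes the targeted subclusters on the VerticaDB. The subresource can be inspected directly (a sketch; jq is only for readability):

k get --raw /apis/vertica.com/v1beta1/namespaces/default/verticaautoscalers/vas-01/scale | jq .
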
roypaulin commented 2 months ago

Can you share the following:

cyun79 commented 2 months ago

I attached the files that you requested. Thanks.

e466f7e86550d36a2cd9ebb82d87fa1ce117c44080e8fbd17ea69fdf5b012079-json.log
vdb_vertica-eon-k8s.yml.log
pod_vertica-eon-k8s-pri-01-0.yml.log

roypaulin commented 2 months ago

Thanks! The issue is that at some point during the autoscaling process the annotation vertica.com/vcluster-ops was set to false, when the default value should be, and remain, true. I am going to take a look, but as a temporary fix, can you explicitly set the annotation vertica.com/vcluster-ops to true before deploying Vertica (in vertica.yml)? Like this:

annotations:
    vertica.com/vcluster-ops: "true"
...
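
In the context of the vertica.yml above, the metadata section would then read:

metadata:
  name: "vertica-eon-k8s"
  annotations:
    vertica.com/include-uid-in-path: "true"
    vertica.com/superuser-name: vertica
    vertica.com/vcluster-ops: "true"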

Let me know how it goes.

cyun79 commented 2 months ago

Thank you for your guidance. I tried that, and the CreateContainerError is gone. However, the additional nodes never become fully ready. Both scalingGranularity settings (Pod and Subcluster) produced the same result.

The results below are from when the scalingGranularity was set to "Subcluster".

[mini@vmhost ~]$ k get pods
NAME                         READY   STATUS    RESTARTS        AGE
vertica-eon-k8s-pri-01-0     3/3     Running   0               26m
vertica-eon-k8s-pri-01-1     3/3     Running   0               26m
vertica-eon-k8s-pri-01-2     3/3     Running   0               26m
vertica-eon-k8s-vas-01-0-0   2/3     Running   1 (2m41s ago)   22m
vertica-eon-k8s-vas-01-0-1   2/3     Running   1 (2m30s ago)   22m
vertica-eon-k8s-vas-01-0-2   0/3     Pending   0               22m

As you can see, pods 0 and 1 are stuck after "Starting HTTP listener on address :5554", and pod 2 never started.

[mini@vmhost ~]$ k logs pod/vertica-eon-k8s-vas-01-0-0 -f
Defaulted container "nma" out of: nma, server, vlogger
2024/08/30 15:26:33 New NodeManagementAgent starting
2024/08/30 15:26:33 Checking for existence of directory  /opt/vertica/log
2024/08/30 15:26:33 Moving working directory to  /opt/vertica/log
2024/08/30 15:26:33 Successfully opened file /proc/1/fd/1. Setting log output to that file.
2024/08/30 15:26:33 New log for process  1
2024/08/30 15:26:33 Called with args  [/opt/vertica/bin/node_management_agent]
2024/08/30 15:26:33 Hostname vertica-eon-k8s-vas-01-0-0 User id 5000
2024/08/30 15:26:33 Verbose logging is off
2024/08/30 15:26:33 Checking for existence of directory  /opt/vertica/config
2024/08/30 15:26:33 Creating pid file named  /opt/vertica/config/node_management_agent.pid
2024/08/30 15:26:33 [Info]: Initializing TLS configuration for HTTPS listener.
2024/08/30 15:26:33 [Info]: Secrets retrieval from k8s based secret store
2024/08/30 15:26:33 [Info]: Secret name not set in env. Failback to other cert retieval methods.
2024/08/30 15:26:33 [Info]: Using paths to PEM files from environment variables.
2024/08/30 15:26:33 [Info]: Writing paths to PEM files from environment variables to cache.
2024/08/30 15:26:33 [Warning]: Failed to write cache file /opt/vertica/config/https_certs/tls_path_cache.yaml. Ignoring this error and continuing: error in writing yaml file /opt/vertica/config/https_certs/tls_path_cache.yaml: open /opt/vertica/config/https_certs/tls_path_cache.yaml: no such file or directory
2024/08/30 15:26:33 [Info]: Added CA certificate(s) to trusted pool.
2024/08/30 15:26:33 [Info]: Initializing TLS configuration finished.
2024/08/30 15:26:33 Starting HTTP listener on address :5554
[mini@vmhost ~]$ k logs pod/vertica-eon-k8s-vas-01-0-1 -f
Defaulted container "nma" out of: nma, server, vlogger
2024/08/30 15:26:34 New NodeManagementAgent starting
2024/08/30 15:26:34 Checking for existence of directory  /opt/vertica/log
2024/08/30 15:26:34 Moving working directory to  /opt/vertica/log
2024/08/30 15:26:34 Successfully opened file /proc/1/fd/1. Setting log output to that file.
2024/08/30 15:26:34 New log for process  1
2024/08/30 15:26:34 Called with args  [/opt/vertica/bin/node_management_agent]
2024/08/30 15:26:34 Hostname vertica-eon-k8s-vas-01-0-1 User id 5000
2024/08/30 15:26:34 Verbose logging is off
2024/08/30 15:26:34 Checking for existence of directory  /opt/vertica/config
2024/08/30 15:26:34 Creating pid file named  /opt/vertica/config/node_management_agent.pid
2024/08/30 15:26:34 [Info]: Initializing TLS configuration for HTTPS listener.
2024/08/30 15:26:34 [Info]: Secrets retrieval from k8s based secret store
2024/08/30 15:26:34 [Info]: Secret name not set in env. Failback to other cert retieval methods.
2024/08/30 15:26:34 [Info]: Using paths to PEM files from environment variables.
2024/08/30 15:26:34 [Info]: Writing paths to PEM files from environment variables to cache.
2024/08/30 15:26:34 [Warning]: Failed to write cache file /opt/vertica/config/https_certs/tls_path_cache.yaml. Ignoring this error and continuing: error in writing yaml file /opt/vertica/config/https_certs/tls_path_cache.yaml: open /opt/vertica/config/https_certs/tls_path_cache.yaml: no such file or directory
2024/08/30 15:26:34 [Info]: Added CA certificate(s) to trusted pool.
2024/08/30 15:26:34 [Info]: Initializing TLS configuration finished.
2024/08/30 15:26:34 Starting HTTP listener on address :5554
[mini@vmhost ~]$  k logs pod/vertica-eon-k8s-vas-01-0-2 -f
Defaulted container "nma" out of: nma, server, vlogger
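
Since these are multi-container pods and k logs defaults to the nma container, the server container (where database startup and add-node progress would appear) can be checked explicitly; for example:

k logs pod/vertica-eon-k8s-vas-01-0-0 -c server --tail=50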

I attached the operator log file: verticadb-operator-manager-5f7db8557b-6gcnv.log

roypaulin commented 2 months ago

The issue is that the operator waits for all the new pods to be running before adding them to the database, but one of them is stuck in Pending. There are several reasons why a pod can be "Pending": insufficient resources in the k8s cluster (CPU/memory), pod quotas or limits, node availability in the k8s cluster, and so on. It is difficult to tell remotely what the issue might be, as the k8s cluster is yours. Are you sure your cluster has enough resources? Share the output of these commands:
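
For example, checks along these lines usually reveal why a pod stays Pending (a sketch):

kd pod vertica-eon-k8s-vas-01-0-2             # Events should show a FailedScheduling reason
kd nodes | grep -A 8 "Allocated resources"    # per-node CPU/memory already requested
k top nodes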

cyun79 commented 2 months ago

@roypaulin I really appreciate your advice; it worked after adjusting the CPU in the CR.