openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

Mayastor v2.1.0 Helmchart installation fails #1368

Closed Titus-von-Koeller closed 1 year ago

Titus-von-Koeller commented 1 year ago

First of all, thank you so much for your amazing work, it's greatly appreciated! Let me know if there is any info missing or if there is anything else I can help out with.

Bug description The issue relates to a failed Helm-chart-based install of Mayastor v2.1.0 on a Talos Linux-based on-prem Kubernetes cluster with no pre-existing storage classes. The issue is unlikely to be related to Talos Linux, as I have since successfully installed Mayastor v2.0.1 on the same cluster, using the exact same steps as outlined below.

Expected behavior The expected behavior is a clean and fully working Helm-chart-based install of Mayastor v2.1.0 on the Talos Linux-based on-prem Kubernetes cluster with no pre-existing storage classes.

OS info

To Reproduce Steps to reproduce the behavior (following the official install instructions):

  1. Optional (the bug doesn't seem Talos-related): patch the Talos worker node config:

Using gen config (if you're generating a worker config from scratch)

talosctl gen config my-cluster https://mycluster.local:6443 --config-patch '[{"op": "add", "path": "/machine/sysctls", "value": {"vm.nr_hugepages": "1024"}}, {"op": "add", "path": "/machine/kubelet/extraArgs", "value": {"node-labels": "openebs.io/engine=mayastor"}}]'

Patching an existing node

talosctl patch --mode=no-reboot machineconfig -n <node ip> --patch '[{"op": "add", "path": "/machine/sysctls", "value": {"vm.nr_hugepages": "1024"}}, {"op": "add", "path": "/machine/kubelet/extraArgs", "value": {"node-labels": "openebs.io/engine=mayastor"}}]'
  2. Set storageClass to manual for bootstrapping, as there are no storage classes available in the on-prem Talos cluster. Modify values.yaml as shown in the provided diff:
    
    diff --git a/storage-provisioner/mayastor/values.yaml b/storage-provisioner/mayastor/values.yaml
    index ead3aad..fafc93b 100644
    --- a/storage-provisioner/mayastor/values.yaml
    +++ b/storage-provisioner/mayastor/values.yaml
    @@ -311,7 +311,8 @@ etcd:
     # -- Will define which storageClass to use in etcd's StatefulSets
     # a `manual` storageClass will provision a hostpath PV on the same node
     # an empty storageClass will use the default StorageClass on the cluster
    -    storageClass: ""
    +    # need to set this to manual in order to bootstrap, update to Mayastor later:
    +    storageClass: manual

    @@ -385,7 +386,8 @@ loki-stack:
           # -- StorageClass for Loki's centralised log storage
           # empty storageClass implies cluster default storageClass & `manual` creates a static hostpath PV
    -      storageClassName: ""
    +      # need to set this to manual in order to bootstrap, update to Mayastor later:
    +      storageClassName: manual
  3. Pre-apply the mayastor-namespace.yaml file:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: mayastor
      labels:
        pod-security.kubernetes.io/audit: privileged
        pod-security.kubernetes.io/enforce: privileged
        pod-security.kubernetes.io/warn: privileged

    This was necessary, as Talos Linux enforces strict security policies by default; I found the relevant hint here.

  4. Run the command helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 -f values.yaml.

  5. Check the pod status with kubectl get pods -n mayastor.

  6. Observe the failure: the agent-core container crashes because the CLI argument --pool-commitment isn't valid in this context.

The faulty command seems to be:

docker.io/openebs/mayastor-agent-core:v2.0.1
      -smayastor-etcd:2379
      --request-timeout=5s
      --cache-period=30s
      --grpc-server-addr=0.0.0.0:50051
      --pool-commitment=250%   <------------ !
      --volume-commitment-initial=40%
      --volume-commitment=40%
❯ kubectl logs -n mayastor mayastor-agent-core-b89449b85-ttdqh --container agent-core -f           
error: Found argument '--pool-commitment' which wasn't expected, or isn't valid in this context

USAGE:
    core --cache-period <cache-period> --grpc-server-addr <grpc-server-addr> --request-timeout <request-timeout> --store <store>

For more information try --help
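One way to sanity-check which image tags the chart would actually render for the values file used here is to template it locally and filter the image lines; a rough sketch (the grep pattern is only an approximate filter):

    helm template mayastor mayastor/mayastor --version 2.1.0 -n mayastor -f values.yaml | grep 'image:' | sort -u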

Command Logs Please refer to the following command outputs, which describe the issue in detail:

❯ helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 -f values.yaml                                                               
NAME: mayastor                                                                                                                                     
LAST DEPLOYED: Sat Apr 29 20:41:22 2023                                                                                                            
NAMESPACE: mayastor                                                                                                                                
STATUS: deployed                                                                                                                                   
REVISION: 1                                                                                                                                        
NOTES:                                                                                                                                             
OpenEBS Mayastor has been installed.
❯ kubectl get pods -n mayastor                                                                                                                     
NAME                                         READY   STATUS             RESTARTS      AGE                                                          
mayastor-agent-core-b89449b85-ttdqh          1/2     CrashLoopBackOff   4 (28s ago)   3m48s                                                        
mayastor-agent-ha-node-57c8v                 0/1     Init:0/1           0             3m48s                                                        
mayastor-agent-ha-node-ggbc9                 0/1     Init:0/1           0             3m48s                                                        
mayastor-agent-ha-node-px25q                 0/1     Init:0/1           0             3m48s                                                        
mayastor-api-rest-55cc4b6765-gzmjp           0/1     Init:0/2           0             3m48s                                                        
mayastor-csi-controller-66c6b4f87-xjpc7      0/3     Init:0/1           0             3m48s                                                        
mayastor-csi-node-bsw94                      2/2     Running            0             3m48s                                                        
mayastor-csi-node-hdg96                      2/2     Running            0             3m48s                                                        
mayastor-csi-node-mw9hg                      2/2     Running            0             3m48s                                                        
mayastor-etcd-0                              1/1     Running            0             3m48s                                                        
mayastor-etcd-1                              1/1     Running            0             3m48s                                                        
mayastor-etcd-2                              1/1     Running            0             3m48s                                                        
mayastor-io-engine-48r2s                     0/2     Init:0/2           0             3m48s                                                        
mayastor-io-engine-b2k48                     0/2     Init:0/2           0             3m48s
mayastor-io-engine-chfng                     0/2     Init:0/2           0             3m48s
mayastor-loki-0                              1/1     Running            0             3m48s                                                        
mayastor-obs-callhome-84f99775dc-9k9dp       1/1     Running            0             3m48s                                                        
mayastor-operator-diskpool-d74cbb49f-9tq9v   0/1     Init:0/2           0             3m48s                                                        
mayastor-promtail-2w646                      1/1     Running            0             3m48s                                                        
mayastor-promtail-h4gbn                      1/1     Running            0             3m48s                                                        
mayastor-promtail-v6wnk                      1/1     Running            0             3m48s   
❯ kubectl logs -n mayastor mayastor-agent-core-b89449b85-ttdqh --container agent-core -f           
error: Found argument '--pool-commitment' which wasn't expected, or isn't valid in this context

USAGE:
    core --cache-period <cache-period> --grpc-server-addr <grpc-server-addr> --request-timeout <request-timeout> --store <store>

For more information try --help
❯ k describe pod -n mayastor mayastor-agent-core-b89449b85-ttdqh                                                                                   
Name:                 mayastor-agent-core-b89449b85-ttdqh                                                                                          
Namespace:            mayastor                                                                                                                     
Priority:             2000000000                                                                                                                   
Priority Class Name:  system-cluster-critical                                                                                                      
Service Account:      mayastor-service-account                                                                                                     
Node:                 talos-k1v-tpt/XX.XXX.X.X                                                                                                  
Start Time:           Sat, 29 Apr 2023 20:41:33 +0200                                                                                              
Labels:               app=agent-core                                                                                                               
                      openebs.io/logging=true                                                                                                      
                      openebs.io/release=mayastor                                                                                                  
                      openebs.io/version=2.1.0                                                                                                     
                      pod-template-hash=b89449b85                                                                                                  
Annotations:          <none>                                                                                                                       
Status:               Running                                                                                                                      
IP:                   XX.XXX.X.X                                                                                                                   
IPs:                                                                                                                                               
  IP:           XX.XXX.X.X                                                                                                                         
Controlled By:  ReplicaSet/mayastor-agent-core-b89449b85                                                                                           
Init Containers:                                                                                                                                   
  etcd-probe:                                                                                                                                      
    Container ID:  containerd://32f414dac902218a960f7fd7d00997711a0bce714c2b56eaaea10560a0ec392a                                                   
    Image:         busybox:latest                                                                                                                  
    Image ID:      docker.io/library/busybox@sha256:b5d6fe0712636ceb7430189de28819e195e8966372edfc2d9409d79402a0dc16                               
    Port:          <none>                                                                                                                          
    Host Port:     <none>                                                                                                                          
    Command:                                                                                                                                       
      sh                                                                                                                                           
      -c                                                                                                                                           
      trap "exit 1" TERM; until nc -vzw 5 mayastor-etcd 2379; do date; echo "Waiting for etcd..."; sleep 1; done;
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 29 Apr 2023 20:41:39 +0200
      Finished:     Sat, 29 Apr 2023 20:43:09 +0200
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cktnm (ro)
Containers:
  agent-core:
    Container ID:  containerd://7ae3b77d1d1215ae51f18e1411a6d38687a1eb12fd8985b0ee99280dd25caea0
    Image:         docker.io/openebs/mayastor-agent-core:v2.0.1
    Image ID:      docker.io/openebs/mayastor-agent-core@sha256:d64e6499a34e1d9a3fbc8c65dea5dede4192ec729c9716b2ad3a7474efa6b4c6
    Port:          50051/TCP
    Host Port:     0/TCP
    Args:
      -smayastor-etcd:2379
      --request-timeout=5s
      --cache-period=30s
      --grpc-server-addr=0.0.0.0:50051
      --pool-commitment=250%
      --volume-commitment-initial=40%
      --volume-commitment=40%
    State:          Waiting
      Reason:       CrashLoopBackOff 
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 29 Apr 2023 20:49:12 +0200
      Finished:     Sat, 29 Apr 2023 20:49:12 +0200
    Ready:          False
    Restart Count:  6
    Limits:
      cpu:     1
      memory:  128Mi
    Requests:
      cpu:     500m
      memory:  32Mi
    Environment:
      RUST_LOG:          info
      MY_POD_NAME:       mayastor-agent-core-b89449b85-ttdqh (v1:metadata.name)
      MY_POD_NAMESPACE:  mayastor (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cktnm (ro)
  agent-ha-cluster:
    Container ID:  containerd://91dbeb98d321e6cf1e534bfaa40e95c502a603b321f09afef36a6a30448d0b83
    Image:         docker.io/openebs/mayastor-agent-ha-cluster:v2.0.1
    Image ID:      docker.io/openebs/mayastor-agent-ha-cluster@sha256:9937911202570ca03074486ae4b08a149fa20ead28a62a66fa5cf228ef3690e7
    Port:          50052/TCP
    Host Port:     0/TCP
    Args:
      -g=0.0.0.0:50052
      --store=http://mayastor-etcd:2379
      --core-grpc=https://mayastor-agent-core:50051
    State:          Running
      Started:      Sat, 29 Apr 2023 20:43:19 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  64Mi
    Requests:
      cpu:     100m
      memory:  16Mi
    Environment:
      RUST_LOG:          info
      MY_POD_NAME:       mayastor-agent-core-b89449b85-ttdqh (v1:metadata.name)
      MY_POD_NAMESPACE:  mayastor (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cktnm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-cktnm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/arch=amd64
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 5s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 5s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  10m                    default-scheduler  Successfully assigned mayastor/mayastor-agent-core-b89449b85-ttdqh to talos-k1v-tpt
  Normal   Pulling    10m                    kubelet            Pulling image "busybox:latest"
  Normal   Pulled     10m                    kubelet            Successfully pulled image "busybox:latest" in 3.972393053s (3.972413893s including waiting)
  Normal   Created    10m                    kubelet            Created container etcd-probe
  Normal   Started    10m                    kubelet            Started container etcd-probe
  Normal   Pulling    8m42s                  kubelet            Pulling image "docker.io/openebs/mayastor-agent-core:v2.0.1"
  Normal   Pulling    8m37s                  kubelet            Pulling image "docker.io/openebs/mayastor-agent-ha-cluster:v2.0.1"
  Normal   Pulled     8m33s                  kubelet            Successfully pulled image "docker.io/openebs/mayastor-agent-ha-cluster:v2.0.1" in 3.993577545s (3.993587884s including waiting)
  Normal   Created    8m33s                  kubelet            Created container agent-ha-cluster
  Normal   Started    8m33s                  kubelet            Started container agent-ha-cluster
  Normal   Created    7m52s (x4 over 8m37s)  kubelet            Created container agent-core
  Normal   Started    7m52s (x4 over 8m37s)  kubelet            Started container agent-core
  Normal   Pulled     7m52s (x3 over 8m32s)  kubelet            Container image "docker.io/openebs/mayastor-agent-core:v2.0.1" already present on machine
  Warning  BackOff    13s (x40 over 8m31s)   kubelet            Back-off restarting failed container agent-core in pod mayastor-agent-core-b89449b85-ttdqh_mayastor(364dd227-c100-4814-b329-9bb5f0358965)
csnyder616 commented 1 year ago

I'm able to work around this by making two edits to the deployment for openebs-agent-core:

tiagolobocastro commented 1 year ago

How odd, were these fresh installs of 2.1.0? Then the images shouldn't be set to 2.0.1 for the 2.1.0 release... :/ I've just double-checked both the helm tar and also tried a fresh install, and I do get 2.1.0 tags.

tiagolobocastro commented 1 year ago

@Titus-von-Koeller is your values.yaml by any chance setting the image tags to 2.0.1? Could you share your values.yaml please? @csnyder616 do you perhaps also have a similar values.yaml setting the tags? If so, simply changing the container tag might not be ideal, as you might end up with an install with mixed tags.
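For reference, the defaults that ship with a given chart version can be dumped and diffed against a customised file to spot pinned settings like image tags; a minimal sketch (file names are only examples):

    helm show values mayastor/mayastor --version 2.1.0 > values-2.1.0.yaml
    diff values.yaml values-2.1.0.yaml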

Titus-von-Koeller commented 1 year ago

I've realized my mistake now.

Initially, I was working with Mayastor v2.0.1 and used the corresponding values.yaml file. However, by the time I performed a fresh install on a fresh cluster, v2.1.0 had been released. I installed the v2.1.0 Helm chart but used the v2.0.1-based values.yaml, as I didn't realize that the values.yaml file is version-dependent; I'm still new to using Helm.

I'll attach my values.yaml file for reference, and I'll update you later today or by tomorrow at the latest on whether the installation succeeds with the v2.1.0-based values.yaml, then close the issue. Apologies for this oversight, and thank you for the clarification.

image:
  # -- Image registry to pull our product images
  registry: docker.io
  # -- Image registry's namespace
  repo: openebs
  # -- Release tag for our images
  tag: v2.0.1
  # -- ImagePullPolicy for our images
  pullPolicy: IfNotPresent

# -- Node labels for pod assignment
# ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
# Note that if multi-arch images support 'kubernetes.io/arch: amd64'
# should be removed and set 'nodeSelector' to empty '{}' as default value.
nodeSelector:
  kubernetes.io/arch: amd64

earlyEvictionTolerations:
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 5
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 5

base:
  # -- Request timeout for rest & core agents
  default_req_timeout: 5s
  # -- Cache timeout for core agent & diskpool deployment
  cache_poll_period: 30s
  # -- Silence specific module components
  logSilenceLevel:
  initContainers:
    enabled: true
    containers:
      - name: agent-core-grpc-probe
        image: busybox:latest
        command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50051; do date; echo "Waiting for agent-core-grpc services..."; sleep 1; done;']
      - name: etcd-probe
        image: busybox:latest
        command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}}; do date; echo "Waiting for etcd..."; sleep 1; done;']
  initHaNodeContainers:
    enabled: true
    containers:
      - name: agent-cluster-grpc-probe
        image: busybox:latest
        command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50052; do date; echo "Waiting for agent-cluster-grpc services..."; sleep 1; done;']
  initCoreContainers:
    enabled: true
    containers:
      - name: etcd-probe
        image: busybox:latest
        command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}}; do date; echo "Waiting for etcd..."; sleep 1; done;']
  # docker-secrets required to pull images if the container registry from image.Registry is protected
  imagePullSecrets:
    # -- Enable imagePullSecrets for pulling our container images
    enabled: false
    # Name of the imagePullSecret in the installed namespace
    secrets:
      - name: login

  metrics:
    # -- Enable the metrics exporter
    enabled: true
    # metrics refresh time
    # WARNING: Lowering pollingInterval value will affect performance adversely
    pollingInterval: "5m"

  jaeger:
    # -- Enable jaeger tracing
    enabled: false
    initContainer: true
    agent:
      name: jaeger-agent
      port: 6831
      initContainer:
        - name: jaeger-probe
          image: busybox:latest
          command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 -u {{.Values.base.jaeger.agent.name}} {{.Values.base.jaeger.agent.port}}; do date; echo "Waiting for jaeger..."; sleep 1; done;']
  initRestContainer:
    enabled: true
    initContainer:
      - name: api-rest-probe
        image: busybox:latest
        command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-api-rest 8081; do date; echo "Waiting for REST API endpoint to become available"; sleep 1; done;']

operators:
  pool:
    # -- Log level for diskpool operator service
    logLevel: info
    resources:
      limits:
        # -- Cpu limits for diskpool operator
        cpu: "100m"
        # -- Memory limits for diskpool operator
        memory: "32Mi"
      requests:
        # -- Cpu requests for diskpool operator
        cpu: "50m"
        # -- Memory requests for diskpool operator
        memory: "16Mi"

jaeger-operator:
  # Name of jaeger operator
  name: "{{ .Release.Name }}"
  crd:
    # Install jaeger CRDs
    install: false
  jaeger:
    # Install jaeger-operator
    create: false
  rbac:
    # Create a clusterRole for Jaeger
    clusterRole: true

agents:
  core:
    # -- Log level for the core service
    logLevel: info
    resources:
      limits:
        # -- Cpu limits for core agents
        cpu: "1000m"
        # -- Memory limits for core agents
        memory: "128Mi"
      requests:
        # -- Cpu requests for core agents
        cpu: "500m"
        # -- Memory requests for core agents
        memory: "32Mi"
  ha:
    enabled: true
    node:
      # -- Log level for the ha node service
      logLevel: info
      resources:
        limits:
          # -- Cpu limits for ha node agent
          cpu: "100m"
          # -- Memory limits for ha node agent
          memory: "64Mi"
        requests:
          # -- Cpu requests for ha node agent
          cpu: "100m"
          # -- Memory requests for ha node agent
          memory: "64Mi"
    cluster:
      # -- Log level for the ha cluster service
      logLevel: info
      resources:
        limits:
          # -- Cpu limits for ha cluster agent
          cpu: "100m"
          # -- Memory limits for ha cluster agent
          memory: "64Mi"
        requests:
          # -- Cpu requests for ha cluster agent
          cpu: "100m"
          # -- Memory requests for ha cluster agent
          memory: "16Mi"

apis:
  rest:
    # -- Log level for the rest service
    logLevel: info
    # -- Number of replicas of rest
    replicaCount: 1
    resources:
      limits:
        # -- Cpu limits for rest
        cpu: "100m"
        # -- Memory limits for rest
        memory: "64Mi"
      requests:
        # -- Cpu requests for rest
        cpu: "50m"
        # -- Memory requests for rest
        memory: "32Mi"
    # Rest service parameters define how the rest service is exposed
    service:
      # -- Rest K8s service type
      type: ClusterIP
      # Ports from where rest endpoints are accessible from outside the cluster, only valid if type is NodePort
      nodePorts:
        # NodePort associated with http port
        http: 30011
        # NodePort associated with https port
        https: 30010

csi:
  image:
    # -- Image registry to pull all CSI Sidecar images
    registry: registry.k8s.io
    # -- Image registry's namespace
    repo: sig-storage
    # -- imagePullPolicy for all CSI Sidecar images
    pullPolicy: IfNotPresent
    # -- csi-provisioner image release tag
    provisionerTag: v2.2.1
    # -- csi-attacher image release tag
    attacherTag: v3.2.1
    # -- csi-node-driver-registrar image release tag
    registrarTag: v2.1.0

  controller:
    # -- Log level for the csi controller
    logLevel: info
    resources:
      limits:
        # -- Cpu limits for csi controller
        cpu: "32m"
        # -- Memory limits for csi controller
        memory: "128Mi"
      requests:
        # -- Cpu requests for csi controller
        cpu: "16m"
        # -- Memory requests for csi controller
        memory: "64Mi"
  node:
    logLevel: info
    topology:
      segments:
        openebs.io/csi-node: mayastor
      # -- Add topology segments to the csi-node daemonset node selector
      nodeSelector: false
    resources:
      limits:
        # -- Cpu limits for csi node plugin
        cpu: "100m"
        # -- Memory limits for csi node plugin
        memory: "128Mi"
      requests:
        # -- Cpu requests for csi node plugin
        cpu: "100m"
        # -- Memory requests for csi node plugin
        memory: "64Mi"
    nvme:
      # -- The nvme_core module io timeout in seconds
      io_timeout: "30"
      # -- The ctrl_loss_tmo (controller loss timeout) in seconds
      ctrl_loss_tmo: "1980"
      # Kato (keep alive timeout) in seconds
      keep_alive_tmo: ""
    # -- The kubeletDir directory for the csi-node plugin
    kubeletDir: /var/lib/kubelet
    pluginMounthPath: /csi
    socketPath: csi.sock

io_engine:
  # -- Log level for the io-engine service
  logLevel: info,io_engine=info
  api: "v1"
  target:
    nvmf:
      # -- NVMF target interface (ip, mac, name or subnet)
      iface: ""
      # -- Reservations Persist Through Power Loss State
      ptpl: true
  # -- Pass additional arguments to the Environment Abstraction Layer.
  # Example: --set {product}.envcontext=iova-mode=pa
  envcontext: ""
  reactorFreezeDetection:
    enabled: false
  # -- The number of cpu that each io-engine instance will bind to.
  cpuCount: "2"
  # -- Node selectors to designate storage nodes for diskpool creation
  # Note that if multi-arch images support 'kubernetes.io/arch: amd64'
  # should be removed.
  nodeSelector:
    openebs.io/engine: mayastor
    kubernetes.io/arch: amd64
  resources:
    limits:
      # -- Cpu limits for the io-engine
      cpu: ""
      # -- Memory limits for the io-engine
      memory: "1Gi"
      # -- Hugepage size available on the nodes
      hugepages2Mi: "2Gi"
    requests:
      # -- Cpu requests for the io-engine
      cpu: ""
      # -- Memory requests for the io-engine
      memory: "1Gi"
      # -- Hugepage size available on the nodes
      hugepages2Mi: "2Gi"

etcd:
  env:
    # seeing CrashLoopBackOff, need to add debugging logging for analysis. Remove later!
    ETCD_LOG_LEVEL: debug
  # Pod labels; okay to remove the openebs logging label if required
  podLabels:
    app: etcd
    openebs.io/logging: "true"
  # -- Number of replicas of etcd
  replicaCount: 3
  # Kubernetes Cluster Domain
  clusterDomain: cluster.local
  # TLS authentication for client-to-server communications
  # ref: https://etcd.io/docs/current/op-guide/security/
  client:
    secureTransport: false
  # TLS authentication for server-to-server communications
  # ref: https://etcd.io/docs/current/op-guide/security/
  peer:
    secureTransport: false
  # Enable persistence using Persistent Volume Claims
  persistence:
    # -- If true, use a Persistent Volume Claim. If false, use emptyDir.
    enabled: true
    # -- Will define which storageClass to use in etcd's StatefulSets
    # a `manual` storageClass will provision a hostpath PV on the same node
    # an empty storageClass will use the default StorageClass on the cluster
    # storageClass: ""
    storageClass: manual # SLX: need to set this to manual in order to bootstrap, update to Mayastor later
    # -- Volume size
    size: 2Gi
    # -- PVC's reclaimPolicy
    reclaimPolicy: "Delete"
  # -- Use a PreStop hook to remove the etcd members from the etcd cluster on container termination
  # Ignored if lifecycleHooks is set or replicaCount=1
  removeMemberOnContainerTermination: false

  # -- AutoCompaction
  # Since etcd keeps an exact history of its keyspace, this history should be
  # periodically compacted to avoid performance degradation
  # and eventual storage space exhaustion.
  # Auto compaction mode. Valid values: "periodic", "revision".
  # - 'periodic' for duration based retention, defaulting to hours if no time unit is provided (e.g. 5m).
  # - 'revision' for revision number based retention.
  autoCompactionMode: revision
  # -- Auto compaction retention length. 0 means disable auto compaction.
  autoCompactionRetention: 100
  extraEnvVars:
    # -- Raise alarms when backend size exceeds the given quota.
    - name: ETCD_QUOTA_BACKEND_BYTES
      value: "8589934592"

  auth:
    rbac:
      create: false
      enabled: false
      allowNoneAuthentication: true
  # Init containers parameters:
  # volumePermissions: Change the owner and group of the persistent volume mountpoint to runAsUser:fsGroup values from the securityContext section.
  #
  volumePermissions:
    # chown the mounted volume; this is required if a statically provisioned hostpath volume is used
    enabled: true
  # extra debug information on logs
  debug: false
  initialClusterState: "new"
  # Pod anti-affinity preset
  # Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity
  podAntiAffinityPreset: "hard"

  # etcd service parameters defines how the etcd service is exposed
  service:
    # K8s service type
    type: ClusterIP

    # etcd client port
    port: 2379

    # Specify the nodePort(s) value(s) for the LoadBalancer and NodePort service types.
    # ref: https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport
    #
    nodePorts:
      # Port from where etcd endpoints are accessible from outside cluster
      clientPort: 31379
      peerPort: ""

loki-stack:
  # -- Enable loki log collection for our components
  enabled: true
  loki:
    rbac:
      # -- Create rbac roles for loki
      create: true
      pspEnabled: false
    # -- Enable loki installation as part of loki-stack
    enabled: true
    # Install loki with persistence storage
    persistence:
      # -- Enable persistence storage for the logs
      enabled: true
      # -- StorageClass for Loki's centralised log storage
      # empty storageClass implies cluster default storageClass & `manual` creates a static hostpath PV
      # storageClassName: ""
      storageClassName: manual # SLX: need to set this to manual in order to bootstrap, update to Mayastor later
      # -- PVC's ReclaimPolicy, can be Delete or Retain
      reclaimPolicy: "Delete"
      # -- Size of Loki's persistence storage
      size: 10Gi
    # loki process run & file permissions, required if sc=manual
    securityContext:
      fsGroup: 1001
      runAsGroup: 1001
      runAsNonRoot: false
      runAsUser: 1001
    # initContainers to chown the static hostpath PV by 1001 user
    initContainers:
      - command: ["/bin/bash", "-ec", "chown -R 1001:1001 /data"]
        image: docker.io/bitnami/bitnami-shell:10
        imagePullPolicy: IfNotPresent
        name: volume-permissions
        securityContext:
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
          - mountPath: /data
            name: storage
    config:
      # Compactor is a BoltDB(loki term) Shipper specific service that reduces the index
      # size by deduping the index and merging all the files to a single file per table.
      # Ref: https://grafana.com/docs/loki/latest/operations/storage/retention/
      compactor:
        # Dictates how often compaction and/or retention is applied. If the
        # Compactor falls behind, compaction and/or retention occur as soon as possible.
        compaction_interval: 20m

        # If not enabled compactor will only compact table but they will not get
        # deleted
        retention_enabled: true

        # The delay after which the compactor will delete marked chunks
        retention_delete_delay: 1h

        # Specifies the maximum quantity of goroutine workers instantiated to
        # delete chunks
        retention_delete_worker_count: 50

      # Rentention period of logs is configured within the limits_config section
      limits_config:
        # configuring retention period for logs
        retention_period: 168h

    # Loki service parameters defines how the Loki service is exposed
    service:
      # K8s service type
      type: ClusterIP
      port: 3100
      # Port where REST endpoints of Loki are accessible from outside cluster
      nodePort: 31001

  # promtail configuration
  promtail:
    rbac:
      # create rbac roles for promtail
      create: true
      pspEnabled: false
    # -- Enables promtail for scraping logs from nodes
    enabled: true
    # -- Disallow promtail from running on the master node
    tolerations: []
    config:
      # -- The Loki address to post logs to
      lokiAddress: http://{{ .Release.Name }}-loki:3100/loki/api/v1/push
      snippets:
        # Promtail will export logs to loki only based on based on below
        # configuration, below scrape config will export only our services
        # which are labeled with openebs.io/logging=true
        scrapeConfigs: |
          - job_name: {{ .Release.Name }}-pods-name
            pipeline_stages:
              - docker: {}
            kubernetes_sd_configs:
            - role: pod
            relabel_configs:
            - source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: hostname
              action: replace
            - action: labelmap
              regex: __meta_kubernetes_pod_label_(.+)
            - action: keep
              source_labels:
              - __meta_kubernetes_pod_label_openebs_io_logging
              regex: true
              target_label: {{ .Release.Name }}_component
            - action: replace
              replacement: $1
              separator: /
              source_labels:
              - __meta_kubernetes_namespace
              target_label: job
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_name
              target_label: pod
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_container_name
              target_label: container
            - replacement: /var/log/pods/*$1/*.log
              separator: /
              source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
              target_label: __path__
obs:
  callhome:
    # -- Enable callhome
    enabled: true
    # -- Log level for callhome
    logLevel: "info"
    resources:
      limits:
        # -- Cpu limits for callhome
        cpu: "100m"
        # -- Memory limits for callhome
        memory: "32Mi"
      requests:
        # -- Cpu requests for callhome
        cpu: "50m"
        # -- Memory requests for callhome
        memory: "16Mi"
tiagolobocastro commented 1 year ago

Awesome, thank you for clarifying @Titus-von-Koeller! Btw, rather than keeping the full values.yaml, you can keep a subset with just the changes: helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 -f mod.yaml. You can also specify the overrides directly, for example: helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 --set='etcd.persistence.storageClass=manual,loki-stack.loki.persistence.storageClassName=manual'
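Written out as a file, a minimal mod.yaml carrying just those two overrides would look roughly like this (mirroring the --set paths above; everything else falls back to the chart defaults):

    # mod.yaml: only the bootstrap overrides discussed in this thread
    etcd:
      persistence:
        storageClass: manual
    loki-stack:
      loki:
        persistence:
          storageClassName: manual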

Titus-von-Koeller commented 1 year ago

@tiagolobocastro Thanks a lot for that useful hint and getting back to me so quickly.

The values.yaml (taken from v2.0.1) was indeed at fault.

I'm glad that I'll be able to avoid that issue in the future with your proposed patching approach.

Thanks again for your help and for building this amazing product. The level of quality of FOSS software these days is, in certain cases, just mind-boggling. I'll be on the lookout for more ways to contribute to the community as well.

I ran a demo with it yesterday using the benchmark, and we were super impressed when we realized that the write throughput completely maxed out our network connection. This will be great for high-performance ML applications (we need a new network interlink..), among other things.

csnyder616 commented 1 year ago

I'm guessing my issue was similar: when I upgraded, I used the --reuse-values flag.
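In that case, --reuse-values would have carried the previously supplied values forward, so an old full values.yaml pinning tag: v2.0.1 would stick to the release even after the chart version changes. A rough sketch of an upgrade that picks up the new chart defaults instead, assuming the release name and the minimal override file from earlier in this thread:

    helm upgrade mayastor mayastor/mayastor -n mayastor --version 2.1.0 --reset-values -f mod.yaml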