Titus-von-Koeller closed this issue 1 year ago.
I'm able to work around this by making two edits to the deployment for openebs-agent-core:
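As a rough illustration of that kind of change (the deployment and container names below are assumptions, not necessarily the two edits referred to above), the image tag could be bumped with:
kubectl -n mayastor set image deployment/mayastor-agent-core agent-core=docker.io/openebs/mayastor-agent-core:v2.1.0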
How odd, were these fresh installs of 2.1.0? Then the images shouldn't be set to 2.0.1 for the 2.1.0 release... :/ I've just double-checked both the helm tar and also tried a fresh install, and I do get 2.1.0 tags.
@Titus-von-Koeller is your values.yaml by any chance setting the image tags to 2.0.1? Could you share your values.yaml please? @csnyder616 do you perhaps also have a similar values.yaml setting the tags? If so, simply changing the container tag might not be ideal, as you might end up with an install with mixed tags.
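For reference, a sketch of how to check which values and image tags a release actually ended up with (release name and namespace are assumptions):
# user-supplied overrides only
helm get values mayastor -n mayastor
# fully merged values, including chart defaults
helm get values mayastor -n mayastor --all
# image tags actually running in the namespace
kubectl get pods -n mayastor -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'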
I've realized my mistake now.
Initially, I was working with Mayastor v2.0.1 and used the corresponding values.yaml file. However, by the time I performed a fresh install on a fresh cluster, v2.1.0 had been released. I installed the Helm chart v2.1.0 but used the v2.0.1-based values.yaml, as I didn't realize that the values.yaml file was version-dependent; I'm still new to using Helm.
I'll attach my values.yaml file for reference and will update you later today, or by tomorrow at the latest, on whether the installation succeeded with the v2.1.0-based values.yaml, and then close the issue. Apologies for this oversight, and thank you for the clarification.
image:
# -- Image registry to pull our product images
registry: docker.io
# -- Image registry's namespace
repo: openebs
# -- Release tag for our images
tag: v2.0.1
# -- ImagePullPolicy for our images
pullPolicy: IfNotPresent
# -- Node labels for pod assignment
# ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
# Note that if multi-arch images support 'kubernetes.io/arch: amd64'
# should be removed and set 'nodeSelector' to empty '{}' as default value.
nodeSelector:
kubernetes.io/arch: amd64
earlyEvictionTolerations:
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 5
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 5
base:
# -- Request timeout for rest & core agents
default_req_timeout: 5s
# -- Cache timeout for core agent & diskpool deployment
cache_poll_period: 30s
# -- Silence specific module components
logSilenceLevel:
initContainers:
enabled: true
containers:
- name: agent-core-grpc-probe
image: busybox:latest
command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50051; do date; echo "Waiting for agent-core-grpc services..."; sleep 1; done;']
- name: etcd-probe
image: busybox:latest
command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}}; do date; echo "Waiting for etcd..."; sleep 1; done;']
initHaNodeContainers:
enabled: true
containers:
- name: agent-cluster-grpc-probe
image: busybox:latest
command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50052; do date; echo "Waiting for agent-cluster-grpc services..."; sleep 1; done;']
initCoreContainers:
enabled: true
containers:
- name: etcd-probe
image: busybox:latest
command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}}; do date; echo "Waiting for etcd..."; sleep 1; done;']
# docker-secrets required to pull images if the container registry from image.Registry is protected
imagePullSecrets:
# -- Enable imagePullSecrets for pulling our container images
enabled: false
# Name of the imagePullSecret in the installed namespace
secrets:
- name: login
metrics:
# -- Enable the metrics exporter
enabled: true
# metrics refresh time
# WARNING: Lowering pollingInterval value will affect performance adversely
pollingInterval: "5m"
jaeger:
# -- Enable jaeger tracing
enabled: false
initContainer: true
agent:
name: jaeger-agent
port: 6831
initContainer:
- name: jaeger-probe
image: busybox:latest
command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 -u {{.Values.base.jaeger.agent.name}} {{.Values.base.jaeger.agent.port}}; do date; echo "Waiting for jaeger..."; sleep 1; done;']
initRestContainer:
enabled: true
initContainer:
- name: api-rest-probe
image: busybox:latest
command: ['sh', '-c', 'trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-api-rest 8081; do date; echo "Waiting for REST API endpoint to become available"; sleep 1; done;']
operators:
pool:
# -- Log level for diskpool operator service
logLevel: info
resources:
limits:
# -- Cpu limits for diskpool operator
cpu: "100m"
# -- Memory limits for diskpool operator
memory: "32Mi"
requests:
# -- Cpu requests for diskpool operator
cpu: "50m"
# -- Memory requests for diskpool operator
memory: "16Mi"
jaeger-operator:
# Name of jaeger operator
name: "{{ .Release.Name }}"
crd:
# Install jaeger CRDs
install: false
jaeger:
# Install jaeger-operator
create: false
rbac:
# Create a clusterRole for Jaeger
clusterRole: true
agents:
core:
# -- Log level for the core service
logLevel: info
resources:
limits:
# -- Cpu limits for core agents
cpu: "1000m"
# -- Memory limits for core agents
memory: "128Mi"
requests:
# -- Cpu requests for core agents
cpu: "500m"
# -- Memory requests for core agents
memory: "32Mi"
ha:
enabled: true
node:
# -- Log level for the ha node service
logLevel: info
resources:
limits:
# -- Cpu limits for ha node agent
cpu: "100m"
# -- Memory limits for ha node agent
memory: "64Mi"
requests:
# -- Cpu requests for ha node agent
cpu: "100m"
# -- Memory requests for ha node agent
memory: "64Mi"
cluster:
# -- Log level for the ha cluster service
logLevel: info
resources:
limits:
# -- Cpu limits for ha cluster agent
cpu: "100m"
# -- Memory limits for ha cluster agent
memory: "64Mi"
requests:
# -- Cpu requests for ha cluster agent
cpu: "100m"
# -- Memory requests for ha cluster agent
memory: "16Mi"
apis:
rest:
# -- Log level for the rest service
logLevel: info
# -- Number of replicas of rest
replicaCount: 1
resources:
limits:
# -- Cpu limits for rest
cpu: "100m"
# -- Memory limits for rest
memory: "64Mi"
requests:
# -- Cpu requests for rest
cpu: "50m"
# -- Memory requests for rest
memory: "32Mi"
# Rest service parameters define how the rest service is exposed
service:
# -- Rest K8s service type
type: ClusterIP
# Ports from where rest endpoints are accessible from outside the cluster, only valid if type is NodePort
nodePorts:
# NodePort associated with http port
http: 30011
# NodePort associated with https port
https: 30010
csi:
image:
# -- Image registry to pull all CSI Sidecar images
registry: registry.k8s.io
# -- Image registry's namespace
repo: sig-storage
# -- imagePullPolicy for all CSI Sidecar images
pullPolicy: IfNotPresent
# -- csi-provisioner image release tag
provisionerTag: v2.2.1
# -- csi-attacher image release tag
attacherTag: v3.2.1
# -- csi-node-driver-registrar image release tag
registrarTag: v2.1.0
controller:
# -- Log level for the csi controller
logLevel: info
resources:
limits:
# -- Cpu limits for csi controller
cpu: "32m"
# -- Memory limits for csi controller
memory: "128Mi"
requests:
# -- Cpu requests for csi controller
cpu: "16m"
# -- Memory requests for csi controller
memory: "64Mi"
node:
logLevel: info
topology:
segments:
openebs.io/csi-node: mayastor
# -- Add topology segments to the csi-node daemonset node selector
nodeSelector: false
resources:
limits:
# -- Cpu limits for csi node plugin
cpu: "100m"
# -- Memory limits for csi node plugin
memory: "128Mi"
requests:
# -- Cpu requests for csi node plugin
cpu: "100m"
# -- Memory requests for csi node plugin
memory: "64Mi"
nvme:
# -- The nvme_core module io timeout in seconds
io_timeout: "30"
# -- The ctrl_loss_tmo (controller loss timeout) in seconds
ctrl_loss_tmo: "1980"
# Kato (keep alive timeout) in seconds
keep_alive_tmo: ""
# -- The kubeletDir directory for the csi-node plugin
kubeletDir: /var/lib/kubelet
pluginMounthPath: /csi
socketPath: csi.sock
io_engine:
# -- Log level for the io-engine service
logLevel: info,io_engine=info
api: "v1"
target:
nvmf:
# -- NVMF target interface (ip, mac, name or subnet)
iface: ""
# -- Reservations Persist Through Power Loss State
ptpl: true
# -- Pass additional arguments to the Environment Abstraction Layer.
# Example: --set {product}.envcontext=iova-mode=pa
envcontext: ""
reactorFreezeDetection:
enabled: false
# -- The number of cpu that each io-engine instance will bind to.
cpuCount: "2"
# -- Node selectors to designate storage nodes for diskpool creation
# Note that if multi-arch images support 'kubernetes.io/arch: amd64'
# should be removed.
nodeSelector:
openebs.io/engine: mayastor
kubernetes.io/arch: amd64
resources:
limits:
# -- Cpu limits for the io-engine
cpu: ""
# -- Memory limits for the io-engine
memory: "1Gi"
# -- Hugepage size available on the nodes
hugepages2Mi: "2Gi"
requests:
# -- Cpu requests for the io-engine
cpu: ""
# -- Memory requests for the io-engine
memory: "1Gi"
# -- Hugepage size available on the nodes
hugepages2Mi: "2Gi"
etcd:
env:
# seeing CrashLoopBackOff, need to add debugging logging for analysis. Remove later!
ETCD_LOG_LEVEL: debug
# Pod labels; okay to remove the openebs logging label if required
podLabels:
app: etcd
openebs.io/logging: "true"
# -- Number of replicas of etcd
replicaCount: 3
# Kubernetes Cluster Domain
clusterDomain: cluster.local
# TLS authentication for client-to-server communications
# ref: https://etcd.io/docs/current/op-guide/security/
client:
secureTransport: false
# TLS authentication for server-to-server communications
# ref: https://etcd.io/docs/current/op-guide/security/
peer:
secureTransport: false
# Enable persistence using Persistent Volume Claims
persistence:
# -- If true, use a Persistent Volume Claim. If false, use emptyDir.
enabled: true
# -- Will define which storageClass to use in etcd's StatefulSets
# a `manual` storageClass will provision a hostpath PV on the same node
# an empty storageClass will use the default StorageClass on the cluster
# storageClass: ""
storageClass: manual # SLX: need to set this to manual in order to bootstrap, update to Mayastor later
# -- Volume size
size: 2Gi
# -- PVC's reclaimPolicy
reclaimPolicy: "Delete"
# -- Use a PreStop hook to remove the etcd members from the etcd cluster on container termination
# Ignored if lifecycleHooks is set or replicaCount=1
removeMemberOnContainerTermination: false
# -- AutoCompaction
# Since etcd keeps an exact history of its keyspace, this history should be
# periodically compacted to avoid performance degradation
# and eventual storage space exhaustion.
# Auto compaction mode. Valid values: "periodic", "revision".
# - 'periodic' for duration based retention, defaulting to hours if no time unit is provided (e.g. 5m).
# - 'revision' for revision number based retention.
autoCompactionMode: revision
# -- Auto compaction retention length. 0 means disable auto compaction.
autoCompactionRetention: 100
extraEnvVars:
# -- Raise alarms when backend size exceeds the given quota.
- name: ETCD_QUOTA_BACKEND_BYTES
value: "8589934592"
auth:
rbac:
create: false
enabled: false
allowNoneAuthentication: true
# Init containers parameters:
# volumePermissions: Change the owner and group of the persistent volume mountpoint to runAsUser:fsGroup values from the securityContext section.
#
volumePermissions:
# chown the mounted volume; this is required if a statically provisioned hostpath volume is used
enabled: true
# extra debug information on logs
debug: false
initialClusterState: "new"
# Pod anti-affinity preset
# Ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity
podAntiAffinityPreset: "hard"
# etcd service parameters defines how the etcd service is exposed
service:
# K8s service type
type: ClusterIP
# etcd client port
port: 2379
# Specify the nodePort(s) value(s) for the LoadBalancer and NodePort service types.
# ref: https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport
#
nodePorts:
# Port from where etcd endpoints are accessible from outside cluster
clientPort: 31379
peerPort: ""
loki-stack:
# -- Enable loki log collection for our components
enabled: true
loki:
rbac:
# -- Create rbac roles for loki
create: true
pspEnabled: false
# -- Enable loki installation as part of loki-stack
enabled: true
# Install loki with persistence storage
persistence:
# -- Enable persistence storage for the logs
enabled: true
# -- StorageClass for Loki's centralised log storage
# empty storageClass implies cluster default storageClass & `manual` creates a static hostpath PV
# storageClassName: ""
storageClassName: manual # SLX: need to set this to manual in order to bootstrap, update to Mayastor later
# -- PVC's ReclaimPolicy, can be Delete or Retain
reclaimPolicy: "Delete"
# -- Size of Loki's persistence storage
size: 10Gi
# loki process run & file permissions, required if sc=manual
securityContext:
fsGroup: 1001
runAsGroup: 1001
runAsNonRoot: false
runAsUser: 1001
# initContainers to chown the static hostpath PV by 1001 user
initContainers:
- command: ["/bin/bash", "-ec", "chown -R 1001:1001 /data"]
image: docker.io/bitnami/bitnami-shell:10
imagePullPolicy: IfNotPresent
name: volume-permissions
securityContext:
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /data
name: storage
config:
# Compactor is a BoltDB(loki term) Shipper specific service that reduces the index
# size by deduping the index and merging all the files to a single file per table.
# Ref: https://grafana.com/docs/loki/latest/operations/storage/retention/
compactor:
# Dictates how often compaction and/or retention is applied. If the
# Compactor falls behind, compaction and/or retention occur as soon as possible.
compaction_interval: 20m
# If not enabled, the compactor will only compact tables but they will not get
# deleted
retention_enabled: true
# The delay after which the compactor will delete marked chunks
retention_delete_delay: 1h
# Specifies the maximum quantity of goroutine workers instantiated to
# delete chunks
retention_delete_worker_count: 50
# Retention period of logs is configured within the limits_config section
limits_config:
# configuring retention period for logs
retention_period: 168h
# Loki service parameters defines how the Loki service is exposed
service:
# K8s service type
type: ClusterIP
port: 3100
# Port where REST endpoints of Loki are accessible from outside cluster
nodePort: 31001
# promtail configuration
promtail:
rbac:
# create rbac roles for promtail
create: true
pspEnabled: false
# -- Enables promtail for scraping logs from nodes
enabled: true
# -- Disallow promtail from running on the master node
tolerations: []
config:
# -- The Loki address to post logs to
lokiAddress: http://{{ .Release.Name }}-loki:3100/loki/api/v1/push
snippets:
# Promtail will export logs to Loki based only on the configuration below;
# the scrape config below will export only our services
# which are labeled with openebs.io/logging=true
scrapeConfigs: |
- job_name: {{ .Release.Name }}-pods-name
pipeline_stages:
- docker: {}
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels:
- __meta_kubernetes_pod_node_name
target_label: hostname
action: replace
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: keep
source_labels:
- __meta_kubernetes_pod_label_openebs_io_logging
regex: true
target_label: {{ .Release.Name }}_component
- action: replace
replacement: $1
separator: /
source_labels:
- __meta_kubernetes_namespace
target_label: job
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_container_name
target_label: container
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_uid
- __meta_kubernetes_pod_container_name
target_label: __path__
obs:
callhome:
# -- Enable callhome
enabled: true
# -- Log level for callhome
logLevel: "info"
resources:
limits:
# -- Cpu limits for callhome
cpu: "100m"
# -- Memory limits for callhome
memory: "32Mi"
requests:
# -- Cpu requests for callhome
cpu: "50m"
# -- Memory requests for callhome
memory: "16Mi"
Awesome, thank you for clarifying @Titus-von-Koeller!
Btw, rather than keeping the full values.yaml, you can keep a subset with just the changes:
helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 -f mod.yaml
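For example, based on the overrides used above, such a mod.yaml might contain just the storage-class changes (a sketch; include whatever else you actually changed):
# mod.yaml -- minimal overrides instead of the chart's full values.yaml
etcd:
  persistence:
    storageClass: manual            # hostpath-backed bootstrap storage
loki-stack:
  loki:
    persistence:
      storageClassName: manual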
You can also specify it directly, example:
helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 --set='etcd.persistence.storageClass=manual,loki-stack.loki.persistence.storageClassName=manual'
@tiagolobocastro Thanks a lot for that useful hint and getting back to me so quickly.
The values.yaml (taken from v2.0.1) was indeed at fault.
I'm glad that I'll be able to avoid that issue in the future with your proposed patching approach.
Thanks again for your help and for building this amazing product. The level of quality of FOSS software these days is, in certain cases, just mind-boggling. I'll be on the lookout for more ways to contribute to the community as well.
I ran a demo with it yesterday using the benchmark, and we were super impressed when we realized that the write throughput completely maxed out our network connection. Nice, this will be great for high-performance ML applications (we need a new network interconnect...), among other things.
I'm guessing my issue was similar: when I upgraded, I used the --reuse-values flag.
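If that was the cause, a sketch of an upgrade that avoids it is to pass the overrides file explicitly instead of --reuse-values (release name, namespace, and file name are assumptions):
helm upgrade mayastor mayastor/mayastor -n mayastor --version 2.1.0 -f mod.yaml
With --reuse-values, Helm reuses the previous release's values rather than the new chart's defaults, so an image tag carried over from an older release can survive the upgrade.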
First of all, thank you so much for your amazing work, it's greatly appreciated! Let me know if there is any info missing or if there is anything else I can help out with.
Bug description
The issue is a failed Helm chart-based install of Mayastor v2.1.0 on a Talos Linux-based on-prem Kubernetes cluster with no pre-existing storage classes. The issue is unlikely to be related to Talos Linux, as I have since successfully installed Mayastor 2.0.1 on the same cluster using the exact same steps as outlined below.
The mayastor-agent-core:v2.0.1 container fails: --pool-commitment isn't valid in this context.
Expected behavior
The expected behavior would be a clean and fully working Helm chart-based install of Mayastor v2.1.0 on the Talos Linux-based on-prem Kubernetes cluster with no pre-existing storage classes.
OS info
mayastor-agent-core image: v2.0.1 (as part of the official Helm chart)
To Reproduce
Steps to reproduce the behavior (following the official install instructions):
Prepare the Talos worker configuration, either using gen config (if you're generating a worker config from scratch) or by patching an existing node; a sketch of such a patch follows.
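As a rough sketch of the kind of worker patch the instructions call for, under the assumption that hugepages, the nvme_tcp kernel module, and the storage-node label are what is being configured (check the keys and values against the official docs for your Talos version):
# hypothetical patch file, e.g. mayastor-patch.yaml
machine:
  sysctls:
    vm.nr_hugepages: "1024"      # io-engine requires 2MiB hugepages
  kernel:
    modules:
      - name: nvme_tcp           # NVMe-oF TCP transport used by Mayastor
  nodeLabels:
    openebs.io/engine: mayastor  # matches io_engine.nodeSelector in values.yaml
The patch can be supplied when generating the worker config or applied to an existing node's machine config.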
Set the storageClass to manual for bootstrapping, as there are no storage classes available in the on-prem Talos cluster. Modify values.yaml as shown in the provided diff:
@@ -385,7 +386,8 @@ loki-stack:
       # -- StorageClass for Loki's centralised log storage
-      storageClassName: ""
+      # storageClassName: ""
+      storageClassName: manual # SLX: need to set this to manual in order to bootstrap, update to Mayastor later
Pre-apply the mayastor-namespace.yaml file (a sketch of such a manifest follows below). This was necessary, as Talos Linux enforces tight security policies by default; I found the relevant hint here.
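A minimal sketch of such a pre-applied namespace manifest, assuming the hint refers to the Pod Security Standards labels that Talos enforces by default (the exact labels here are an assumption):
apiVersion: v1
kind: Namespace
metadata:
  name: mayastor
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged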
Run the command helm install mayastor mayastor/mayastor -n mayastor --version 2.1.0 -f values.yaml.
Check the pod status with kubectl get pods -n mayastor.
CLI argument --pool-commitment isn't valid in this context. The faulty command seems to be:
Command Logs
Please refer to the following command outputs, which describe the issue in detail: