kubermatic / mla

MLA (Monitoring, logging, alerting) solution for KKP.
Apache License 2.0

Improving installation of user MLA stack #133

Closed csengerszabo closed 1 year ago

csengerszabo commented 1 year ago

Reference: kubermatic/ps-team-flotilla#103

@stroebitzer commented on Wed Jul 06 2022

While working on the KKP Admin training, I stumbled from one issue to the next when installing the User MLA stack into my KKP installation.

The current installation procedure feels like an alpha version. To provide a smooth experience for our customers, we should improve the installation process.

One option could be to move the installation from the hack/deploy-seed.sh script to our kubermatic-installer.

This ticket is about:


@talhalatiforakzai commented on Thu Jul 14 2022

Issues with installation of User MLA

While deploying the MLA stack through the helper script

This issue arises with yq version 4.25.2. To fix it, edit lines 31 and 35 in hack/fetch-chart-dependencies.sh:

line 31: change `chartname=$(yq read "$chartYAML" name)` to `chartname=$(yq '.name' "$chartYAML")`

line 35: change `for url in $(yq r "$chartYAML" dependencies --tojson | jq -r .[].repository); do` to `for url in $(yq '.dependencies.[].repository' "$chartYAML"); do`
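
If the script has to keep working with both yq generations, one option is to branch on the major version instead of hard-coding either syntax. The following is only a sketch under assumptions, not the fix applied in the repository: the function names `yq_major_version` and `chart_name` are hypothetical, and it assumes that `yq --version` prints a string whose last field is the version number (for example `yq version 3.4.1` or `yq (https://github.com/mikefarah/yq/) version 4.25.2`).

```shell
# Hypothetical helpers; not part of the mla repository.

# Extract the major version from a `yq --version` string.
# Assumes the version number is the last space-separated field,
# optionally prefixed with "v".
yq_major_version() {
  version_string="$1"
  version="${version_string##* }"   # last field, e.g. "4.25.2" or "v4.30.8"
  version="${version#v}"            # drop an optional leading "v"
  echo "${version%%.*}"             # keep everything before the first dot
}

# Read the chart name with the syntax that matches the installed yq.
chart_name() {
  chart_yaml="$1"
  major="$(yq_major_version "$(yq --version 2>&1)")"
  if [ "$major" -ge 4 ]; then
    yq '.name' "$chart_yaml"        # yq v4 expression syntax
  else
    yq read "$chart_yaml" name      # yq v3 subcommand syntax
  fi
}
```

The same branch could wrap the dependency loop on line 35. Pinning a single supported yq version in the documentation would be a simpler alternative.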

fetching charts

Error: parsing expression: Lexer error: could not match text starting at 1:1 failing at 1:4.

unmatched text: "rea"

Installing Minio

Release "minio" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: minio

Installing Minio Bucket Lifecycle Manager

Release "minio-lifecycle-mgr" does not exist. Installing it now.

W0713 10:41:42.545257 29382 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob

W0713 10:41:43.212232 29382 warnings.go:70] batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob

NAME: minio-lifecycle-mgr

LAST DEPLOYED: Wed Jul 13 10:41:40 2022

NAMESPACE: mla

STATUS: deployed

REVISION: 1

TEST SUITE: None

Installing Grafana

Release "grafana" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: grafana

Installing Grafana Dashboards

configmap/grafana-dashboards-kkp-kubernetes created

configmap/grafana-dashboards-kubernetes-overview created

Installing Consul for Cortex

Release "consul" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: consul

Installing Cortex

configmap/cortex-runtime-config created

Release "cortex" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: cortex, memcached, memcached, memcached, memcached, memcached, memcached, memcached

Installing Loki

Release "loki-distributed" does not exist. Installing it now.

Error: An error occurred while checking for chart dependencies. You may need to run `helm dependency build` to fetch missing dependencies: found in Chart.yaml, but missing in charts/ directory: loki-distributed

Installing Alertmanager Proxy

Release "alertmanager-proxy" does not exist. Installing it now.

walk.go:74: found symbolic link in path: /home/talha/kubermatic/user-mla-issues/mla/charts/alertmanager-proxy/test/test.sh resolves to /home/talha/kubermatic/user-mla-issues/mla/hack/test-chart-rendering.sh. Contents of linked file included and used

NAME: alertmanager-proxy

LAST DEPLOYED: Wed Jul 13 10:41:51 2022

NAMESPACE: mla

STATUS: deployed

REVISION: 1

TEST SUITE: None
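
Every `missing in charts/ directory` error in the log above comes from charts whose dependencies were never fetched, while charts without external dependencies (such as alertmanager-proxy) installed fine. A deploy script could detect this state before calling helm at all. The sketch below is an assumption about how such a check might look: `charts_missing_deps` is a hypothetical function, and it assumes a layout of `<root>/<chart>/Chart.yaml` with fetched dependencies stored in `<root>/<chart>/charts/`.

```shell
# Hypothetical pre-flight check; not part of the mla repository.
# Prints every chart directory under $1 whose Chart.yaml declares
# dependencies but whose charts/ subdirectory is missing or empty.
charts_missing_deps() {
  charts_root="$1"
  for chart_yaml in "$charts_root"/*/Chart.yaml; do
    [ -f "$chart_yaml" ] || continue
    chart_dir="$(dirname "$chart_yaml")"
    # Skip charts that declare no dependencies at all.
    grep -q '^dependencies:' "$chart_yaml" || continue
    # An absent or empty charts/ directory means nothing was fetched.
    if [ -z "$(ls -A "$chart_dir/charts" 2>/dev/null)" ]; then
      echo "$chart_dir"
    fi
  done
}
```

For each directory it prints, running `helm dependency build <dir>` (the command the error message itself suggests) before the install should avoid the failures shown above.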

Partial installation of MLA stack in case of limited resources

The MLA stack partially fails when cluster resources are insufficient, and the components that depend on the failed ones then fail to start as well. Clean up the installation and provision more resources before retrying. Maybe we can update the deploy script to check resource availability before provisioning the MLA stack.
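
A pre-flight check along the lines suggested above might look roughly like this. It is a sketch under assumptions: the function names and the 16384 MiB threshold are purely illustrative (the issue does not state the real minimum), and only the `Ki`, `Mi`, and `Gi` suffixes are handled.

```shell
# Hypothetical pre-flight check; names and threshold are illustrative.

# Convert a Kubernetes memory quantity with a Ki, Mi, or Gi suffix to MiB.
# Other suffixes and plain byte counts are not handled in this sketch.
mem_to_mib() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Sum the given quantities; succeed only if the total reaches the minimum (MiB).
has_enough_memory() {
  min_mib="$1"; shift
  total=0
  for quantity in "$@"; do
    mib="$(mem_to_mib "$quantity")" || return 1
    total=$(( total + mib ))
  done
  [ "$total" -ge "$min_mib" ]
}

# Example call against a live cluster (allocatable memory of all nodes):
#   has_enough_memory 16384 $(kubectl get nodes \
#     -o jsonpath='{.items[*].status.allocatable.memory}') \
#     || echo "not enough allocatable memory for the MLA stack" >&2
```

Allocatable memory does not account for what running pods already request, so this is only a coarse check. A more careful version would subtract existing requests and also check CPU.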

MLA stack causes other workloads to crash and restart

If the MLA stack is not installed on dedicated machine deployments, it causes other workloads to run out of memory/CPU. Users should therefore be informed and asked to use a separate MD with minimum specs to avoid such issues.

Pods are not scheduled on nodes provisioned specifically for User MLA

I created a machine deployment for User MLA so that all User MLA workloads are scheduled on these nodes, but for some reason all the other workloads get scheduled fine except for

MD Values

    spec:
      metadata:
        labels:
          machinepool: run-stables-mla-az-d
          machine: run-stables-mla-az-b
          workload: infra-mla
          node-role.kubernetes.io/infra-mla: ""
      taints:
        - effect: NoSchedule
          key: node-role.kubernetes.io/infra-mla

MLA Values

cortex:
  memcached-blocks-metadata:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

  memcached-blocks-index:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

  memcached-blocks:
    resources:
      requests:
        cpu: 5m
    serviceAccount:
      create: false
    tolerations:
      - effect: NoSchedule
        operator: Exists
        key: node-role.kubernetes.io/infra-mla
    nodeSelector:
      workload: infra-mla

A quick fix is to move the nodeSelector and tolerations for these components out of the cortex block and define them at the top level:

cortex:
  .....
  .....

memcached-blocks:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

memcached-blocks-index:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla

memcached-blocks-metadata:
  tolerations:
    - effect: NoSchedule
      operator: Exists
      key: node-role.kubernetes.io/infra-mla
  nodeSelector:
    workload: infra-mla
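
Helm generally does not reject unknown value keys unless a chart ships a values schema, so a misplaced `memcached-blocks*` block can go unnoticed until pods fail to schedule. A small lint could catch it before deploying. This is a sketch under assumptions: `check_memcached_top_level` is a hypothetical function, and it assumes a block-style YAML values file in which any indented `memcached-blocks*` key is nested under another key.

```shell
# Hypothetical lint; not part of the mla repository.
# Prints every line that defines a memcached-blocks* key with leading
# indentation (i.e. nested under another key) and fails if any is found.
check_memcached_top_level() {
  values_file="$1"
  if grep -nE '^[[:space:]]+memcached-blocks[a-z-]*:' "$values_file"; then
    return 1
  fi
  return 0
}
```

Running it against the first values file in this comment would print the three nested keys and return a non-zero status, while the corrected layout passes.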

Consul chart fails to install in case of no default storage

The pods stay in Pending state, and describing the PVC shows that no persistent volumes are available for the claim and no storage class is set. Basically, when no storage class is marked as the default, the Consul chart rolls back the installation.

Example solution:

metadata:
  name: kubermatic-fast
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'
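
To detect a missing default before installing Consul, the deploy script could inspect the storage class annotations first. The sketch below is an assumption, not existing tooling: `default_storageclasses` is a hypothetical function that reads `name<TAB>flag` lines, and the commented kubectl query uses the `storageclass.kubernetes.io/is-default-class` annotation, so clusters that only carry the beta annotation shown above would need the query adjusted.

```shell
# Hypothetical helper; not part of the mla repository.
# Reads "name<TAB>is-default" lines on stdin and prints the names of
# storage classes marked as default. Fails if there is none.
default_storageclasses() {
  awk -F '\t' '$2 == "true" { print $1; found = 1 } END { exit found ? 0 : 1 }'
}

# Example input from a live cluster:
#   kubectl get storageclass -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}{"\n"}{end}' \
#     | default_storageclasses || echo "no default StorageClass set" >&2
```

Failing early with a clear message is likely easier to diagnose than a Consul release that rolls back after its PVCs stay pending.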
csengerszabo commented 1 year ago

Note: consider shipping user cluster MLA within our forthcoming Applications feature.

ewassef commented 1 year ago

This is a great issue, and we would be happy to help where possible. Another issue we ran into is the hard-coded Prometheus pod limits in the control plane. The pods get into a bad state and start failing when the WAL increases in size. 1Gi should be big enough, but we regularly see it failing and have to kill the pod to delete the WAL.

wurbanski commented 1 year ago

Check if it makes sense to use the Grafana monitoring stack as referred to in #126 before we move on to work on the installer.

After the initial research, we have decided to focus on adding MLA installation to the KKP installer first, and afterwards to test and replace Prometheus and Promtail in the user cluster with grafana-agent instances: kubermatic/kubermatic#10971

Research about Tempo will be taken care of later (next release probably): kubermatic/kubermatic#10974