Closed by csengerszabo 1 year ago
Note: consider shipping user cluster MLA within our forthcoming Applications feature.
This is a great issue, and we would be happy to help where possible. Another issue we ran into is the hard-coded Prometheus pod limits in the control plane. The pod gets into a bad state and starts failing when the WAL grows in size. 1Gi should be big enough, but we regularly see it failing and have to kill the pod to delete the WAL.
Check whether it makes sense to use the Grafana monitoring stack, as referred to in #126, before we move on to work on the installer.
After the initial research, we have decided to focus on adding MLA installation to the KKP installer first, and afterwards to test and replace Prometheus and Promtail in the user cluster with grafana-agent instances: kubermatic/kubermatic#10971
Research about Tempo will be taken care of later (next release probably): kubermatic/kubermatic#10974
Reference: kubermatic/ps-team-flotilla#103
@stroebitzer commented on Wed Jul 06 2022
While working on the KKP Admin training, I stumbled from one issue to the next when installing the User MLA stack into my KKP installation.
The current way of installing it is some kind of alpha version. To provide a smooth experience to our customers, we should enhance the installation process.
Maybe moving the installation from the hack/deploy-seed.sh script to our kubermatic-installer could be an option.
This ticket is about:
@talhalatiforakzai commented on Thu Jul 14 2022
Issues with installation of user mla
while deploying MLA stack through the helper script
This issue arises with yq version 4.25.2. To fix it, edit lines 31 and 35 in hack/fetch-chart-dependencies.sh.
Line 31, change:
chartname=$(yq read "$chartYAML" name)
into:
chartname=$(yq '.name' "$chartYAML")
Line 35, change:
for url in $(yq r "$chartYAML" dependencies --tojson | jq -r .[].repository); do
into:
for url in $(yq '.dependencies.[].repository' "$chartYAML"); do
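If the script has to keep working with both yq generations, the two syntaxes could be selected at runtime. This is only a minimal sketch. It assumes that `yq --version` prints a line containing `version 3.x.y` for v3 and `version 4.x.y` (optionally `v4.x.y`) for v4. The helper names `yq_major_version` and `chart_name` are hypothetical and not part of hack/fetch-chart-dependencies.sh.

```shell
# Hypothetical helpers; not part of the original script.

# Extract the major version from a `yq --version` output line.
# Assumes the line contains "version <major>.<minor>.<patch>",
# optionally with a "v" prefix on the number.
yq_major_version() {
  printf '%s\n' "$1" | sed -nE 's/.*version v?([0-9]+)\..*/\1/p'
}

# Print the chart name using the syntax that matches the installed yq.
# Runs in a subshell so the variables stay local.
chart_name() (
  chartYAML="$1"
  major="$(yq_major_version "$(yq --version 2>&1)")"
  if [ "$major" = "3" ]; then
    yq read "$chartYAML" name   # yq v3 syntax (original line 31)
  else
    yq '.name' "$chartYAML"     # yq v4 syntax (the fix described above)
  fi
)
```

With these in place, line 31 could become `chartname=$(chart_name "$chartYAML")`. Pinning a single yq major version in the tooling would be simpler, if that is acceptable.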
Partial installation of MLA stack in case of limited resources
The MLA stack partially fails when resources are limited, and the resources that depend on the failed components then fail to start as well. Clean up the installation and provision more resources before retrying. Maybe we can update the deploy script to check for resource availability before provisioning the MLA stack.
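A preflight check of this kind could look roughly like the sketch below. It is an illustration under several assumptions: the function names are hypothetical, `REQUIRED_MILLICORES` is a placeholder and not a documented KKP minimum, CPU quantities are assumed to be whole cores (`4`) or millicores (`3920m`), and allocatable CPU is only an upper bound because it ignores what running pods already request.

```shell
# Hypothetical preflight helpers; not part of the existing deploy script.

# Convert a Kubernetes CPU quantity ("4" or "3920m") to millicores.
# Fractional core values such as "0.5" are not handled in this sketch.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  printf '%s\n' "$(( $1 * 1000 ))" ;;
  esac
}

# Sum a whitespace-separated list of CPU quantities, in millicores.
sum_millicores() (
  total=0
  for q in $1; do
    total=$(( total + $(cpu_to_millicores "$q") ))
  done
  printf '%s\n' "$total"
)

# Fail early if the total allocatable CPU of all nodes is below a threshold.
# REQUIRED_MILLICORES is a placeholder value chosen for illustration only.
check_cluster_cpu() (
  required="${REQUIRED_MILLICORES:-4000}"
  cpus="$(kubectl get nodes -o jsonpath='{.items[*].status.allocatable.cpu}')"
  available="$(sum_millicores "$cpus")"
  if [ "$available" -lt "$required" ]; then
    echo "not enough allocatable CPU: ${available}m < ${required}m" >&2
    return 1
  fi
)
```

The deploy script could call `check_cluster_cpu || exit 1` before installing any chart. A similar check for memory would be needed for a complete picture.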
MLA stack causes other workloads to crash & restart
If the MLA stack is not installed on dedicated machine deployments, it causes other workloads to run out of memory/CPU. For this reason, the user should be informed and asked to use a separate MD with minimum specs to avoid any issues.
Pods are not scheduled on nodes provisioned specifically for user mla
I have created a machine deployment for user MLA so that all workloads related to user MLA are scheduled on these nodes, but for some reason all the other workloads get scheduled fine except for
MD Values
MLA Values
A quick fix is to move the nodeSelector and toleration settings outside of the cortex context.
Consul chart fails to install in case of no default storage
The pods are stuck in the Pending state, and describing the PVC shows that no persistent volumes are available for this claim and that no storage class is set. Basically, when no storage class is marked as the default, the consul chart rolls back the installation.
example solution
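One possible way to address this is to mark an existing storage class as the default using the standard Kubernetes annotation `storageclass.kubernetes.io/is-default-class`. This is a minimal sketch. The function names are hypothetical, the storage class name has to be supplied by the operator, and it is an assumption that this is suitable for the cluster in question.

```shell
# Hypothetical helpers for illustration.

# Build the patch that marks a StorageClass as the cluster default.
default_sc_patch() {
  printf '%s' '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}

# Mark the StorageClass named in $1 as the default.
# The name must be one listed by `kubectl get storageclass`.
set_default_storage_class() {
  kubectl patch storageclass "$1" -p "$(default_sc_patch)"
}
```

After running `set_default_storage_class <name>`, `kubectl get storageclass` should show `(default)` next to that class, and PVCs without an explicit storage class should be able to bind once the chart is reinstalled.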