kubermatic / mla

MLA (Monitoring, logging, alerting) solution for KKP.
Apache License 2.0

MLA large cluster stability fixes #81

Closed dharapvj closed 2 years ago

dharapvj commented 2 years ago

Supersedes #77. Fixes #76, #78.

Changes to fix issues observed in a large-cluster deployment of user-cluster MLA.

This PR brings the following fixes:

  1. If MinIO holds too many files, the MinIO pod cannot mount its volume (details of the issue and the suggested fix are in https://github.com/kubermatic/mla/issues/76).
  2. The Cortex compactor fails to start if there are too many metrics in storage. Following a suggestion from the Cortex team, the deleted_blocks_mark migration is turned off.
  3. Some pods were not being scraped because of the arbitrary limit of 30 labels that the Cortex chart sets by default. This limit is relaxed to 40.
  4. Almost all MLA-related pods were overprovisioned with high CPU requests that went largely unused, so the CPU requests are reduced to stop reserving CPU unnecessarily.
  5. The Prometheus scraping ports of the Loki distributed chart are now fixed for all deployments.

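
For illustration, fixes 2 and 3 could look roughly like the following Helm values fragment. This is a minimal sketch, not the exact diff of this PR: it assumes the Cortex chart passes the Cortex configuration through a top-level `config:` key, and that the relevant Cortex options are `block_deletion_marks_migration_enabled` (compactor) and `max_label_names_per_series` (limits). Check the key names against the chart and Cortex versions actually deployed.

```yaml
# Sketch of Cortex Helm values (assumed layout, see note above).
config:
  compactor:
    # Skip the block deletion marks migration at startup, which can
    # keep the compactor from starting when storage holds many blocks.
    block_deletion_marks_migration_enabled: false
  limits:
    # Raise the per-series label limit from the default of 30 to 40,
    # so series with more labels are no longer rejected.
    max_label_names_per_series: 40
```
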
rastislavs commented 2 years ago

/lgtm /approve

kubermatic-bot commented 2 years ago

LGTM label has been added.

Git tree hash: 2b0e1a6e94ef8e910bd235406f60a3aa797e768f

kubermatic-bot commented 2 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dharapvj, rastislavs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubermatic/mla/blob/main/OWNERS)~~ [rastislavs]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment