I would encourage you to update to a recent version of Prometheus, as the TSDB code has been improved since v2.7.1. Also, the last screenshot shows an increase in queries, which can explain the increased CPU.
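If you want to confirm that from the Prometheus side, its own self-metrics show the query load directly. These are the standard Prometheus 2.x engine metrics and assume the chart's default self-scrape job is enabled:

```
# queries evaluated per second by the PromQL engine
rate(prometheus_engine_query_duration_seconds_count{slice="inner_eval"}[5m])

# queries currently executing
prometheus_engine_queries
```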
I'm closing it for now. If you have further questions, please use our user mailing list, which you can also search.
This seems to be about the Prometheus Operator itself, and there have also been lots of improvements there. One of those was a fix for a bug that caused lots of CPU usage. Please update the Prometheus Operator to v0.30+ too.
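If it helps, upgrading the chart pulls in a newer operator image as well. A sketch along these lines, where the release name and namespace are only examples and not taken from this issue:

```console
helm repo update
# upgrade to the latest stable/prometheus-operator chart, keeping the existing values
helm upgrade prometheus-operator stable/prometheus-operator --namespace monitoring --reuse-values
```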
Thanks for the kitten video, whoever used the accidentally posted Slack hook URL :sweat_smile:
Bug Report
Deployed the prometheus-operator Helm chart to an EKS cluster. The Prometheus instance is used for monitoring both the Kubernetes workloads and the CI/CD agents. Memory usage has been slowly increasing over time; it's ~19 GB at the moment. CPU usage has also grown, from 0.1 to 0.6 on average.
I used `tsdb` to analyze the Prometheus DB; it looks like the ephemeral nature of the CI/CD agents is causing high churn. High-cardinality labels don't seem to be an issue.
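For anyone wanting to reproduce the analysis, this was the standalone `tsdb` tool from the prometheus/tsdb repository (newer Prometheus releases ship the same functionality as `promtool tsdb analyze`). The data path below is an example; inside an operator-managed pod it is typically `/prometheus`:

```console
# summarize label/series cardinality and churn for the most recent block
tsdb analyze /prometheus
```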
What did you expect to see? Lower resource usage.
What did you see instead? Under which circumstances? High resource usage, with high churn, probably caused by ephemeral CI/CD agents.
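The churn is also visible from Prometheus' own TSDB metrics, without the external tool (standard Prometheus 2.x metric names):

```
# series currently held in the head block
prometheus_tsdb_head_series

# rate at which brand-new series are created, a rough proxy for churn
rate(prometheus_tsdb_head_series_created_total[5m])
```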
Environment
```yaml
## Create default rules for monitoring the cluster
defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    general: true
    k8s: true
    kubeApiserver: true
    kubePrometheusNodeAlerting: true
    kubePrometheusNodeRecording: true
    kubeScheduler: true
    kubernetesAbsent: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    node: true
    prometheusOperator: true
    prometheus: true

global:
  rbac:
    create: true
## Configuration for alertmanager
## ref: https://prometheus.io/docs/alerting/alertmanager/
alertmanager:

  ## Deploy alertmanager
  enabled: true

  ## Service account for Alertmanager to use.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  serviceAccount:
    create: true
    name: "alertmanager"

  ## Alertmanager configuration directives
  ## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
  ##      https://prometheus.io/webtools/alerting/routing-tree-editor/
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
    receivers:
  ## Alertmanager template files to format alerts
  ## ref: https://prometheus.io/docs/alerting/notifications/
  ##      https://prometheus.io/docs/alerting/notification_examples/
  templateFiles: {}
  #
  # An example template:
  #   template_1.tmpl: |-
  #       {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
  #
  #       {{ define "slack.myorg.text" }}
  #       {{- $root := . -}}
  #       {{ range .Alerts }}
  #         *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
  #         *Cluster:* {{ template "cluster" $root }}
  #         *Description:* {{ .Annotations.description }}
  #         *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
  #         *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
  #         *Details:*
  #           {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
  #           {{ end }}
  ingress:
    enabled: false
    annotations: {}
    labels: {}

    ## Hosts must be provided if Ingress is enabled.
    hosts: []
    # - alertmanager.domain.com

    ## TLS configuration for Alertmanager Ingress
    ## Secret must be manually created in the namespace
    tls: []
    # - secretName: alertmanager-general-tls
    #   hosts:
    #   - alertmanager.example.com

  ## Configuration for Alertmanager service
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    nodePort: 30903

    ## List of IP addresses at which the Prometheus server service is available
    ## Ref: https://kubernetes.io/docs/user-guide/services/#external-ips
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []

    ## Service type
    type: ClusterIP

  ## If true, create a serviceMonitor for alertmanager
  serviceMonitor:
    selfMonitor: true
  ## Settings affecting alertmanagerSpec
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#alertmanagerspec
  alertmanagerSpec:
    ## Standard object's metadata. More info: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#metadata
    ## Metadata Labels and Annotations gets propagated to the Alertmanager pods.
    podMetadata: {}

    ## Image of Alertmanager
    image:
      repository: quay.io/prometheus/alertmanager
      tag: v0.16.1

    ## Secrets is a list of Secrets in the same namespace as the Alertmanager object, which shall be mounted into the
    ## Alertmanager Pods. The Secrets are mounted into /etc/alertmanager/secrets/.
    secrets: []

    ## ConfigMaps is a list of ConfigMaps in the same namespace as the Alertmanager object, which shall be mounted into the Alertmanager Pods.
    ## The ConfigMaps are mounted into /etc/alertmanager/configmaps/.
    configMaps: []

    ## Log level for Alertmanager to be configured with.
    logLevel: info

    ## Size is the expected size of the alertmanager cluster. The controller will eventually make the size of the
    ## running cluster equal to the expected size.
    replicas: 1

    ## Time duration Alertmanager shall retain data for. Default is '120h', and must match the regular expression
    ## [0-9]+(ms|s|m|h) (milliseconds seconds minutes hours).
    retention: 120h

    ## Storage is the definition of how storage will be used by the Alertmanager instances.
    ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/storage.md
    storage: {}
    # volumeClaimTemplate:
    #   spec:
    #     storageClassName: gluster
    #     accessModes: ["ReadWriteOnce"]
    #     resources:
    #       requests:
    #         storage: 50Gi
    #   selector: {}

    ## The external URL the Alertmanager instances will be available under. This is necessary to generate correct URLs. This is necessary if Alertmanager is not served from root of a DNS name. string false
    externalUrl:

    ## The route prefix Alertmanager registers HTTP handlers for. This is useful, if using ExternalURL and a proxy is rewriting HTTP routes of a request, and the actual ExternalURL is still true,
    ## but the server serves requests under a different route prefix. For example for use with kubectl proxy.
    routePrefix: /

    ## If set to true all actions on the underlying managed objects are not going to be performed, except for delete actions.
    paused: false

    ## Define which Nodes the Pods are scheduled on.
    ## ref: https://kubernetes.io/docs/user-guide/node-selection/
    nodeSelector: {}

    ## Define resources requests and limits for single Pods.
    ## ref: https://kubernetes.io/docs/user-guide/compute-resources/
    resources: {}
    # requests:
    #   memory: 400Mi

    ## Pod anti-affinity can prevent the scheduler from placing Prometheus replicas on the same node.
    ## The default value "soft" means that the scheduler should prefer to not schedule two replica pods onto the same node but no guarantee is provided.
    ## The value "hard" means that the scheduler is required to not schedule two replica pods onto the same node.
    ## The value "" will disable pod anti-affinity so that no anti-affinity rules will be configured.
    podAntiAffinity: ""

    ## If anti-affinity is enabled sets the topologyKey to use for anti-affinity.
    ## This can be changed to, for example, failure-domain.beta.kubernetes.io/zone
    podAntiAffinityTopologyKey: kubernetes.io/hostname

    ## If specified, the pod's tolerations.
    ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
    tolerations: []
    # - key: "key"
    #   operator: "Equal"
    #   value: "value"
    #   effect: "NoSchedule"

    ## SecurityContext holds pod-level security attributes and common container settings.
    ## This defaults to non root user with uid 1000 and gid 2000. *v1.PodSecurityContext false
    ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      fsGroup: 2000

    ## ListenLocal makes the Alertmanager server listen on loopback, so that it does not bind against the Pod IP.
    ## Note this is only for the Alertmanager UI, not the gossip communication.
    listenLocal: false

    ## Containers allows injecting additional containers. This is meant to allow adding an authentication proxy to an Alertmanager pod.
    containers: []

    ## Priority class assigned to the Pods
    priorityClassName: ""

    ## AdditionalPeers allows injecting a set of additional Alertmanagers to peer with to form a highly available cluster.
    additionalPeers: []
## Using default values from https://github.com/helm/charts/blob/master/stable/grafana/values.yaml
grafana:
  enabled: true
  adminPassword: "JgxzUa9ZpJsMOFQHKXu5"

  ## Deploy default dashboards.
  defaultDashboardsEnabled: true

  grafana.ini:
    users:
      viewers_can_edit: false
    auth:
      disable_login_form: false
      disable_signout_menu: false
    auth.anonymous:
      enabled: true
      org_role: Viewer
    security:
      allow_embedding: true

  ## list of datasources to insert/update depending
  ## whats available in the database
  ## https://grafana.com/docs/features/datasources/cloudwatch/#configure-the-datasource-with-provisioning
  datasources:
    datasources.yaml:
      # The name is important, it seems...
      apiVersion: 1
      datasources:
      - name: CloudWatch Infra
        type: cloudwatch
        jsonData:
          authType: arn
          defaultRegion: eu-west-1
          assumeRoleArn: arn:aws:iam::${infrastructure_account_id}:role/${grafana_iam_role}
        editable: false

  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:

  extraInitContainers:
    - |
      set -x
      yq read "$PROVIDERS_YAML" "providers[*].options.path" \
        | cut -d- -f2- \
        | while read d; do mkdir -p "$d"; done
      env:

  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard
      searchNamespace: ALL
      folder: /habito/dashboards
      defaultFolderName: default
    datasources:
      enabled: true
      label: grafana_datasource
  ingress:
    ## If true, Prometheus Ingress will be created
    enabled: false

    ## Annotations for Prometheus Ingress
    annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"

    ## Labels to be added to the Ingress
    labels: {}

    ## Hostnames.
    ## Must be provided if Ingress is enabled.
    # hosts:
    #   - prometheus.domain.com
    hosts: []

    ## TLS configuration for prometheus Ingress
    ## Secret must be manually created in the namespace
    tls: []
    # - secretName: prometheus-general-tls
    #   hosts:
    #   - prometheus.example.com

  extraConfigmapMounts: []
  # - name: certs-configmap
  #   mountPath: /etc/grafana/ssl/
  #   configMap: certs-configmap
  #   readOnly: true

  ## If true, create a serviceMonitor for grafana
  serviceMonitor:
    selfMonitor: true
## Component scraping the kube api server
kubeApiServer:
  enabled: true
  tlsConfig:
    serverName: kubernetes
    insecureSkipVerify: false

  ## If your API endpoint address is not reachable (as in AKS) you can replace it with the kubernetes service
  relabelings: []
  # - sourceLabels:
  #   - __meta_kubernetes_namespace
  #   - __meta_kubernetes_service_name
  #   - __meta_kubernetes_endpoint_port_name
  #   action: keep
  #   regex: default;kubernetes;https
  # - targetLabel: __address__
  #   replacement: kubernetes.default.svc:443

  serviceMonitor:
    jobLabel: component
    selector:
      matchLabels:
        component: apiserver
        provider: kubernetes

## Component scraping the kubelet and kubelet-hosted cAdvisor
kubelet:
  enabled: true
  namespace: kube-system

  serviceMonitor:
    ## Enable scraping the kubelet over https. For requirements to enable this see
## Component scraping the kube controller manager
kubeControllerManager:
  enabled: true

  ## If your kube controller manager is not deployed as a pod, specify IPs it can be found on
  endpoints: []
  # - 10.141.4.22
  # - 10.141.4.23
  # - 10.141.4.24

  ## If using kubeControllerManager.endpoints only the port and targetPort are used
  service:
    port: 10252
    targetPort: 10252
    selector:
      k8s-app: kube-controller-manager

  serviceMonitor:
    ## Enable scraping kube-controller-manager over https.

## Component scraping coreDns. Use either this or kubeDns
coreDns:
  enabled: true
  service:
    port: 9153
    targetPort: 9153
    selector:
      k8s-app: coredns

## Component scraping kubeDns. Use either this or coreDns
kubeDns:
  enabled: false
  service:
    selector:
      k8s-app: kube-dns
## Component scraping etcd
kubeEtcd:
  enabled: true

  ## If your etcd is not deployed as a pod, specify IPs it can be found on
  endpoints: []
  # - 10.141.4.22
  # - 10.141.4.23
  # - 10.141.4.24

  ## Etcd service. If using kubeEtcd.endpoints only the port and targetPort are used
  service:
    port: 4001
    targetPort: 4001
    selector:
      k8s-app: etcd-server

  ## Configure secure access to the etcd cluster by loading a secret into prometheus and
  ## specifying security configuration below. For example, with a secret named etcd-client-cert
  ##
  ## serviceMonitor:
  ##   scheme: https
  ##   insecureSkipVerify: false
  ##   serverName: localhost
  ##   caFile: /etc/prometheus/secrets/etcd-client-cert/etcd-ca
  ##   certFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client
  ##   keyFile: /etc/prometheus/secrets/etcd-client-cert/etcd-client-key
  ##
  serviceMonitor:
    scheme: http
    insecureSkipVerify: false
    serverName: ""
    caFile: ""
    certFile: ""
    keyFile: ""
## Component scraping kube scheduler
kubeScheduler:
  enabled: true

  ## If your kube scheduler is not deployed as a pod, specify IPs it can be found on
  endpoints: []
  # - 10.141.4.22
  # - 10.141.4.23
  # - 10.141.4.24

  ## If using kubeScheduler.endpoints only the port and targetPort are used
  service:
    port: 10251
    targetPort: 10251
    selector:
      k8s-app: kube-scheduler

  serviceMonitor:
    ## Enable scraping kube-scheduler over https.

## Component scraping kube state metrics
kubeStateMetrics:
  enabled: true

## Configuration for kube-state-metrics subchart
kube-state-metrics:
  rbac:
    create: true
  podSecurityPolicy:
    enabled: true
## Deploy node exporter as a daemonset to all nodes
nodeExporter:
  enabled: true

  ## Use the value configured in prometheus-node-exporter.podLabels
  jobLabel: jobLabel

  serviceMonitor: {}
  ## metric relabel configs to apply to samples before ingestion.

## Configuration for prometheus-node-exporter subchart
prometheus-node-exporter:
  podLabels:
    ## Add the 'node-exporter' label to be used by serviceMonitor to match standard common usage in rules and grafana dashboards
  extraArgs:
## Manages Prometheus and Alertmanager components
prometheusOperator:
  enabled: true

  ## Service account for Prometheus Operator to use.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  serviceAccount:
    create: true
    name: ""

  ## Configuration for Prometheus operator service
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'

    ## Additional ports to open for Prometheus service
    ## ref: https://kubernetes.io/docs/concepts/services-networking/service/#multi-port-services

    ## Loadbalancer IP
    ## Only use if service.type is "loadbalancer"

    ## Service type
    ## NodePort, ClusterIP, LoadBalancer

  ## Deploy CRDs used by Prometheus Operator.
  createCustomResource: true

  ## Customize CRDs API Group
  crdApiGroup: monitoring.coreos.com

  ## Attempt to clean up CRDs created by Prometheus Operator.
  cleanupCustomResource: false

  ## Labels to add to the operator pod
  podLabels: {}

  ## Assign a PriorityClassName to pods if set
  priorityClassName: ""

  ## Define Log Format
  ## Use logfmt (default) or json-formatted logging
  logFormat: logfmt

  ## Decrease log verbosity to errors only
  logLevel: error

  ## If true, the operator will create and maintain a service for scraping kubelets
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/helm/prometheus-operator/README.md
  kubeletService:
    enabled: true
    namespace: kube-system

  ## Create a servicemonitor for the operator
  serviceMonitor:
    selfMonitor: true

  ## Resource limits & requests
  resources: {}
  # limits:
  #   cpu: 200m
  #   memory: 200Mi
  # requests:
  #   cpu: 100m
  #   memory: 100Mi

  ## Define which Nodes the Pods are scheduled on.
  ## ref: https://kubernetes.io/docs/user-guide/node-selection/
  nodeSelector: {}

  ## Tolerations for use with node taints
  ## ref: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
  tolerations: []
  # - key: "key"
  #   operator: "Equal"
  #   value: "value"
  #   effect: "NoSchedule"

  ## Assign the prometheus operator to run on specific nodes
  ## ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  affinity: {}
  # requiredDuringSchedulingIgnoredDuringExecution:
  #   nodeSelectorTerms:
  #   - matchExpressions:
  #     - key: kubernetes.io/e2e-az-name
  #       operator: In
  #       values:
  #       - e2e-az1
  #       - e2e-az2

  securityContext:
    runAsNonRoot: true
    runAsUser: 65534

  ## Prometheus-operator image
  image:
    repository: quay.io/coreos/prometheus-operator
    tag: v0.30.1
    pullPolicy: IfNotPresent

  ## Configmap-reload image to use for reloading configmaps
  configmapReloadImage:
    repository: quay.io/coreos/configmap-reload
    tag: v0.0.1

  ## Prometheus-config-reloader image to use for config and rule reloading
  prometheusConfigReloaderImage:
    repository: quay.io/coreos/prometheus-config-reloader
    tag: v0.30.1

  ## Set the prometheus config reloader side-car CPU limit. If unset, uses the prometheus-operator project default
  configReloaderCpu: 100m

  ## Set the prometheus config reloader side-car memory limit. If unset, uses the prometheus-operator project default
  configReloaderMemory: 25Mi

  ## Hyperkube image to use when cleaning up
  hyperkubeImage:
    repository: k8s.gcr.io/hyperkube
    tag: v1.12.1
    pullPolicy: IfNotPresent
## Deploy a Prometheus instance
prometheus:
  enabled: true

  ## Service account for Prometheuses to use.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
  serviceAccount:
    create: true
    name: ""

  ## Configuration for Prometheus service
  service:
    annotations: {}
    labels: {}
    clusterIP: ""

  rbac:
    ## Create role bindings in the specified namespaces, to allow Prometheus monitoring

  ## Configure pod disruption budgets for Prometheus
  ## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
  ## This configuration is immutable once created and will require the PDB to be deleted to be changed
  ## https://github.com/kubernetes/kubernetes/issues/45398
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
    maxUnavailable: ""

  ingress:
    enabled: false
    annotations: {}
    labels: {}

  serviceMonitor:
    selfMonitor: true

  ## Settings affecting prometheusSpec
  ## ref: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#prometheusspec
  prometheusSpec:

  additionalServiceMonitors: []
  ## Name of the ServiceMonitor to create
  # - name: ""
```
```
level=warn ts=2019-10-02T09:55:41.617255426Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/task-manager-accounts-api/0 target=http://10.0.2.188:9090/metrics msg="append failed" err="invalid metric type \"manager-accounts_requests_total counter\""
level=warn ts=2019-10-02T09:55:49.124152611Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/submission-tracker-api/0 target=http://10.0.2.241:9090/metrics msg="append failed" err="invalid metric type \"tracker_requests_total counter\""
level=warn ts=2019-10-02T09:55:49.67682979Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints ended with: too old resource version: 74574738 (74575856)"
level=warn ts=2019-10-02T09:55:57.619654811Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/task-manager-accounts-api/0 target=http://10.0.1.212:9090/metrics msg="append failed" err="invalid metric type \"manager-accounts_requests_total counter\""
level=warn ts=2019-10-02T09:56:03.09292573Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/submission-tracker-api/0 target=http://10.0.3.194:9090/metrics msg="append failed" err="invalid metric type \"tracker_requests_total counter\""
level=warn ts=2019-10-02T09:56:11.617093048Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/task-manager-accounts-api/0 target=http://10.0.2.188:9090/metrics msg="append failed" err="invalid metric type \"manager-accounts_requests_total counter\""
level=warn ts=2019-10-02T09:56:19.124170247Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/submission-tracker-api/0 target=http://10.0.2.241:9090/metrics msg="append failed" err="invalid metric type \"tracker_requests_total counter\""
level=warn ts=2019-10-02T09:56:27.620129768Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/task-manager-accounts-api/0 target=http://10.0.1.212:9090/metrics msg="append failed" err="invalid metric type \"manager-accounts_requests_total counter\""
level=warn ts=2019-10-02T09:56:33.093527932Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/submission-tracker-api/0 target=http://10.0.3.194:9090/metrics msg="append failed" err="invalid metric type \"tracker_requests_total counter\""
level=warn ts=2019-10-02T09:56:41.61707015Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/task-manager-accounts-api/0 target=http://10.0.2.188:9090/metrics msg="append failed" err="invalid metric type \"manager-accounts_requests_total counter\""
level=warn ts=2019-10-02T09:56:49.123975587Z caller=scrape.go:835 component="scrape manager" scrape_pool=back-office/submission-tracker-api/0 target=http://10.0.2.241:9090/metrics msg="append failed" err="invalid metric type \"tracker_requests_total counter\""
```
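Those `append failed` warnings look unrelated to the churn: they usually mean the target's `# TYPE` line cannot be parsed, for example because the metric name contains a `-`, which is not a legal character in the exposition format. For comparison, a well-formed exposure would look roughly like this (hypothetical metric name):

```
# TYPE task_manager_accounts_requests_total counter
task_manager_accounts_requests_total{code="200"} 1027
```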