grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Discrepancy between Mimir and the actual data #5972

Open girishms-sentient opened 1 year ago

girishms-sentient commented 1 year ago

Discussed in https://github.com/grafana/mimir/discussions/5603

Originally posted by **girishms-sentient** July 27, 2023

Mimir is configured and running, but the data I'm seeing in the Mimir Grafana dashboard does not match the data shown by the cluster's existing Grafana. Sometimes the displayed values are two or three times the actual values.

**Actual Grafana dashboard:** *(screenshot missing)*

**Mimir Grafana:**

![unknown](https://github.com/grafana/mimir/assets/107147378/f48f8728-3a68-4e71-b00d-92ffa6cd3aed)
![unknown](https://github.com/grafana/mimir/assets/107147378/29061301-b48a-4277-8f9b-a9862bad55c6)

**Here's the Mimir config:**

```
################ MIMIR CONFIGURATION #####################
mimir:
  structuredConfig:
    ingester:
      ring:
        final_sleep: 0s
        num_tokens: 512
        tokens_file_path: /data/tokens
        heartbeat_period: 20s
        heartbeat_timeout: 60s
        unregister_on_shutdown: true
        kvstore:
          store: memberlist
        replication_factor: 3
        zone_awareness_enabled: true
    memberlist:
      abort_if_cluster_join_fails: false
      compression_enabled: false
      join_members:
        - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}
      advertise_addr: ${MY_POD_IP}
    limits:
      compactor_blocks_retention_period: 604800s
      ingestion_rate: 500000
      max_global_series_per_metric: 9000000
      max_global_series_per_user: 9000000
      max_label_names_per_series: 60
    distributor:
      ring:
        kvstore:
          store: memberlist
    compactor:
      data_dir: /data/compactor
      sharding_ring:
        heartbeat_period: 20s
        heartbeat_timeout: 60s
        kvstore:
          store: memberlist
    store_gateway:
      sharding_ring:
        heartbeat_period: 20s
        heartbeat_timeout: 60s
        zone_awareness_enabled: true
        kvstore:
          store: memberlist
    ruler:
      rule_path: /data/ruler
      poll_interval: 2s
      ring:
        heartbeat_period: 20s
        heartbeat_timeout: 60s
        kvstore:
          store: memberlist
    ############## MIMIR STORAGE ################
    blocks_storage:
      backend: s3
      bucket_store:
        max_chunk_pool_bytes: 12884901888 # 12GiB
      s3:
        endpoint: s3.us-west-2.amazonaws.com
        bucket_name: central-logging-mimir-bucket-block-storage
        insecure: true
      tsdb:
        dir: /data/tsdb
    alertmanager_storage:
      backend: s3
      s3:
        endpoint: s3.us-west-2.amazonaws.com
        bucket_name: central-logging-mimir-bucket-alertmanager-storage
    ruler_storage:
      backend: s3
      s3:
        endpoint: s3.us-west-2.amazonaws.com
        bucket_name: central-logging-mimir-bucket-ruler-storage
#############################################################
global:
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
serviceAccount:
  create: true
  name: mimir
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx
alertmanager:
  persistentVolume:
    enabled: true
    storageClass: ebs-sc
  replicas: 2
  resources:
    limits:
      memory: 1.4Gi
    requests:
      cpu: 1
      memory: 1Gi
  statefulSet:
    enabled: true
  podAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
compactor:
  persistentVolume:
    size: 5Gi
    storageClass: ebs-sc
  podAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
distributor:
  replicas: 2
  podAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
ingester:
  persistentVolume:
    size: 50Gi
    storageClass: ebs-sc
  replicas: 3
  podAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
ruler:
  replicas: 1
  podAnnotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::xxxxxxxxxxx:role/xxxxxxxxxx
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
store_gateway:
  persistentVolume:
    size: 10Gi
    storageClass: ebs-sc
  replicas: 3
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
querier:
  replicas: 1
  extraArgs:
    memberlist.bind-addr: ${MY_POD_IP}
admin-cache:
  enabled: true
  replicas: 2
chunks-cache:
  enabled: true
  replicas: 2
index-cache:
  enabled: true
  replicas: 1
metadata-cache:
  enabled: true
results-cache:
  enabled: true
  replicas: 2
minio:
  enabled: false
overrides_exporter:
  replicas: 1
  resources:
    limits:
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi
query_frontend:
  replicas: 1
nginx:
  service:
    type: LoadBalancer
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-ip-address-type: ipv4
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
      service.beta.kubernetes.io/aws-load-balancer-subnets: "xxxxxxxxx"
      service.beta.kubernetes.io/aws-load-balancer-type: external
  replicas: 1
  resources:
    limits:
      memory: 731Mi
    requests:
      cpu: 1
      memory: 512Mi
# Grafana Enterprise Metrics feature related
admin_api:
  replicas: 1
  resources:
    limits:
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 64Mi
gateway:
  replicas: 1
  resources:
    limits:
      memory: 731Mi
    requests:
      cpu: 1
      memory: 512Mi
```

**Prometheus remote write:**

```
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: http://xxxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push
```

Is there anything I'm missing, or is there a way to solve this issue?
girishms-sentient commented 1 year ago

This issue is blocking us from going into production with Grafana Mimir. I previously raised this as a discussion: https://github.com/grafana/mimir/discussions/5603

Any help on this would be really appreciated.

gonzalo-dibiase-webbeds commented 1 year ago

As in the previous ticket, it looks like you are scraping the same targets multiple times. I assume you are running kube-prometheus-stack (kps) with 3 Prometheus replicas that all have the same external labels.
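If that is the case, one fix is to make the replicas distinguishable with the labels Mimir's HA deduplication expects: a `cluster` label shared by all replicas and a per-replica `__replica__` label. Here is a minimal sketch of the Prometheus side, assuming your chart exposes the Prometheus Operator fields `externalLabels` and `replicaExternalLabelName` (verify against your chart version; the `cluster` value is a placeholder):

```
prometheus:
  prometheusSpec:
    replicas: 3
    # Shared label: Mimir's HA tracker uses it to group replicas of the
    # same Prometheus HA set ("cluster" is the default label name).
    externalLabels:
      cluster: central-logging   # placeholder, pick your own value
    # Per-replica label: Mimir elects one replica per cluster and drops
    # samples from the others ("__replica__" is the default label name).
    replicaExternalLabelName: __replica__
    remoteWrite:
      - url: http://xxxxxxxxxx.elb.us-west-2.amazonaws.com/api/v1/push
```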

I suggest you read the docs about HA deduplication.

https://grafana.com/docs/mimir/latest/configure/configure-high-availability-deduplication/
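Deduplication also has to be enabled on the Mimir side. A minimal sketch of the relevant `structuredConfig` additions, assuming the default label names and that the field names match the configuration reference linked above (as far as I know the HA tracker KV store must be Consul or etcd rather than memberlist; the etcd endpoint below is a placeholder):

```
mimir:
  structuredConfig:
    limits:
      # Accept samples carrying the HA labels and deduplicate them
      # instead of storing every replica's copy.
      accept_ha_samples: true
      ha_cluster_label: cluster       # default
      ha_replica_label: __replica__   # default
    distributor:
      ha_tracker:
        enable_ha_tracker: true
        kvstore:
          store: etcd                 # HA tracker needs consul or etcd, not memberlist
          etcd:
            endpoints:
              - etcd.mimir.svc.cluster.local:2379   # placeholder endpoint
```

With both sides in place, Mimir should keep samples from only one elected replica per cluster, which would explain (and remove) the 2-3x inflation you are seeing.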