grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/

[querier and ruler] etcdserver: user name is empty #6792

Open NissesSenap opened 10 months ago

NissesSenap commented 10 months ago

Describe the bug

Running the querier together with a distributor that uses etcd gives the following error over and over again. I can see the same error in the ruler:

ts=2023-12-01T14:33:39.431729675Z caller=etcd.go:243 level=error msg="watch error" key=ha-tracker/ err="rpc error: code = InvalidArgument desc = etcdserver: user name is empty"

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir 2.10.0 with the Helm chart values defined below.
  2. Use the querier

Expected behavior

To my knowledge, the querier doesn't use etcd, so the error shouldn't show up there.

Environment

Additional Context

I use the mimir-distributed helm chart version 5.1.0

This is what a simplified version of my Mimir values file looks like.

serviceAccount:
  name: mimir
  annotations:
    iam.gke.io/gcp-service-account: grafana-mimir@project1-gke.iam.gserviceaccount.com

mimir:
  structuredConfig:
    limits:
      out_of_order_time_window: 5m
      max_global_series_per_user: 10000000
      ingestion_rate: 50000
      ingestion_burst_size: 1000000
      ruler_max_rules_per_rule_group: 50
      max_label_names_per_series: 40 # Due to legacy metrics, can be changed later
      max_global_exemplars_per_user: 100000
      compactor_blocks_retention_period: 90d
      accept_ha_samples: true
      ha_cluster_label: cluster
      ha_replica_label: prometheus_replica
      cardinality_analysis_enabled: true

    distributor:
      ha_tracker:
        enable_ha_tracker: true
        kvstore:
          store: etcd
          etcd:
            endpoints:
              - mimir-etcd.mimir.svc.cluster.local:2379
            username: root
            password: ${ETCD_ROOT_PASSWORD}

    blocks_storage:
      backend: gcs
      gcs:
        bucket_name: mimir-blocks
    alertmanager_storage:
      backend: gcs
      gcs:
        bucket_name: mimir-alertmanager
    ruler_storage:
      backend: gcs
      gcs:
        bucket_name: mimir-ruler
    ingester_client:
      grpc_client_config:
        grpc_compression: snappy

runtimeConfig:
  overrides:
    staging:
      compactor_blocks_retention_period: 30d

alertmanager:
  replicas: 3
  statefulSet:
    enabled: true

compactor:

distributor:
  replicas: 3
  extraEnvFrom:
    - secretRef:
        name: mimir-etcd-auth

ingester:
  replicas: 3
  zoneAwareReplication:
    enabled: false

admin-cache:
  enabled: true
  replicas: 2

chunks-cache:
  enabled: true
  replicas: 2

index-cache:
  enabled: true
  replicas: 3

metadata-cache:
  enabled: true

results-cache:
  enabled: true
  replicas: 2

minio:
  enabled: false

overrides_exporter:
  enabled: false

nginx:
  enabled: false

gateway:
  enabledNonEnterprise: true
  replicas: 3

rollout_operator:
  enabled: true

NissesSenap commented 10 months ago

So I found a workaround for this.

I think the problem is that even though the etcd config isn't used by the querier, environment variable expansion still happens when the Mimir config is loaded.

I added

  extraEnvFrom:
    - secretRef:
        name: mimir-etcd-auth

To both the querier and the ruler, which solved the issue. After this, I no longer get the error. I have no idea why this error doesn't happen on all Mimir deployments.
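
For reference, the relevant sections of my values file now look roughly like this (reusing the same mimir-etcd-auth secret that the distributor already mounts):

querier:
  extraEnvFrom:
    - secretRef:
        name: mimir-etcd-auth # same secret the distributor uses

ruler:
  extraEnvFrom:
    - secretRef:
        name: mimir-etcd-auth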

I will keep this open, and hopefully someone who knows the code base better than I do can find the issue or document it.

dimitarvdimitrov commented 10 months ago

Thanks for reporting this!

The querier uses some of the distributor code to find ingesters. As a result, it ends up initializing the distributor module (DistributorService in the code). Since queriers (and all components in the Helm chart) use the same config, the distributor module in the querier pod ends up starting the HA tracker as well. I can't think of an easy way to fix this in code.

This should be mostly harmless. The querier will only use some extra bandwidth following the HA tracker keys in etcd, which are relatively low-volume anyway.

If you don't want to mount the secret on rulers, queriers, and query-frontends, then you can set this flag on each one of them individually in helm:

querier:
  extraArgs:
    distributor.ha-tracker.enable: false
ruler:
  extraArgs:
    ...
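
Spelled out for all three components it would look roughly like this (I'm assuming query_frontend as the key name, by analogy with the other underscore-style component keys in these values):

querier:
  extraArgs:
    distributor.ha-tracker.enable: false
ruler:
  extraArgs:
    distributor.ha-tracker.enable: false
query_frontend:
  extraArgs:
    distributor.ha-tracker.enable: false
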
jmichalek132 commented 6 months ago

Hi, FYI I ran into this with the distributor too when using the Helm chart. My mistake was assuming the env variable would be created with the name ETCD_ROOT_PASSWORD given this config:

    extraEnvFrom:
      - name: ETCD_ROOT_PASSWORD
        secretRef:
          name: mimir-etcd
          key: password

Turns out it wasn't; the key from the secret was used as the name of the env variable instead. The error message was misleading: it complained about the username being empty, even though the username is set in the config, when the real problem was that the password wasn't set because it was expected under a different env variable name.
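
A sketch of what would have worked in my case, assuming the chart component also exposes an extraEnv field (rendered as regular Kubernetes env entries), is to reference the secret key explicitly:

    extraEnv:
      - name: ETCD_ROOT_PASSWORD # env var name that ${ETCD_ROOT_PASSWORD} in the Mimir config expands
        valueFrom:
          secretKeyRef:
            name: mimir-etcd
            key: password

Alternatively, with plain extraEnvFrom and a secretRef, the key inside the secret itself has to be named ETCD_ROOT_PASSWORD, which appears to be how the values file earlier in this issue works.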