grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Gap in Read and Write when HA Prometheus replica changes #7495

Open andreimiclea99 opened 7 months ago

andreimiclea99 commented 7 months ago

Describe the bug

Every time the elected Prometheus replica changes in the HA tracker, there is a gap of around 30 seconds in Mimir writes and reads.

(screenshot: ~30 second gap in writes and reads)

Another issue I see when the Prometheus replica changes is duplicated values for some metrics, for instance count(count(container_memory_usage_bytes{namespace="$namespace",container="$container",pod=~"$pod"}) by (instance)), with the following output:

(screenshot: query result showing duplicated values)

The HA tracker replica changes when the elected Prometheus pod gets terminated, either because of node termination, an OOM kill, or simply pod deletion.

No data is lost, but we have some sensitive alerts in production that trigger when something like this happens.
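
For context, the HA tracker deduplication relies on both replicas sending the same cluster external label plus a unique __replica__ label, with the tracker enabled on the Mimir side. A minimal sketch using the default label names (values are illustrative, not the exact config used here):

# Prometheus, on each replica (sketch)
global:
  external_labels:
    cluster: prod                 # identical on both replicas
    __replica__: prometheus-0     # unique per replica, e.g. the pod name
---
# Mimir (sketch)
distributor:
  ha_tracker:
    enable_ha_tracker: true
limits:
  accept_ha_samples: true         # accept and deduplicate HA samples for the tenant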

Environment

Mimir and the two Prometheus replicas are running inside Kubernetes. The Mimir version is 2.11; I noticed the same behaviour on 2.9 and 2.10. For deployment I used the mimir-distributed Helm chart, version 5.0.0.

Additional Context

The Prometheus scrape interval is 30 seconds, and when this happens I don't see any error logs or resource spikes in the Mimir components. Not sure if it's relevant, but I am not using Memcached for caching.

dimitarvdimitrov commented 7 months ago

The dropped 30s of data and the duplicated series sound like expected behaviour. Have you tried tuning these three settings?

  # (advanced) Update the timestamp in the KV store for a given cluster/replica
  # only after this amount of time has passed since the current stored
  # timestamp.
  # CLI flag: -distributor.ha-tracker.update-timeout
  [ha_tracker_update_timeout: <duration> | default = 15s]

  # (advanced) Maximum jitter applied to the update timeout, in order to spread
  # the HA heartbeats over time.
  # CLI flag: -distributor.ha-tracker.update-timeout-jitter-max
  [ha_tracker_update_timeout_jitter_max: <duration> | default = 5s]

  # (advanced) If we don't receive any samples from the accepted replica for a
  # cluster in this amount of time we will failover to the next replica we
  # receive a sample from. This value must be greater than the update timeout
  # CLI flag: -distributor.ha-tracker.failover-timeout
  [ha_tracker_failover_timeout: <duration> | default = 30s]

andreimiclea99 commented 7 months ago

@dimitarvdimitrov

I forgot to mention that I had already changed those values to:

      ha_tracker:
        ha_tracker_update_timeout: 5s
        ha_tracker_update_timeout_jitter_max: 5s
        ha_tracker_failover_timeout: 12s

The screenshots above were taken with these values; they did reduce the gap to around 30 seconds. Before changing them the gap was 45 seconds, so there was some improvement.

dimitarvdimitrov commented 7 months ago

I still think the lost scrape is the documented and expected behaviour. Do you see anything in the docs that doesn't match what you observed?

There's also Grafana Agent's clustering mode, which doesn't rely on Mimir's HA tracker for failover, so the gaps might be a bit less noticeable around failures: https://grafana.com/docs/agent/latest/flow/concepts/clustering/
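
For anyone exploring that route, a very rough sketch of what turning on clustering could look like in the grafana-agent Helm chart values (key names are assumptions, so check the values reference for your chart version):

# values.yaml for the grafana-agent chart (sketch; keys assumed, verify per chart version)
agent:
  mode: flow
  clustering:
    enabled: true      # agents form a cluster and split scrape targets between them
controller:
  type: statefulset    # clustering wants a stable set of peer replicas
  replicas: 2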

andreimiclea99 commented 6 months ago

@dimitarvdimitrov To be honest, I am not sure whether this is the expected behaviour for my case, but it sounds like it.

Looks a bit better with these values:

distributor:
  extraArgs:
    distributor.ha-tracker.update-timeout-jitter-max: 2s
    distributor.ha-tracker.update-timeout: 2s
    distributor.ha-tracker.failover-timeout: 5s
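
For reference, the same three settings could presumably also be expressed as structured config instead of CLI flags, which keeps them next to the rest of the Mimir configuration; a sketch assuming the mimir-distributed chart's mimir.structuredConfig passthrough, with the same values as above:

mimir:
  structuredConfig:
    distributor:
      ha_tracker:
        ha_tracker_update_timeout: 2s
        ha_tracker_update_timeout_jitter_max: 2s
        # Keep the failover timeout above update timeout + jitter; Mimir
        # validates this relationship at startup.
        ha_tracker_failover_timeout: 5s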

Will try to do more tests.

What is a bit annoying is that, for a couple of minutes after the Prometheus replica change, queries return duplicated values for some metrics.

andreimiclea99 commented 1 week ago

After further research I managed to improve the behaviour by lowering querier.lookback-delta to 45s (from the 5m default); I use a 30s scrape_interval for Prometheus.
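
For reference, a sketch of how that flag might be passed through the Helm chart in the same way as the HA tracker flags above (assuming the querier component accepts it via extraArgs):

querier:
  extraArgs:
    # Shrink the lookback window from the 5m default so samples written before
    # the replica switch drop out of instant queries sooner (value illustrative).
    querier.lookback-delta: 45s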

I also tuned the Prometheus remote_write settings: https://prometheus.io/docs/practices/remote_write/
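
The knobs described on that page live under remote_write.queue_config in prometheus.yml; an illustrative sketch (example values only, and the push URL is a placeholder):

remote_write:
  - url: http://mimir-nginx.mimir.svc/api/v1/push   # placeholder Mimir push endpoint
    queue_config:
      capacity: 10000             # samples buffered per shard before blocking
      max_shards: 50              # upper bound on send parallelism
      max_samples_per_send: 2000  # batch size per request
      batch_send_deadline: 5s     # flush a partial batch after this long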

As you can see in the attached screenshots, the behaviour is better, but not perfect.