thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

query: "Duplicate store address is provided" #7732

Open ovelicka opened 2 months ago

ovelicka commented 2 months ago

Thanos version used:

What happened: We are encountering an issue with duplicate stores being detected in our Thanos Query setup. The thanos_query_duplicated_store_addresses_total metric is increasing, indicating that the same store is being registered multiple times. Despite various changes and configurations, we haven't been able to resolve the issue.

Anything else we need to know: Query configuration:

- query
- --log.format=json
- --log.level=info
- --query.replica-label=prometheus_replica
- --query.auto-downsampling
- --store.response-timeout=60s
- --endpoint=dnssrv+thanos-store.monitoring.svc

We have tried specifying the endpoint in various formats, such as --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc, a static IP:port, or the DNS name of the endpoint, but nothing changed.

Running nslookup from the thanos-query pod returns something like this:

nslookup 10.20.20.166
    name = 10-20-20-166.thanos.monitoring.svc.cluster.local
    name = thanos-store-6w-0.thanos-store.monitoring.svc.cluster.local

Full logs to relevant components:

2024-09-12T11:11:29.636285634Z {"addr":"10.20.20.166:10901","caller":"query.go:822","level":"warn","msg":"Duplicate store address is provided","ts":"2024-09-12T11:11:29.636259629Z"}

dosubot[bot] commented 2 months ago

Hey there, @ovelicka! I'm here to help you with your Thanos issue. I'm a bot 🤖 that can help you solve bugs, answer questions, and become a contributor.

I found some similar issues and discussions that might help you resolve the problem with duplicate stores being detected in your Thanos Query setup:

  1. Odd store and Query deduplication behaviour - This discussion is still open and might provide some insights [1].
  2. found duplicate storeEndpoints producer (sidecar or ruler) - This issue is open and has no comments or solution yet [2].
  3. Dedup index reader requests - This issue is open with a suggested solution to wait for the first call to finish before making additional calls [3].
  4. query: metric type and scrape interval aware deduplication - This issue is open with a suggested solution to use Prometheus APIs for better deduplication [4].

To resolve the issue of duplicate stores being detected, ensure that the store addresses provided in your configuration are unique. The removeDuplicateEndpointSpecs function in Thanos handles this by logging a warning and incrementing the thanos_query_duplicated_store_addresses_total metric whenever a duplicate store address is detected.

Here is the relevant code snippet that handles the removal of duplicate store addresses:

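// removeDuplicateEndpointSpecs drops specs that share an address. Each repeated
// address logs a warning and increments the duplicatedStores counter.
func removeDuplicateEndpointSpecs(logger log.Logger, duplicatedStores prometheus.Counter, specs []*query.GRPCEndpointSpec) []*query.GRPCEndpointSpec {
    set := make(map[string]*query.GRPCEndpointSpec)
    for _, spec := range specs {
        addr := spec.Addr()
        if _, ok := set[addr]; ok {
            level.Warn(logger).Log("msg", "Duplicate store address is provided", "addr", addr)
            duplicatedStores.Inc()
        }
        // The last spec seen for an address overwrites any earlier one.
        set[addr] = spec
    }
    // Map iteration order is unspecified, so the order of the result is not stable.
    deduplicated := make([]*query.GRPCEndpointSpec, 0, len(set))
    for _, value := range set {
        deduplicated = append(deduplicated, value)
    }
    return deduplicated
}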
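
To see why the counter can keep growing even though the duplicate is filtered out every time, here is a minimal, self-contained sketch of the same logic on plain strings. The counter type and the dedupAddrs helper are hypothetical stand-ins, not Thanos APIs, and the sketch assumes the address list is rebuilt on every update cycle. Unlike the original, it keeps the input order.

```go
package main

import "fmt"

// counter is a hypothetical stand-in for the prometheus.Counter behind
// thanos_query_duplicated_store_addresses_total.
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

// dedupAddrs mirrors the shape of removeDuplicateEndpointSpecs on plain
// strings: every repeated address bumps the counter once, and each address
// is returned only once (in first-seen order).
func dedupAddrs(dup *counter, addrs []string) []string {
	seen := make(map[string]struct{}, len(addrs))
	out := make([]string, 0, len(addrs))
	for _, a := range addrs {
		if _, ok := seen[a]; ok {
			dup.Inc()
			continue
		}
		seen[a] = struct{}{}
		out = append(out, a)
	}
	return out
}

func main() {
	dup := &counter{}
	// Assumed resolver output: the same store address returned twice.
	resolved := []string{"10.20.20.166:10901", "10.20.20.166:10901"}

	// Simulate three update cycles over the same resolver output.
	for i := 0; i < 3; i++ {
		kept := dedupAddrs(dup, resolved)
		fmt.Printf("cycle %d: kept=%v duplicates_total=%d\n", i+1, kept, dup.n)
	}
}
```

Under these assumptions only one address is kept per cycle, yet the counter rises by one on every cycle, which would match a steadily increasing metric alongside a single repeated warning line.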

Ensure that your configuration does not include duplicate store addresses. If you are using DNS-based discovery, make sure that the DNS records do not resolve to the same store addresses multiple times. The prepareEndpointSet function uses removeDuplicateEndpointSpecs to filter out duplicates:

func prepareEndpointSet(
    g *run.Group,
    logger log.Logger,
    reg *prometheus.Registry,
    dnsProviders []*dns.Provider,
    duplicatedStores prometheus.Counter,
    strictStores []string,
    strictEndpoints []string,
    endpointGroupAddrs []string,
    strictEndpointGroups []string,
    dialOpts []grpc.DialOption,
    unhealthyStoreTimeout time.Duration,
    endpointInfoTimeout time.Duration,
    queryConnMetricLabels ...string,
) *query.EndpointSet {
    endpointSet := query.NewEndpointSet(
        time.Now,
        logger,
        reg,
        func() (specs []*query.GRPCEndpointSpec) {
            // Add strict & static nodes.
            for _, addr := range strictStores {
                specs = append(specs, query.NewGRPCEndpointSpec(addr, true))
            }

            for _, addr := range strictEndpoints {
                specs = append(specs, query.NewGRPCEndpointSpec(addr, true))
            }

            for _, dnsProvider := range dnsProviders {
                var tmpSpecs []*query.GRPCEndpointSpec

                for _, addr := range dnsProvider.Addresses() {
                    tmpSpecs = append(tmpSpecs, query.NewGRPCEndpointSpec(addr, false))
                }
                tmpSpecs = removeDuplicateEndpointSpecs(logger, duplicatedStores, tmpSpecs)
                specs = append(specs, tmpSpecs...)
            }

            for _, eg := range endpointGroupAddrs {
                spec := query.NewGRPCEndpointSpec(fmt.Sprintf("thanos:///%s", eg), false, extgrpc.EndpointGroupGRPCOpts()...)
                specs = append(specs, spec)
            }

            for _, eg := range strictEndpointGroups {
                spec := query.NewGRPCEndpointSpec(fmt.Sprintf("thanos:///%s", eg), true, extgrpc.EndpointGroupGRPCOpts()...)
                specs = append(specs, spec)
            }

            return specs
        },
        dialOpts,
        unhealthyStoreTimeout,
        endpointInfoTimeout,
        queryConnMetricLabels...,
    )

    // Periodically update the store set with the addresses we see in our cluster.
    {
        ctx, cancel := context.WithCancel(context.Background())
        g.Add(func() error {
            return runutil.Repeat(5*time.Second, ctx.Done(), func() error {
                endpointSet.Update(ctx)
                return nil
            })
        }, func(error) {
            cancel()
        })
    }

    return endpointSet
}
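
One way to check whether the DNS records resolve to the same store address more than once is to repeat the lookup outside of Thanos. The sketch below only approximates SRV-based discovery with the Go standard library (SRV lookup, then a host lookup per target); it is not the Thanos resolver, and resolveSRV and duplicateAddrs are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// duplicateAddrs returns every address that appears more than once,
// mapped to its number of occurrences.
func duplicateAddrs(addrs []string) map[string]int {
	counts := make(map[string]int, len(addrs))
	for _, a := range addrs {
		counts[a]++
	}
	for a, n := range counts {
		if n < 2 {
			delete(counts, a) // deleting during range is safe in Go
		}
	}
	return counts
}

// resolveSRV looks up the SRV records for name and expands every target
// into ip:port strings.
func resolveSRV(name string) ([]string, error) {
	// With empty service and proto, LookupSRV queries name exactly as given.
	_, records, err := net.LookupSRV("", "", name)
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, r := range records {
		ips, err := net.LookupHost(r.Target)
		if err != nil {
			return nil, err
		}
		for _, ip := range ips {
			addrs = append(addrs, net.JoinHostPort(ip, strconv.Itoa(int(r.Port))))
		}
	}
	return addrs, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: srvcheck <srv-name>")
		return
	}
	addrs, err := resolveSRV(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for addr, n := range duplicateAddrs(addrs) {
		fmt.Printf("%s resolved %d times\n", addr, n)
	}
}
```

Run from inside the thanos-query pod's network namespace with a name such as _grpc._tcp.thanos-store.monitoring.svc, any address printed here is a candidate for the warning in the logs. If nothing is printed, the duplication may come from somewhere other than the SRV records.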

Additionally, consider using the --store-strict flag introduced in Thanos Query, which ensures that only statically specified nodes are retained and always considered part of the active store set. This can help avoid issues with dynamic store nodes that might cause duplicates:

* Add a new flag to Thanos Query `--store-strict` which will only accept statically specified nodes and Thanos Query will always retain the last successfully retrieved information of them via the `Info()` gRPC method. Thus, they will always be considered as part of the active store set.

By ensuring that your store addresses are unique and properly configured, and by using the --store-strict flag if applicable, you can prevent the thanos_query_duplicated_store_addresses_total metric from increasing [5][6].
