JayChanggithub opened 1 year ago
Experiencing a similar duplication issue.
ts=2024-06-28T07:06:10.839582748Z caller=writer.go:157 level=debug name=receive component=receive component=receive-writer tenant=default-tenant msg="Duplicate sample for timestamp" lset="{__name__=\"extra_kube_persistentvolumeclaim_labels\", container=\"kube-state-metrics\", endpoint=\"http\", exported_namespace=\"enviro-master-civil-finances\", instance=\"10.131.2.153:8080\", job=\"kube-state-metrics\", label_enviro_group_com_storage_key=\"m14-82-w19-nb\", label_enviro_group_com_storage_parent_id=\"0ba87ea0-868f-4284-8da9-c154ea4d5ade\", label_enviro_group_com_storage_type=\"master\", namespace=\"monitoring\", persistentvolumeclaim=\"9399b50d-20d9-45ad-931a-ed4c86ad1d44\", pod=\"extra-kube-state-metrics-6b8f9b7868-lxr72\", prometheus=\"openshift-user-workload-monitoring/user-workload\", prometheus_replica=\"prometheus-user-workload-1\", service=\"extra-kube-state-metrics\"}" value=NaN timestamp=1719558370533
ts=2024-06-28T07:06:10.840190991Z caller=writer.go:253 level=info name=receive component=receive component=receive-writer tenant=default-tenant msg="Error on ingesting samples with different value but same timestamp" numDropped=1
ts=2024-06-28T07:06:10.841305363Z caller=handler.go:584 level=debug name=receive component=receive component=receive-handler tenant=default-tenant msg="failed to handle request" err="add 1 samples: duplicate sample for timestamp"
I only encounter those for persistentvolumeclaim metrics sent by kube-state-metrics. I run Thanos Receive v0.35.0, configured as follows (a possible workaround is sketched after the args below):
args:
- receive
- --label=receive_cluster="preprod"
- --tsdb.path=/remote-write-data
- --debug.name=receive
- --log.level=debug
- --grpc-address=0.0.0.0:19891
- --http-address=0.0.0.0:18091
- --remote-write.address=0.0.0.0:19291
- --objstore.config-file=/etc/prometheus/objstore-config.yaml
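In case it helps others hitting the same persistentvolumeclaim duplicates: a minimal sketch of a Prometheus write_relabel_configs rule that drops the offending series at the remote-write boundary while the root cause of the duplication is investigated. The metric name comes from the log above; the URL is a placeholder for your Receive endpoint.

remote_write:
  - url: http://thanos-receive.example.svc:19291/api/v1/receive  # placeholder endpoint
    write_relabel_configs:
      # Temporarily drop the series reported as "Duplicate sample for timestamp"
      # until it is clear which scrape is producing the second copy.
      - source_labels: [__name__]
        regex: extra_kube_persistentvolumeclaim_labels
        action: drop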
Hi Team
I'd like to follow up on some symptoms in our landscape. We adopted thanos-receive a while ago to collect the metrics remote-written from the Prometheus in each of our Kubernetes clusters. We use the query
count(count(up{app="prometheus"}) by (cluster))
to retrieve the number of Prometheus instances that are up over a time interval. Every so often, some Prometheus instances hit an error, found via stern -n cic-system prometheus-prometheus-0 | grep 'component=remote level=error':
prometheus-prometheus-0 prometheus ts=2023-04-21T03:32:34.904Z caller=dedupe.go:112 component=remote level=error remote_name=11063c url=http://thanos-receive.thanos.xxxxxxxxxx/api/v1/receive msg="non-recoverable error" count=800 exemplarCount=0 err="server returned HTTP status 409 Conflict: forwarding request to endpoint thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901: add 1 samples: duplicate sample for timestamp"
This causes the remote write of metrics to stall. We also updated remote_write in each Prometheus to send an HTTP tenant header, so that Receive's multi-TSDB writes each tenant into its own local TSDB, and a folder is indeed created per tenant. As far as I know that should resolve the issue, right? Via the CLI we can list the number of tenants (one Prometheus per cluster). We have 5 replicas of the receive pod. Below is one snapshot of the tenant counts.
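For reference, the tenancy setup we mean looks roughly like this; it is a sketch assuming Receive's default THANOS-TENANT header, and the URL and tenant value are placeholders:

global:
  external_labels:
    cluster: cluster-a  # one distinct value per source cluster
remote_write:
  - url: http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive  # placeholder
    headers:
      # Receive routes samples carrying this header into a per-tenant TSDB
      THANOS-TENANT: cluster-a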
From a dashboard we query
count(count(up{app="prometheus"}) by (cluster))
and after a while the count decreases.
Expected results
The count should stay stable at 56 and not drop, unless some Prometheus is actually unhealthy.
Revision
manifest of receive
manifest of prometheus
PVC information
The receive front-end ingress status was fine, with a 100% request success rate.