thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

receive: CPU Saturation #2982

enjoychaim closed this issue 3 years ago

enjoychaim commented 3 years ago

Thanos, Prometheus and Golang version used:

Object Storage Provider:

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-objectstorage
  namespace: monitoring
data:
  objectstorage.yaml: |
    type: S3
    config:
      endpoint: "s3.us-east-1.amazonaws.com"
      bucket: "sre-thanos"

What happened: Lost data; please see the image below

What you expected to happen: No data loss

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Receiver Logs

```
level=error ts=2020-08-05T01:32:26.233522015Z caller=handler.go:299 component=receive component=receive-handler err="replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:26.248889974Z caller=handler.go:299 component=receive component=receive-handler err="replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:26.414569932Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:26.418503985Z caller=handler.go:299 component=receive component=receive-handler err="replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.114444429Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.114662597Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.114900564Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.115106052Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
```

Anything else we need to know:

Environment:

brancz commented 3 years ago

But you can see this without gaps in Prometheus yes?

enjoychaim commented 3 years ago

But you can see this without gaps in Prometheus yes

There are no gaps in Prometheus.

Do you know what could be causing the errors below?

level=error ts=2020-08-05T01:32:26.233522015Z caller=handler.go:299 component=receive component=receive-handler err="replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:26.248889974Z caller=handler.go:299 component=receive component=receive-handler err="replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:26.414569932Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:26.418503985Z caller=handler.go:299 component=receive component=receive-handler err="replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.114444429Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.114662597Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.114900564Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
level=error ts=2020-08-05T01:32:27.115106052Z caller=handler.go:299 component=receive component=receive-handler err="3 errors: replicate write request, endpoint thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded; replicate write request, endpoint thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901: replicate: context deadline exceeded" msg="internal server error"
brancz commented 3 years ago

I do not know. That error causes Prometheus to retry sending data though, so it's unlikely to be that. Could you try running a master image of Thanos? A couple of fixes for receive did land recently and haven't been released yet.

enjoychaim commented 3 years ago

I changed receive.replication-factor to 1, and the error has not appeared again.

brancz commented 3 years ago

That's pretty dangerous though, that means if any one instance is not working your entire cluster is down.

enjoychaim commented 3 years ago

That's pretty dangerous though, that means if any one instance is not working your entire cluster is down.

I am experimenting with the master branch to see if there will be problems.

Is there a way to take a node offline while the cluster keeps working normally when receive.replication-factor is still 1? That way, the worst case would be losing only part of the data.

brancz commented 3 years ago

Replication factor is exactly what allows us to have partial downtime in a cluster and still function. Obviously we need to ensure that works reliably, so I'll tentatively mark this as a bug while we continue to investigate the issue and hopefully resolve it :)
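(For context, a quick worked example of the quorum math as I understand it from the receive replication logic; worth double-checking against the docs: a write is acknowledged once ⌊replication-factor / 2⌋ + 1 replicas succeed. With --receive.replication-factor=3 that is ⌊3/2⌋ + 1 = 2 of 3 receivers, so one instance can be down or slow without failing the write; with a factor of 1 the quorum is 1 of 1, and any unavailable receiver fails every write hashed to it.)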

enjoychaim commented 3 years ago

But you can see this without gaps in Prometheus yes?

I changed receive.replication-factor back to 3 and the problem appeared again.

The first figure is from Thanos; it has lost data.

image

The second figure is from Prometheus; its data is complete.

image

receive log

level=warn ts=2020-08-08T01:57:48.668899741Z caller=writer.go:91 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=11
level=warn ts=2020-08-08T01:57:48.669558959Z caller=writer.go:91 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=10
level=warn ts=2020-08-08T01:57:48.669979205Z caller=writer.go:91 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=4

receive config

    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: thanos
      containers:
      - args:
        - receive
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --remote-write.address=0.0.0.0:19291
        - --objstore.config-file=/etc/thanos/objectstorage.yaml
        - --tsdb.path=/var/thanos/receive
        - --tsdb.retention=15d
        - --tsdb.wal-compression
        - --receive.replication-factor=3
        - --label=receive_replica="$(NAME)"
        - --label=receive="true"
        - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
        - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: thanosio/thanos:v0.14.0
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 30

Maybe my earlier log excerpt was incomplete. I suspect the data loss is caused by "Error on ingesting out-of-order samples". In this case, should I use the master branch of receive, or switch to the sidecar?

enjoychaim commented 3 years ago

I'll use the master branch to verify the problem first.

Is the current receive not suitable for production environments with a large amount of data?

enjoychaim commented 3 years ago

The master branch still has the problem.

image

image

receive config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-objectstorage
  namespace: monitoring
data:
  objectstorage.yaml: |
    type: S3
    config:
      endpoint: "s3.us-east-1.amazonaws.com"
      bucket: "sre-thanos"
---

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-receive-hashrings
  namespace: monitoring
data:
  thanos-receive-hashrings.json: |
    [
      {
        "hashring": "soft-tenants",
        "endpoints":
        [
          "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
          "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
          "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
        ]
      }
    ]
---

apiVersion: v1
kind: Service
metadata:
  name: thanos-receive
  namespace: monitoring
  labels:
    kubernetes.io/name: thanos-receive
spec:
  ports:
  - name: http
    port: 10902
    protocol: TCP
    targetPort: 10902
  - name: remote-write
    port: 19291
    protocol: TCP
    targetPort: 19291
  - name: grpc
    port: 10901
    protocol: TCP
    targetPort: 10901
  selector:
    kubernetes.io/name: thanos-receive
  clusterIP: None
---

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    kubernetes.io/name: thanos-pdb
  name: thanos-pdb
  namespace: monitoring
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      kubernetes.io/name: thanos-receive
---

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    kubernetes.io/name: thanos-receive
  name: thanos-receive
  namespace: monitoring
spec:
  replicas: 3
  selector:
    matchLabels:
      kubernetes.io/name: thanos-receive
  serviceName: thanos-receive
  template:
    metadata:
      labels:
        kubernetes.io/name: thanos-receive
    spec:
      nodeSelector:
        eks.amazonaws.com/nodegroup: thanos
      containers:
      - args:
        - receive
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --remote-write.address=0.0.0.0:19291
        - --objstore.config-file=/etc/thanos/objectstorage.yaml
        - --tsdb.path=/var/thanos/receive
        - --tsdb.retention=15d
        - --tsdb.wal-compression
        - --receive.replication-factor=3
        - --label=receive_replica="$(NAME)"
        - --label=receive="true"
        - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
        - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # image: thanosio/thanos:v0.14.0
        image: thanosio/thanos:master-2020-08-07-9b578afb
        livenessProbe:
          failureThreshold: 4
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 30
        name: thanos-receive
        ports:
        - containerPort: 10901
          name: grpc
        - containerPort: 10902
          name: http
        - containerPort: 19291
          name: remote-write
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 60
        resources:
          limits:
            cpu: "1500m"
            memory: "14.5Gi"
          requests:
            cpu: "1500m"
            memory: "14.5Gi"
        volumeMounts:
        - mountPath: /var/thanos/receive
          name: data
          readOnly: false
        - mountPath: /etc/thanos/thanos-receive-hashrings.json
          name: thanos-receive-hashrings
          subPath: thanos-receive-hashrings.json
        - mountPath: /etc/thanos/objectstorage.yaml
          name: thanos-objectstorage
          subPath: objectstorage.yaml
        - mountPath: "/var/run/secrets/eks.amazonaws.com/serviceaccount/"
          name: aws-token
      terminationGracePeriodSeconds: 120
      volumes:
      - configMap:
          defaultMode: 420
          name: thanos-receive-hashrings
        name: thanos-receive-hashrings
      - configMap:
          name: thanos-objectstorage
        name: thanos-objectstorage
      - name: aws-token
        projected:
          sources:
          - serviceAccountToken:
              audience: "sts.amazonaws.com"
              expirationSeconds: 86400
              path: token
  volumeClaimTemplates:
  - metadata:
      labels:
        app.kubernetes.io/name: thanos-receive
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi

receive log:

level=warn ts=2020-08-09T01:52:39.184802971Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=3
level=warn ts=2020-08-09T01:52:39.185293108Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=6
level=warn ts=2020-08-09T01:52:40.430026634Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=5
brancz commented 3 years ago

We are actively ingesting millions of active series and are not experiencing this, so I believe the culprit is somewhere else. I currently don't have time to look into this further, but maybe @kakkoyun @squat or @bwplotka do.

stale[bot] commented 3 years ago

Hello 👋 Looks like there was no activity on this issue for last 30 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

Antiarchitect commented 3 years ago

Experiencing exactly the same error with replication factor 3. @chaimch - any success on this?

h-abinaya28 commented 3 years ago

Did someone find a fix for this?

level=error ts=2020-11-12T05:22:34.30823915Z caller=handler.go:299 component=receive component=receive-handler err= msg="internal server error"

JoseRIvera07 commented 3 years ago

I'm actually facing this issue with Thanos v0.16.0:

level=error ts=2020-11-18T12:03:49.983380782Z caller=handler.go:331 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"

yuriydzobak commented 3 years ago

I'm actually facing this issue with Thanos v0.16.0:

level=error ts=2020-11-18T12:03:49.983380782Z caller=handler.go:331 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"

This error happens when you set the wrong --receive.local-endpoint= and the hashring contains a different endpoint.
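To illustrate the relationship described above, here is a minimal sketch (names mirror the StatefulSet posted earlier in this thread; not a complete manifest). The value passed to --receive.local-endpoint must appear verbatim in the hashrings file, otherwise the receiver cannot find itself in the ring and forwarding fails:

```yaml
# Fragment of the receive container spec (assumed paths/names from this thread)
- args:
  - receive
  - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
  - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
# thanos-receive-hashrings.json must then list exactly these endpoints:
#   "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901"
#   "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901"
#   "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
```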

yuriydzobak commented 3 years ago

Did someone find a fix for this?

level=error ts=2020-11-12T05:22:34.30823915Z caller=handler.go:299 component=receive component=receive-handler err= msg="internal server error"

When Prometheus is on the latest version the issue appears =(. In my case the error doesn't show with Prometheus version 22.1.

Wander1024 commented 3 years ago

Same issue with Thanos 0.17.0:

level=error ts=2020-12-23T11:08:22.799370065Z caller=handler.go:331 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"

jmichalek132 commented 3 years ago

Same issue with Thanos v0.17.2:

level=error ts=2021-01-04T13:12:57.13438741Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"
level=error ts=2021-01-04T13:12:57.235778067Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"
level=error ts=2021-01-04T13:12:57.337735689Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"
level=error ts=2021-01-04T13:12:57.439512509Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"
jmichalek132 commented 3 years ago

For me the error stopped happening after upgrading to a build of master (image: thanosio/thanos:master-2021-01-05-0aa07118). There have been some changes to error handling in Thanos receive since the last release, notably this commit, but I am unsure whether that is what fixed the issue.

luizrojo commented 3 years ago

I have spent the whole afternoon today trying to figure this out, and this behavior seems to be an incompatibility with newer versions of Prometheus.

The error below only happens with Prometheus 2.23 or higher.

2021-01-20T17:12:24.577630-03:00 ip-10-184-125-9 thanos-receive[26164]: level=error ts=2021-01-20T20:12:24.577379617Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"

I just downgraded my Prometheus pod to version v2.22.2 and everything looks fine now.

I am running Thanos v0.17.2

Wander1024 commented 3 years ago

I have spent the whole afternoon today trying to figure this out, and this behavior seems to be an incompatibility with newer versions of Prometheus.

The error below only happens with Prometheus 2.23 or higher.

2021-01-20T17:12:24.577630-03:00 ip-10-184-125-9 thanos-receive[26164]: level=error ts=2021-01-20T20:12:24.577379617Z caller=handler.go:331 component=receive component=receive-handler err= msg="internal server error"

I just downgraded my Prometheus pod to version v2.22.2 and everything looks fine now.

I am running Thanos v0.17.2

I also got this error on Prometheus 1.19.3 and Thanos 0.17.0.

tdinucci commented 3 years ago

I've been hit by this issue too. I've not fully gotten to the bottom of it and there could easily be multiple reasons for it.

I've been testing out the Receiver in a local kind cluster and typically what I've been seeing is:

Initially I thought the problem was fixed after I downgraded Prom from v2.24 to v2.15.2 however this may not have been the case since the problem described above has since appeared again.

Where I have noticed a definite pattern is with resource allocation. Since I'm running the cluster locally I'm keeping things pretty lean. Nothing is OOMing though.

If I give Receivers a limit of 256Mi memory and 100m CPU then the steps I've outlined above reliably reproduce the issue. I suspect that because CPU is so low it's probably the main culprit but I've not confirmed.

When the receivers are too constrained they seem to be taking too long to join the ring and then the entire ring becomes unstable. Sometimes killing all receivers at the same time so that they all restart around the same time sorts things.

I thought for a while that the live/ready probes were potentially playing a part too but I've removed them from the picture and can replicate simply by constraining resources.

bwplotka commented 3 years ago

Thanks for those items.

Looks like it's painful enough to give it more priority. 🤗

Initially I thought the problem was fixed after I downgraded Prom from v2.24 to v2.15.2 however this may not have been the case since the problem described above has since appeared again.

Yes, I doubt it has anything to do with the Prometheus version, except maybe something in the remote-write configuration (sharding, queues, bigger remote-write batches, etc.). I would look into that.
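For reference, a hedged sketch of the remote-write knobs mentioned above (the URL and values are illustrative assumptions, not taken from this thread; the field names are standard Prometheus queue_config settings):

```yaml
# Prometheus remote_write tuning sketch; adjust values to your own traffic
remote_write:
  - url: http://thanos-receive.monitoring.svc.cluster.local:19291/api/v1/receive
    queue_config:
      max_shards: 50            # upper bound on parallel send shards
      capacity: 2500            # samples buffered per shard
      max_samples_per_send: 500 # batch size per request
      batch_send_deadline: 5s   # flush partial batches after this long
      min_backoff: 30ms         # retry backoff on failed sends
      max_backoff: 5s
```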

If I give Receivers a limit of 256Mi memory and 100m CPU then the steps I've outlined above reliably reproduce the issue. I suspect that because CPU is so low it's probably the main culprit but I've not confirmed.

That is waaaaaaay too low. Receiver is like Prometheus, just even more CPU consuming because of replication and forwarding. CPU saturation is a common problem, but it's expected if you want to cope with load.

There are a couple of things it would be amazing to clarify.

  1. We recommend running the receiver with replication. You can then survive load spikes and offload load across instances.
  2. The log line is one thing, but data loss is another, especially when running multiple receivers and especially when Prometheus is configured correctly (it should retry until data is shipped). Can we double-check why that retry did not work? This is important.
  3. Please check for CPU saturation. It's easy to spot when you look at the CPU time metric: you will see the line fluctuating around the limit with odd spikes. Sometimes you can even see gaps in Prometheus scraping the receiver if it's overloaded. It's recommended to give at least 1-2 CPU cores for typical setups. (A rule sketch for spotting this follows this list.)
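A minimal alerting-rule sketch for spotting the saturation described in point 3. The job label and the 1.5-core limit are assumptions taken from the manifests in this thread; adjust both to your own scrape config and resource limits:

```yaml
groups:
  - name: thanos-receive-saturation
    rules:
      - alert: ThanosReceiveCpuSaturated
        # cores used by the receive process vs. an assumed 1.5-core limit
        expr: |
          rate(process_cpu_seconds_total{job="thanos-receive"}[5m]) > 0.9 * 1.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "thanos-receive {{ $labels.instance }} is using more than 90% of its CPU limit"
```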

If saturation is the problem, make sure you:

  1. add more replicas
  2. add more CPU (see the resource sketch after this list)
  3. file an issue with CPU pprof profiles - we can take a look at how to optimize CPU usage; maybe we can improve something. Make sure to also mention the write traffic you had at the moment of the CPU profile 🤗
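A sketch of points 1 and 2 as a StatefulSet fragment. The numbers are illustrative assumptions only, not a maintainer recommendation; size them to your own write traffic:

```yaml
spec:
  replicas: 5                 # more receive replicas to spread the hashring load
  template:
    spec:
      containers:
      - name: thanos-receive
        resources:
          requests:
            cpu: "2"          # at least 1-2 cores per receiver
            memory: 16Gi
          limits:
            cpu: "4"
            memory: 16Gi
```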

Does that help?

tdinucci commented 3 years ago

Thanks for the feedback 👍

Yep, I fully intend to allocate a lot more resources. At the moment I'm really just trying to understand potential failure modes before attempting to move towards production.

Increasing the replication factor is also something I intend to do. One thing that bothers me a bit, though, is that we've got HA Prometheus pairs writing to Thanos clusters, so each series will end up replicated 4x. I understand there is experimental support for deduplicating series with identical data points, but I'm worried it may be a little too bleeding edge at the moment. If this isn't done, though, there will obviously be no guarantee that a single Receiver doesn't host all replicas of a particular series.
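One way to collapse those copies at read time is query-time deduplication via replica labels. A minimal sketch of the Thanos Query args, assuming the receive_replica label from the receive args earlier in this thread and a prometheus_replica external label on the HA Prometheus pairs (the latter is an assumption about the Prometheus setup):

```yaml
# Fragment of a thanos query container spec
- args:
  - query
  - --query.replica-label=prometheus_replica
  - --query.replica-label=receive_replica
```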

I've not spent time yet explicitly trying to cause data loss. I don't believe I'm seeing any when a Receiver goes offline - I haven't confirmed 100% yet, but so far it looks like (with a replication factor of 1) points are just queued up in the Proms and delivered whenever the Receiver ring becomes stable again.

Something I have noticed, though, is that perhaps 50% of the time, after a Receiver gets a SIGTERM and tries to flush its block, the flush fails. It's not always the case, but much of the time the logs state it's because the block has been closed (or something like that) and can't be read. The data is still on disk as far as the Receiver is concerned, so no data loss.

Sorry, I'm probably getting too far off the point of this issue now :)

*Edit: I have just seen the issue I mentioned about the block not being flushed to the object store. Reading the logs now, I think they may just be saying that the process isn't going to accept new writes.

level=info ts=2021-01-26T20:39:38.7154373Z caller=main.go:168 msg="caught signal. Exiting." signal=terminated
level=warn ts=2021-01-26T20:39:38.722088Z caller=intrumentation.go:54 component=receive msg="changing probe status" status=not-ready reason=null
level=info ts=2021-01-26T20:39:38.723041Z caller=http.go:65 component=receive service=http/server component=receive msg="internal server is shutting down" err=null
level=info ts=2021-01-26T20:39:38.7238685Z caller=receive.go:319 component=receive msg="shutting down storage"
level=info ts=2021-01-26T20:39:38.7252897Z caller=multitsdb.go:152 component=receive component=multi-tsdb msg="flushing TSDB" tenant=default-tenant
level=info ts=2021-01-26T20:39:39.2325798Z caller=http.go:84 component=receive service=http/server component=receive msg="internal server is shutdown gracefully" err=null
level=info ts=2021-01-26T20:39:39.2327525Z caller=intrumentation.go:66 component=receive msg="changing probe status" status=not-healthy reason=null
level=info ts=2021-01-26T20:39:40.2242286Z caller=compact.go:500 component=receive component=multi-tsdb tenant=default-tenant msg="write block" mint=1611693148894 maxt=1611693577791 ulid=01EX06RPF56H854R67VK6Z08AY duration=1.4985802s
level=info ts=2021-01-26T20:39:40.5811752Z caller=head.go:824 component=receive component=multi-tsdb tenant=default-tenant msg="Head GC completed" duration=79.5139ms
level=info ts=2021-01-26T20:39:40.7326703Z caller=receive.go:323 component=receive msg="storage is flushed successfully"
level=info ts=2021-01-26T20:39:40.7329166Z caller=multitsdb.go:180 component=receive component=multi-tsdb msg="closing TSDB" tenant=default-tenant
level=info ts=2021-01-26T20:39:40.7375334Z caller=receive.go:329 component=receive msg="storage is closed"
level=info ts=2021-01-26T20:39:40.7378887Z caller=receive.go:535 component=receive component=uploader msg="uploading the final cut block before exiting"
level=info ts=2021-01-26T20:39:40.7396618Z caller=grpc.go:123 component=receive service=gRPC/server component=receive msg="internal server is shutting down" err=null
level=info ts=2021-01-26T20:39:40.7400401Z caller=grpc.go:136 component=receive service=gRPC/server component=receive msg="gracefully stopping internal server"
level=info ts=2021-01-26T20:39:40.7432215Z caller=grpc.go:149 component=receive service=gRPC/server component=receive msg="internal server is shutdown gracefully" err=null
level=warn ts=2021-01-26T20:39:40.9814783Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=1
level=warn ts=2021-01-26T20:39:40.9819714Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=1
level=warn ts=2021-01-26T20:39:40.985965Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=3
level=error ts=2021-01-26T20:39:43.564304Z caller=handler.go:331 component=receive component=receive-handler err="storing locally, endpoint thanos-prometheus-thanos-receiver-1.thanos-prometheus-thanos-receiver.thanos.svc.cluster.local:10901: commit samples: write to WAL: log series: write data/default-tenant/wal/00000001: file already closed" msg="internal server error"
level=error ts=2021-01-26T20:39:43.5654236Z caller=handler.go:331 component=receive component=receive-handler err="storing locally, endpoint thanos-prometheus-thanos-receiver-1.thanos-prometheus-thanos-receiver.thanos.svc.cluster.local:10901: commit samples: write to WAL: log series: write data/default-tenant/wal/00000001: file already closed" msg="internal server error"
level=error ts=2021-01-26T20:39:43.5680151Z caller=handler.go:331 component=receive component=receive-handler err="storing locally, endpoint thanos-prometheus-thanos-receiver-1.thanos-prometheus-thanos-receiver.thanos.svc.cluster.local:10901: commit samples: write to WAL: log series: write data/default-tenant/wal/00000001: file already closed" msg="internal server error"
level=error ts=2021-01-26T20:39:43.5748706Z caller=handler.go:331 component=receive component=receive-handler err="storing locally, endpoint thanos-prometheus-thanos-receiver-1.thanos-prometheus-thanos-receiver.thanos.svc.cluster.local:10901: commit samples: write to WAL: log series: write data/default-tenant/wal/00000001: file already closed" msg="internal server error"
level=error ts=2021-01-26T20:39:43.5823829Z caller=handler.go:331 component=receive component=receive-handler err="storing locally, endpoint thanos-prometheus-thanos-receiver-1.thanos-prometheus-thanos-receiver.thanos.svc.cluster.local:10901: commit samples: write to WAL: log series: write data/default-tenant/wal/00000001: file already closed" msg="internal server error"
Drewster727 commented 3 years ago

Just curious, has anyone seen this issue with 0.18.0? Going to give it a shot early next week.

yuzijiang718 commented 3 years ago

Just curious, has anyone seen this issue with 0.18.0? Going to give it a shot early next week.

Hi, I am using v0.18.0. I don't know if there is any mistake in my config file.

The YAML file looks like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/component: database-write-hashring
    app.kubernetes.io/instance: thanos-receive
    app.kubernetes.io/name: thanos-receive
    app.kubernetes.io/version: v0.18.0
  name: thanos-receive
  namespace: thanos
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/component: database-write-hashring
      app.kubernetes.io/instance: thanos-receive
      app.kubernetes.io/name: thanos-receive
  serviceName: thanos-receive
  template:
    metadata:
      labels:
        app.kubernetes.io/component: database-write-hashring
        app.kubernetes.io/instance: thanos-receive
        app.kubernetes.io/name: thanos-receive
        app.kubernetes.io/version: v0.18.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-receive
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                  - thanos-receive
              namespaces:
              - thanos
              topologyKey: kubernetes.io/hostname
            weight: 100
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-receive
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                  - thanos-receive
              namespaces:
              - thanos
              topologyKey: topology.kubernetes.io/zone
            weight: 100
      containers:
      - args:
        - receive
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --remote-write.address=0.0.0.0:19291
        - --receive.replication-factor=2
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/var/thanos/receive
        - --label=replica="$(NAME)"
        - --label=receive="true"
        - --tsdb.retention=15d
        - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
        - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
        - |-
          --tracing.config="config":
            "sampler_param": 2
            "sampler_type": "ratelimiting"
            "service_name": "thanos-receive"
          "type": "JAEGER"
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: OBJSTORE_CONFIG
          valueFrom:
            secretKeyRef:
              key: thanos.yaml
              name: thanos-objectstorage
        image: core.harbor.panda.com/thanos/thanos:v0.18.0
        livenessProbe:
          failureThreshold: 8
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 30
        name: thanos-receive
        ports:
        - containerPort: 10901
          name: grpc
        - containerPort: 10902
          name: http
        - containerPort: 19291
          name: remote-write
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          periodSeconds: 5
        resources:
          limits:
            cpu: 2
            memory: 1024Mi
          requests:
            cpu: 0.5
            memory: 512Mi
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /var/thanos/receive
          name: data
          readOnly: false
        - mountPath: /var/lib/thanos-receive
          name: hashring-config
      terminationGracePeriodSeconds: 900
      volumes:
      - configMap:
          name: hashring
        name: hashring-config
  volumeClaimTemplates:
  - metadata:
      labels:
        app.kubernetes.io/component: database-write-hashring
        app.kubernetes.io/instance: thanos-receive
        app.kubernetes.io/name: thanos-receive
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi

The hashring file looks like this:

{
    "hashrings.json": "[
            {
                \"hashring\": \"soft-tenants\",
                \"endpoints\": 
                [
                  \"thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901\",
                  \"thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901\",
                  \"thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901\"
                ]
            }
        ]
        "
}

And the receive log looks like this:

level=error ts=2021-02-01T02:46:59.386889118Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=error ts=2021-02-01T03:09:26.704816498Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=error ts=2021-02-01T03:26:15.085478835Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"

kakkoyun commented 3 years ago

Hey @yuzijiang718, as suggested above have you tried to allocate more resources?

kimabd commented 3 years ago

Hey @yuzijiang718, as suggested above have you tried to allocate more resources?

Hello, I'm using the same configuration and I have enough memory and CPU (usage is always below the limits). On minikube everything works well, but on the real cluster there are always liveness-probe failures every 5-20 minutes and a lot of error messages. On the graph the data is empty 98% of the time.

Here is my config:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/component: database-write-hashring
    app.kubernetes.io/instance: thanos-receive
    app.kubernetes.io/name: thanos-receive
    app.kubernetes.io/version: v0.18.0
  name: thanos-receive
  namespace: monitoring
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/component: database-write-hashring
      app.kubernetes.io/instance: thanos-receive
      app.kubernetes.io/name: thanos-receive
  serviceName: thanos-receive
  template:
    metadata:
      annotations: {}
      labels:
        app.kubernetes.io/component: database-write-hashring
        app.kubernetes.io/instance: thanos-receive
        app.kubernetes.io/name: thanos-receive
        app.kubernetes.io/version: v0.18.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-receive
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                  - thanos-receive
              namespaces:
              - monitoring
              topologyKey: kubernetes.io/hostname
            weight: 100
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                  - thanos-receive
                - key: app.kubernetes.io/instance
                  operator: In
                  values:
                  - thanos-receive
              namespaces:
              - monitoring
              topologyKey: topology.kubernetes.io/zone
            weight: 100
      containers:
      - args:
        - receive
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --remote-write.address=0.0.0.0:19291
        - --receive.replication-factor=3
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/var/thanos/receive
        - --label=replica="$(NAME)"
        - --label=receive="true"
        - --tsdb.retention=14d
        - --receive.local-endpoint=$(NAME).thanos-receive.$(NAMESPACE).svc.cluster.local:10901
        - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: OBJSTORE_CONFIG
          valueFrom:
            secretKeyRef:
              key: thanos.yaml
              name: thanos-objstore-config
        image: quay.io/thanos/thanos:v0.18.0
        livenessProbe:
          failureThreshold: 8
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 30
        name: thanos-receive
        ports:
        - containerPort: 10901
          name: grpc
        - containerPort: 10902
          name: http
        - containerPort: 19291
          name: remote-write
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          periodSeconds: 5
        resources:
          limits:
            cpu: 0.7
            memory: 8000Mi
          requests:
            cpu: 0.123
            memory: 100Mi
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - mountPath: /var/thanos/receive
          name: data
          readOnly: false
        - mountPath: /var/lib/thanos-receive
          name: hashring-config
      terminationGracePeriodSeconds: 900
      volumes:
      - configMap:
          name: thanos-receive-base
        name: hashring-config
      - emptyDir: {}
        name: data
  volumeClaimTemplates: []

And here are the logs:

level=error ts=2021-02-19T08:32:38.742509105Z caller=handler.go:330 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
level=warn ts=2021-02-19T08:32:38.74330643Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=1

Readiness and liveness probes periodically fail in both environments (minikube and the real cluster).

60m Warning Unhealthy pod/thanos-receive-2 Readiness probe failed: Get http://172.17.0.15:10902/-/ready: dial tcp 172.17.0.15:10902: connect: connection refused
60m Warning Unhealthy pod/thanos-receive-2 Liveness probe failed: Get http://172.17.0.15:10902/-/healthy: dial tcp 172.17.0.15:10902: connect: connection refused

tdinucci commented 3 years ago

I have an update on my experience with this issue that might help someone.

When running Receivers within a vanilla k8s cluster I saw only the issues I'd mentioned in my previous comment and generally speaking things are fine when Receivers have enough resources. I recently tried Receivers in a cluster which had Istio installed and the Receiver pods had Istio sidecars injected into them.

The TL;DR is that I saw the same behaviour that was mentioned in the initial post in this thread, i.e. periods of data loss. It's still too early to say with 100% certainty but today I have prevented Istio sidecars from being injected into Receiver pods and things so far are working as expected. I don't know if the OP was using a service mesh or not?

At this point I can only speculate as to what was going on when Istio was in the picture. During this test almost all Prom remote-write settings were at their defaults. What I observed however was:

There may well be some things that can be done to keep Receivers inside a service mesh, maybe disabling retries and setting short connection timeouts. I've not tried these things. I suppose my point though is that there are issues that can happen if you're using one.
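For anyone wanting to try the same workaround, one common way to keep the sidecar out of the Receiver pods is the standard Istio injection annotation on the pod template (an assumption about your Istio setup; sketch only):

```yaml
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
```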

milonjames commented 3 years ago

I am facing this issue with Thanos v0.20.1 and Prometheus v2.11.0. In my case, the Thanos receive pods end up using all the CPU available on the node and throwing these errors. Has anyone managed to tackle the issue or found any workarounds? @kakkoyun