thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

receive: memory spike when running in RouteOnly mode #6784

Open · defreng opened this issue 1 year ago

defreng commented 1 year ago

Thanos, Prometheus and Golang version used: v0.32.2

Object Storage Provider: none

What happened:

We are running a thanos receive router deployment in front of our receive ingestors, which currently handles about 50 requests/second. During normal operation, the router pods handle this easily with 2 replicas, each using less than 70 MB of memory.

However, when one of the pods in the ingestor StatefulSet behind the router is temporarily unavailable and then comes back online after 5 minutes or so, there is a spike in incoming remote write requests as the clients retry successfully.

When that happens, the receive router pod's memory usage shoots up within 3-5 seconds from 70 MB to over 2000 MB (which is our limit and therefore results in an OOMKill).

What you expected to happen:

It's unclear to me why this is happening. What is all the memory used for in the receive router?

Is there a way to avoid this?

GiedriusS commented 1 year ago

What's your replication factor? In practice, it should be at least 3 to allow for downtime.
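For reference, a minimal sketch of the relevant flags from the configuration shared later in this thread, with the replication factor raised to 3 (the hashring pointed to by --receive.hashrings-file then needs at least three ingestor endpoints, so a write quorum of 2 out of 3 can still be reached while one ingestor is down):

    # Sketch only: the Receive flags relevant to tolerating single-ingestor downtime.
    args:
      - receive
      - --receive.replication-factor=3
      - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
      - --receive.hashrings-algorithm=ketama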

defreng commented 1 year ago

@GiedriusS as this deployment only handles non-critical data, we use a replication factor of 1 and accept data loss in case of issues.

epeters-jrmngndr commented 11 months ago

Can you share some of the configuration you're using to create these Receive instances?

defreng commented 11 months ago

Sure! This is our configuration:

          args:
            - receive
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --remote-write.address=0.0.0.0:19291
            - --receive.replication-factor=1
            - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
            - --receive.hashrings-algorithm=ketama
            - --label=receive="true"

GiedriusS commented 11 months ago

Hi, unfortunately nothing can be done here. A replication factor of 1 doesn't allow any downtime, and if Prometheus cannot send metrics it retries, which increases the memory usage of Receive. There was movement to change how quorum works, but that's outside the scope of this ticket.
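The size of the retry burst that hits the router once the endpoint is reachable again is largely bounded by the senders' remote write queues. A sketch of the Prometheus side of this (the hostname is made up for illustration; the port matches the --remote-write.address in the config above, and the values are illustrative rather than recommendations):

    # prometheus.yml (sketch): queue_config bounds how much each Prometheus
    # buffers during an outage and how aggressively it resends afterwards.
    remote_write:
      - url: http://thanos-receive-router.example:19291/api/v1/receive
        queue_config:
          max_shards: 10             # caps the number of parallel senders per Prometheus
          capacity: 2500             # samples buffered per shard
          max_samples_per_send: 500  # samples per remote write request
          min_backoff: 30ms
          max_backoff: 5s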

hayk96 commented 11 months ago

@defreng have you checked the thanos-receive-controller project? It should address your problem with replication-factor=1 so the system avoids downtime for writing metrics.

defreng commented 9 months ago

Hi

In the meantime we updated our configuration to a replication factor of 3 and also allocated more resources to the router (8 GB of memory, of which only about 5% is used during normal operation).

However, when there is some downtime on the ingestors (which unfortunately still happens from time to time), once they come back online the routers are all overwhelmed with requests and die with an OOMKill (hitting the 8 GB limit within 2-3 seconds).

Do you have any suspicion about which mechanism is eating all the memory? Would it make sense to be able to limit the number of concurrent requests the router handles?
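Recent Thanos releases document an experimental "Limits & Gates" feature for Receive that includes a gate on concurrent write requests. A sketch of what such a limits file might look like, assuming the documented --receive.limits-config-file flag and the write.global.max_concurrency key are available in the version in use (worth verifying against the Receive docs for v0.32):

    # Sketch of a Receive limits config (experimental feature; verify the exact
    # flag name and keys against the Receive "Limits & Gates" documentation).
    write:
      global:
        # Gate: how many remote write requests are processed concurrently;
        # excess requests wait at the gate instead of all being buffered at once.
        max_concurrency: 30
      default:
        request:
          # Per-request limits; 0 means unlimited. Values are illustrative.
          size_bytes_limit: 0
          series_limit: 0
          samples_limit: 0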