Open defreng opened 1 year ago
What's your replication factor? In practice, it should be at least 3 to allow for downtime.
@GiedriusS as this deployment only handles non-critical data, we run with a replication factor of 1 and accept data loss in case of issues
Can you share some of the configuration you're using to create these Receive instances?
sure! this is our configuration:
args:
- receive
- --log.level=info
- --log.format=logfmt
- --grpc-address=0.0.0.0:10901
- --http-address=0.0.0.0:10902
- --remote-write.address=0.0.0.0:19291
- --receive.replication-factor=1
- --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
- --receive.hashrings-algorithm=ketama
- --label=receive="true"
Hi, unfortunately nothing can be done here. A replication factor of 1 doesn't allow for any downtime, and if Prometheus cannot send metrics it retries, which increases the memory usage of Receive. There was movement to change how quorum works, but that's outside the scope of this ticket.
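One thing that can at least be bounded is the client side of that retry pressure. Prometheus exposes `queue_config` under `remote_write`, which caps how many shards send in parallel and how aggressively they retry. A rough sketch (the URL is a placeholder for your router's remote-write endpoint; tune the numbers for your own load):

```yaml
remote_write:
  - url: http://thanos-receive-router:19291/api/v1/receive
    queue_config:
      capacity: 2500           # samples buffered per shard before blocking scrapes
      max_shards: 10           # cap the parallelism of the send/retry loop
      max_samples_per_send: 500
      min_backoff: 30ms
      max_backoff: 5s          # slow down retries against a struggling receiver
```

Lowering `max_shards` and raising `max_backoff` spreads the post-outage catch-up over a longer window instead of hammering the router the moment the ingestors return.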
@defreng have you checked the thanos-receive-controller project? It should cover your problem with replication-factor=1, so the system avoids downtime for write metrics
Hi
In the meantime we updated our configuration to a replication factor of 3, and also allocated more resources to the router (8GB of memory, of which only about 5% is used during normal operation).
However, when there is downtime on the ingestors (which unfortunately still happens from time to time), once the ingestors come back online the routers are all overwhelmed with requests and die from an OOMKill, hitting the 8GB limit within 2-3 seconds.
Do you have any suspicion what mechanism is eating all the memory? Would it make sense to be able to limit the concurrent requests the router can handle?
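For anyone hitting this: recent Thanos versions (v0.30+, if I remember right; check the Receive docs for your version) support a limits configuration passed via `--receive.limits-config-file`, which includes a global concurrency cap on write requests. A sketch of what that file might look like (the numbers here are illustrative, not recommendations):

```yaml
write:
  global:
    max_concurrency: 30        # reject writes beyond this many in flight
  default:
    request:
      size_bytes_limit: 1572864
      series_limit: 1000
      samples_limit: 10000
```

Rejected requests fail fast instead of piling up in router memory, and well-behaved remote-write clients will back off and retry.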
Thanos, Prometheus and Golang version used: v0.32.2
Object Storage Provider: none
What happened:
We are running a thanos receive router deployment in front of our receive ingestors, which currently handles about 50 reqs/second. During normal operation, the router pods handle this easily with 2 replicas, each using less than 70MB of memory.
However, when one of the pods in the ingestor StatefulSet behind that instance is temporarily unavailable and then comes back online after 5 minutes or so, there is a spike in incoming remote write requests as the clients retry successfully.
When that happens, within 3-5 seconds the receive router pod's memory usage shoots up from 70MB to over 2000MB, which is our limit and therefore results in an OOMKill.
What you expected to happen:
It's unclear to me why this is happening. What is all the memory used for in the receive router?
Is there a way to avoid this?