Hey @michael-burt, sorry this did not get proper attention. I'd actually like to add this change. Do you have any objections?
Hi @matej-g, I have no objections to merging this, but there is a big caveat. In high-churn environments where pods frequently get rescheduled, this PR causes the hashring configuration to change frequently. Each change triggers a redistribution of MTS across the hashring pods, which can lead to a large increase in memory consumption. This can contribute to cascading failures: hashring pods get OOMKilled and removed from the hashring configuration, which triggers another redistribution of MTS, putting further load on the remaining healthy pods until they are eventually OOMKilled as well.
In the end, I found it was more stable to just run a static hashring config and remove this operator from our environment completely. I think we should include a caveat in the documentation describing these failure modes if we are going to merge this.
The failures I describe above were observed in an environment with ~50M MTS and 30-40 ingestor pods with a replication factor of 3.
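For context, the static setup mentioned above is just a hand-maintained hashrings file pointed at the receivers' stable DNS names; a minimal sketch, assuming a single default hashring and illustrative service/namespace names and port:

```json
[
  {
    "hashring": "default",
    "tenants": [],
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]
```

Because the endpoint list never changes on pod restarts or readiness flaps, the hashring stays stable and no redistribution is triggered.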
Hey @michael-burt, thanks for your response. I've now looked at the code and I see what you're saying. My initial impression was that this change only affects cases where we scale up and add new replicas to the hashring, in which case I think it makes sense to wait for the replicas to be ready before they are added.
However, for the second case, where we check replica readiness on each sync, I agree that this is not a good idea in general. As you pointed out, any intermittent issue where one or more receiver replicas go down would change the hashring, which would mean frequent hashring changes and the problems you mentioned, which is exactly what we want to avoid. Even though there are some changes upstream (a more uniform algorithm for redistributing series in the hashring, as well as not flushing the TSDB), I still don't think removing replicas on every un-readiness is a good idea.
WDYT about only including the first change (on scale-up, wait for new replicas to be ready) but keeping the current behavior with regard to existing replicas? I'd be happy to take over this PR and possibly document this behavior a bit better as well.
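A minimal sketch of that compromise, with hypothetical names rather than the controller's actual API: endpoints already in the hashring are kept even if their pod is momentarily not ready, while newly added endpoints (e.g. after a scale-up) are gated on readiness.

```go
package hashring

// buildEndpoints is a hypothetical illustration of the compromise above:
// endpoints already present in the current hashring are kept regardless of
// readiness, while endpoints that would be newly added (e.g. after a
// scale-up) are only included once their pod reports Ready.
func buildEndpoints(current, desired []string, ready map[string]bool) []string {
	inCurrent := make(map[string]bool, len(current))
	for _, ep := range current {
		inCurrent[ep] = true
	}

	out := make([]string, 0, len(desired))
	for _, ep := range desired {
		if inCurrent[ep] || ready[ep] {
			out = append(out, ep)
		}
	}
	return out
}
```

This way a flapping replica never shrinks the hashring, so the redistribution problem described above is avoided, while scale-ups still wait for readiness.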
@michael-burt I'll take this over. I'll probably open a new PR, but I'll reuse some bits from this, thank you!
Sounds good, thanks @matej-g
This PR adds a CLI flag (`allow-only-ready-replicas`) which filters pods that are not in the Ready condition out of the hashring configuration. I have been testing this at scale for about 6 weeks, and it does lead to a reduction in the frequency of replication and forwarding errors in the hashring.
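For context, the readiness check behind such a flag boils down to inspecting each pod's Ready condition before its endpoint is written into the hashring configuration; a minimal sketch using client-go types, with illustrative function names rather than the controller's actual code:

```go
package hashring

import (
	corev1 "k8s.io/api/core/v1"
)

// isReady reports whether the pod's Ready condition is true.
func isReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// filterReady keeps only Ready pods, the kind of filtering the flag enables
// before endpoints are generated for the hashring configuration.
func filterReady(pods []corev1.Pod) []corev1.Pod {
	out := make([]corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if isReady(p) {
			out = append(out, p)
		}
	}
	return out
}
```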