open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Clarification w.r.t test case TestRelativelyEvenDistribution #2133

Open sfc-gh-akrishnan opened 1 year ago

sfc-gh-akrishnan commented 1 year ago

The consistent_hashing.go module in https://github.com/open-telemetry/opentelemetry-operator/blob/main/cmd/otel-allocator/allocation/consistent_hashing.go is using the consistent hash implementation from github.com/buraksezer/consistent.

The load value is initialized to 1.1.

Q1) Are there any plans to make these values configurable in the near future? Or is there a knob already available?

Q2) As a follow-up: from the blog posts I have read, my understanding is that with Load set to 1.1, the most loaded member should in the worst case deviate only 10% from the average load. If that is the case, why does the test case TestRelativelyEvenDistribution at https://github.com/open-telemetry/opentelemetry-operator/blob/main/cmd/otel-allocator/allocation/consistent_hashing_test.go check whether the load is within 50% of the average rather than 10%? What am I missing?

I appreciate the maintainers' patience in helping out with an answer : )

Thanks in advance

jaronoff97 commented 1 year ago

I don't think we'll be making it configurable for now. Maybe once #1957 is merged we could consider it, but I'm not sure it's worth exposing those knobs for now. Do you have a use case for configuring this?

When I wrote that test, it was a bit flaky so I increased the delta for us to revisit these numbers in the future. If you want to mess around with the values / distribution / tests go for it :) I found that these values have resulted in a pretty even distribution in my clusters.

sfc-gh-akrishnan commented 1 year ago

I don't have a strong use case. I wanted to experiment with the configuration to see what makes the most sense in our cluster.

On the second item, I actually see quite a bit of variance. With 214 targets to scrape and with 3 collectors, the average load is expected to be ~71 targets per collector. With a 10% load imbalance allowed, I expected the distribution to be in the range [64, 78]. But the observation was different:

{'otel-collector-1': 77, 'otel-collector-2': 80, 'otel-collector-0': 57}
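The arithmetic above can be checked with a small Go sketch. It is only an illustration of the numbers quoted in this comment: the helper names (averageLoad, relativeDeviation, withinTolerance) are hypothetical and are not part of the operator or of github.com/buraksezer/consistent, and the symmetric tolerance band is an assumption about how "10% off" is read here.

```go
package main

import (
	"fmt"
	"sort"
)

// averageLoad returns the mean number of targets per collector.
func averageLoad(counts map[string]int) float64 {
	total := 0
	for _, n := range counts {
		total += n
	}
	return float64(total) / float64(len(counts))
}

// relativeDeviation returns (n - avg) / avg, e.g. 0.10 for 10% above average.
func relativeDeviation(n int, avg float64) float64 {
	return (float64(n) - avg) / avg
}

// withinTolerance reports whether every collector lies within +/- tol of the
// average. The symmetric band is an assumption made for this illustration.
func withinTolerance(counts map[string]int, tol float64) bool {
	avg := averageLoad(counts)
	for _, n := range counts {
		d := relativeDeviation(n, avg)
		if d > tol || d < -tol {
			return false
		}
	}
	return true
}

func main() {
	// Distribution reported in this comment: 214 targets over 3 collectors.
	observed := map[string]int{
		"otel-collector-0": 57,
		"otel-collector-1": 77,
		"otel-collector-2": 80,
	}

	avg := averageLoad(observed)
	fmt.Printf("average: %.2f targets per collector\n", avg)

	// Sort names so the output order is stable.
	names := make([]string, 0, len(observed))
	for name := range observed {
		names = append(names, name)
	}
	sort.Strings(names)

	for _, name := range names {
		n := observed[name]
		fmt.Printf("%s: %d targets (%+.1f%%)\n", name, n, 100*relativeDeviation(n, avg))
	}

	fmt.Println("within 10%:", withinTolerance(observed, 0.10))
	fmt.Println("within 50%:", withinTolerance(observed, 0.50))
}
```

On these numbers, otel-collector-2 is about 12% above the average and otel-collector-0 about 20% below it, so the distribution falls outside a 10% band but well inside a 50% band.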

That is why I started reading the test case, and found that it checks for a 50% deviation instead. I would like to understand what I am missing. I also see the variance shrink as the number of collectors increases. Do you happen to have any insights from experience to share here? Please also let me know if there is a flaw in my understanding.

jaronoff97 commented 1 year ago

Honestly, I'm not sure. We're using a library to manage the distribution, so I would probably check there for more information.