Open janario opened 4 months ago
Don't get too attached to the 96.58 % availability in the failed cases.
In a 1m run with 500 users, the scenario without errors reaches 57838 successful hits, while the run with a rollout reaches a total of only 19330 hits. In other words, the client spends more time handling errors and ends up making far fewer requests overall.
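To make the point above concrete, here is a small sketch comparing the two runs by raw volume rather than by availability percentage. The two hit counts come from the comment above; the 60-second window is an assumption based on the "1m" run length, and the variable names are only illustrative.

```python
# Hit counts quoted in the comment above.
baseline_hits = 57838   # run without a rollout, no errors
rollout_hits = 19330    # run with rollouts in progress

# Assumption: both runs lasted the full 1-minute window.
window_seconds = 60


def hits_per_second(hits: int, seconds: int) -> float:
    """Average request rate over the run."""
    return hits / seconds


# Share of the baseline volume that the rollout run still managed to complete.
relative_pct = round(100.0 * rollout_hits / baseline_hits, 1)

print(f"baseline: {hits_per_second(baseline_hits, window_seconds):.2f} hits/s")
print(f"rollout:  {hits_per_second(rollout_hits, window_seconds):.2f} hits/s")
print(f"rollout run completed {relative_pct}% of the baseline volume")
```

Seen this way, the rollout run completes roughly a third of the baseline volume, which is a much larger gap than the availability percentage alone suggests.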
Created some more reproducible scenarios
(I used the operator just because it was easier to integrate the tests.)
Logs in the good scenario with custom image and sleep: ✅
Unpacking siege (4.0.7-1+b1) ...
Setting up siege (4.0.7-1+b1) ...
Starting siege
New configuration template added to //.siege
Run siege -C to view the current settings in that file
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
{ "transactions": 3665,
"availability": 100.00,
"elapsed_time": 59.91,
"data_transferred": 0.07,
"response_time": 0.00,
"transaction_rate": 61.18,
"throughput": 0.00,
"concurrency": 0.12,
"successful_transactions": 0,
"failed_transactions": 0,
"longest_transaction": 0.06,
"shortest_transaction": 0.00
}
When using the default image without sleep:
Setting up siege (4.0.7-1+b1) ...
Starting siege
New configuration template added to //.siege
Run siege -C to view the current settings in that file
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
deployment.apps/ha-collector restarted
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 out of 2 new replicas have been updated...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "ha-collector" rollout to finish: 1 old replicas are pending termination...
deployment "ha-collector" successfully rolled out
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused
{ "transactions": 1785,
"availability": 94.34,
"elapsed_time": 59.13,
"data_transferred": 0.03,
"response_time": 0.00,
"transaction_rate": 30.19,
"throughput": 0.00,
"concurrency": 0.03,
"successful_transactions": 0,
"failed_transactions": 107,
"longest_transaction": 0.04,
"shortest_transaction": 0.00
}
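As a sanity check on the two siege reports above, the sketch below recomputes availability and transaction rate from the raw counters. It assumes availability is `transactions / (transactions + failed_transactions)` and that the rate is `transactions / elapsed_time`; the reported values are consistent with that, but this is inferred from the numbers rather than taken from siege's source. The function name `siege_metrics` is hypothetical.

```python
def siege_metrics(report: dict) -> dict:
    """Recompute availability (%) and transaction rate from siege counters."""
    ok = report["transactions"]
    failed = report["failed_transactions"]
    attempts = ok + failed
    return {
        # Assumed formula: completed transactions over all attempts.
        "availability": round(100.0 * ok / attempts, 2) if attempts else 0.0,
        # Assumed formula: completed transactions per second of elapsed time.
        "transaction_rate": round(ok / report["elapsed_time"], 2),
    }


# Counters copied from the two JSON reports above.
good_run = {"transactions": 3665, "failed_transactions": 0, "elapsed_time": 59.91}
bad_run = {"transactions": 1785, "failed_transactions": 107, "elapsed_time": 59.13}

print("custom image with sleep:", siege_metrics(good_run))
print("default image, no sleep:", siege_metrics(bad_run))
```

Both recomputed pairs match the reports (100.00 / 61.18 and 94.34 / 30.19), and the run without sleep completes about half as many transactions in the same time.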
In both cases logs from the collector are fine:
2024-05-13T14:35:57.696Z info service@v0.99.0/service.go:192 Everything is ready. Begin running and processing data.
2024-05-13T14:35:57.696Z warn localhostgate/featuregate.go:63 The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-05-13T14:36:10.198Z info otelcol@v0.99.0/collector.go:281 Received signal from OS {"signal": "terminated"}
2024-05-13T14:36:10.198Z info service@v0.99.0/service.go:229 Starting shutdown...
2024-05-13T14:36:10.199Z info extensions/extensions.go:59 Stopping extensions...
2024-05-13T14:36:10.199Z info service@v0.99.0/service.go:243 Shutdown complete.
It is likely that this is happening because our healthcheck extension needs to be improved: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/26661
@TylerHelmuth should I try with v2, or do you mean that even v2 still needs improvements?
I mean the current version of extension/healthcheck has some issues that result in readiness and liveness not being 100% perfect. There is ongoing work to fix the issues, but it is slow going. https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/30673 is a new implementation (you might be able to check out that branch and use the code in a custom build of the collector).
Got it, thanks for the details.
Later I can try to test with it and add the results here.
So for now the only workaround would be a custom image with sleep? :-/
I gave healthcheckv2 a try, but no luck :-/
I know it is still in progress, but I'm not sure the issue is in the healthcheck at all 🤔
Describe the bug
We noticed that while we roll out new changes to our collector, the applications report warnings/errors about refused connections.
Our collector is managed by the Operator, and we already noticed that it doesn't have a readinessProbe; we will share a fix for that: https://github.com/open-telemetry/opentelemetry-operator/issues/2943
But even with the readiness probe, when we simulate a rollout while the collector is receiving many requests, some of them get dropped.
Steps to reproduce
Our scenario to reproduce it:
We use siege to make lots of requests to the collector while we roll it out.
While siege is making requests, we start a rollout in parallel.
Pods get replaced, but after siege finishes we see that some requests were dropped.
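The steps above can be sketched as a small driver script. This is only an illustration, not the exact harness used here: the target URL, the number of restarts, and the helper names are hypothetical, while the deployment name `ha-collector` is taken from the logs. It assumes `siege` (with `-j` for JSON output) and `kubectl` are on the PATH.

```python
import subprocess


def siege_command(url: str, users: int = 500, duration: str = "1M") -> list:
    """siege invocation: JSON report, N concurrent users, timed run."""
    return ["siege", "-j", "-c", str(users), "-t", duration, url]


def rollout_commands(deployment: str) -> list:
    """Restart the deployment, then block until the rollout completes."""
    target = f"deployment/{deployment}"
    return [
        ["kubectl", "rollout", "restart", target],
        ["kubectl", "rollout", "status", target],
    ]


def run_scenario(url: str, deployment: str = "ha-collector", restarts: int = 3) -> str:
    """Run siege in the background and trigger rollouts while it is sending load."""
    load = subprocess.Popen(siege_command(url), stdout=subprocess.PIPE, text=True)
    for _ in range(restarts):
        if load.poll() is not None:
            break  # siege already finished; no point in restarting again
        for cmd in rollout_commands(deployment):
            subprocess.run(cmd, check=True)
    report, _ = load.communicate()  # siege prints its JSON report on stdout
    return report
```

Calling `run_scenario("http://<collector-service>:<port>/")` with a real endpoint would return the siege JSON report; the endpoint is left as a placeholder because the one used in this issue is not shown.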
To fix that, we added a preStop lifecycle hook.
With a lifecycle hook that is a simple sleep 10s, the results are much better, with 100% availability. 🕺 🙌
But I didn't want to use a custom image just to have the sleep command, and I wonder if something could be done in otel-collector to make it work as expected.
It seems that something goes wrong during the graceful shutdown and causes requests to go unanswered.
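For reference, a preStop hook of this kind has roughly the following shape in a pod spec. The sketch below builds it as a strategic-merge patch body and prints it as JSON. The container name `otc-container` and the helper name are assumptions for illustration, and where the fragment has to be set depends on how the collector is deployed: with the Operator it likely needs to go through the custom resource, since a patch on the generated Deployment would be reconciled away. It also only works if the image actually ships a `sleep` binary, which is why a custom image was needed.

```python
import json


def prestop_sleep_patch(container: str, seconds: int = 10) -> dict:
    """Pod-template patch adding a preStop hook that sleeps before shutdown."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name when the patch is merged.
                            "name": container,
                            "lifecycle": {
                                "preStop": {
                                    # Requires a sleep binary inside the image.
                                    "exec": {"command": ["sleep", str(seconds)]}
                                }
                            },
                        }
                    ]
                }
            }
        }
    }


# "otc-container" is a hypothetical container name used for illustration.
print(json.dumps(prestop_sleep_patch("otc-container"), indent=2))
```

The sleep delays the SIGTERM to the collector, which gives the endpoint removal time to propagate before the process stops accepting connections.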
What did you expect to see?
While rolling out Pods with replicas>=2, no request should be lost.
What did you see instead?
Requests are lost when rolling out new collector pods, and it is not possible to work around this with the sleep command on the default image.
What version did you use?
0.98.0
What config did you use?
Environment
EKS 1.26 and kind 1.29
Additional context