openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

When using LoadRunner to put load on openshift-router, requests get 200-300 errors every several minutes. #8773

Closed JamesJiang1024 closed 8 years ago

JamesJiang1024 commented 8 years ago

I am an OpenShift Origin user running v1.1.6 for a proof of concept, and I found a problem: when I use LoadRunner to put load on openshift-router, requests get 200-300 errors every several minutes. I changed two parameters, resync_interval and reload_interval, but that does not seem to help. It is hard for me to find the code that decides when to restart and resync haproxy, which seems to have a bad effect on request flow.

Version

v1.1.6

Steps To Reproduce
  1. Run a simple web app, like hello world, in the cluster
  2. Use LoadRunner to test that app

Current Result

Every 10 minutes some errors occur.

Expected Result

No errors; every request returns 200.

Additional Information

The router log from when the errors occurred:

ha-router-logs.txt

roldancer commented 8 years ago

Hi, we see roughly 3% errors (HTTP 503 plus SSL handshake errors) on our OSE routers. We have more than 500 pods deployed and are doing some troubleshooting.

knobunc commented 8 years ago

This fix https://bugzilla.redhat.com/show_bug.cgi?id=1320233 drastically reduces the number of reloads. Before, the router would reload haproxy periodically even if there were no changes. Now it reloads only when there are changes.
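To illustrate the idea of reloading only on change, here is a minimal Python sketch. It is not the router's actual code (the router is written in Go, and I have not checked how the fix is implemented); `ReloadGate`, `maybe_reload`, and `reload_fn` are hypothetical names, and comparing a SHA-256 digest of the rendered config is just one assumed way to detect a change.

```python
import hashlib


class ReloadGate:
    """Hypothetical helper: call reload_fn only when the rendered config changed."""

    def __init__(self):
        self._last_digest = None  # digest of the config applied most recently
        self.reload_count = 0     # number of reloads actually performed

    def maybe_reload(self, rendered_config: str, reload_fn) -> bool:
        """Return True if a reload was triggered, False if it was skipped."""
        digest = hashlib.sha256(rendered_config.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            # Nothing changed since the last reload, so leave haproxy alone.
            return False
        reload_fn()
        self._last_digest = digest
        self.reload_count += 1
        return True


if __name__ == "__main__":
    gate = ReloadGate()
    # Stand-in for the real reload command; here it only prints.
    reload_fn = lambda: print("reloading haproxy")
    gate.maybe_reload("backend app-a", reload_fn)  # first config: reloads
    gate.maybe_reload("backend app-a", reload_fn)  # unchanged: skipped
    gate.maybe_reload("backend app-b", reload_fn)  # changed: reloads
    print("reloads performed:", gate.reload_count)
```

With a gate like this, a periodic resync that produces an identical config costs nothing, so the drop window described in this thread only opens when routes or endpoints really change.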

The reason there are drops is that haproxy uses the SO_REUSEPORT socket option to do the reload, so the old and new processes listen on the same port at the same time. There is a kernel bug where packets can sometimes get dropped if they are sent to the old process but not consumed before it terminates. Eventually that will get fixed.
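For anyone unfamiliar with the mechanism, the following Python sketch shows two listeners sharing one port, which is what lets the new process start accepting before the old one exits. It assumes Linux, where the option is exposed as `socket.SO_REUSEPORT`; `bind_reuseport` and `demo_overlap` are hypothetical helper names, and both listeners live in a single process here purely for illustration.

```python
import socket


def bind_reuseport(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Open a listening TCP socket with SO_REUSEPORT set (Linux assumed)."""
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(16)
    return s


def demo_overlap() -> bool:
    """Simulate a reload: old and new listeners overlap, then old goes away."""
    old = bind_reuseport(0)             # port 0: let the kernel pick a free port
    port = old.getsockname()[1]
    new = bind_reuseport(port)          # works only because both set SO_REUSEPORT
    # While both are open the kernel spreads new connections across them.
    # A connection queued on `old` but not yet accepted when it closes is
    # the case that gets dropped, as described above.
    old.close()                         # the "old process" terminates
    new.settimeout(2)
    client = socket.create_connection(("127.0.0.1", port), timeout=2)
    conn, _ = new.accept()              # the new listener now gets the traffic
    conn.close()
    client.close()
    new.close()
    return True


if __name__ == "__main__":
    print("new listener accepted after old closed:", demo_overlap())
```

The sketch deliberately connects only after the old listener is closed, so it shows the clean hand-off and does not try to reproduce the drop itself.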

If needed, there is a workaround that installs an iptables rule: https://github.com/openshift/openshift-docs/pull/1987

But that is somewhat involved, and probably not necessary. The first fix usually resolves the problem.