traefik hangs - stops handling requests

r0bj commented 8 years ago

traefik version: 1.0.2

Set open files limit to 1000000: ulimit -n 1000000

traefik config:

traefikLogsFile = "/var/log/traefik.log"
accessLogsFile = "/var/log/traefik-access.log"
logLevel = "DEBUG"

defaultEntryPoints = ["http"]

[entryPoints]
  [entryPoints.http]
  address = ":8000"

[retry]

[web]
address = ":8080"

[file]

[backends]
  [backends.backend1]
    [backends.backend1.servers.server1]
    url = "http://localhost:80"

[frontends]
  [frontends.frontend1]
  backend = "backend1"
    [frontends.frontend1.routes.test_1]
    rule = "Host:test-nginx.example.net"

# curl -s http://localhost:8080/api | jq .
{
  "file": {
    "frontends": {
      "frontend1": {
        "priority": 0,
        "routes": {
          "test_1": {
            "rule": "Host:test-nginx.example.net"
          }
        },
        "backend": "backend1",
        "entryPoints": [
          "http"
        ]
      }
    },
    "backends": {
      "backend1": {
        "loadBalancer": {
          "method": "wrr"
        },
        "servers": {
          "server1": {
            "weight": 0,
            "url": "http://localhost:80"
          }
        }
      }
    }
  }
}

After start everything works:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net
>
< HTTP/1.1 200 OK
< Content-Length: 612
< Content-Type: text/html
< Date: Fri, 02 Sep 2016 11:56:33 GMT
< Last-Modified: Tue, 04 Mar 2014 11:46:45 GMT
* Server nginx/1.4.6 (Ubuntu) is not blacklisted
< Server: nginx/1.4.6 (Ubuntu)
<
{ [data not shown]
* Connection #0 to host localhost left intact

Then lets test it with wrk: wrk -t30 -c400 -d30s -H "Host: test-nginx.example.net" http://localhost:8000

After some time (one or few attempts) traefik is unresponsive on port 8000:

# curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000
* Rebuilt URL to: http://localhost:8000/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Accept: */*
> Host:test-nginx.example.net

So traffic is no longer processed.

Health API is also unresponsive:

# curl -svo /dev/null http://localhost:8080/health
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*

Whats interesting dashboard is responsive (but without data):

# curl -svo /dev/null http://localhost:8080/dashboard/
* Hostname was NOT found in DNS cache
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> GET /dashboard/ HTTP/1.1
> User-Agent: curl/7.35.0
> Host: localhost:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Length: 1388
< Content-Type: text/html; charset=utf-8
< Last-Modified: Tue, 02 Aug 2016 17:29:47 GMT
< Date: Fri, 02 Sep 2016 12:03:29 GMT
<
{ [data not shown]
* Connection #0 to host localhost left intact

Traefik access logs just stop writing:

127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23814 "frontend1" "http://localhost:80" 218.853406ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23813 "frontend1" "http://localhost:80" 219.370314ms
127.0.0.1 - - [02/Sep/2016:17:01:44 +0200] "GET / HTTP/1.1" 200 612 "" "" 23812 "frontend1" "http://localhost:80" 219.933964ms

Logs (severity DEBUG) shows in log file all the time even for requests without reply:

time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 9.730448ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.423115ms"
time="2016-09-02T17:02:20+02:00" level=debug msg="Round trip: http://localhost:80, code: 200, duration: 10.664671ms"

strace during attempt to send request to traefik (curl -svo /dev/null -H "Host:test-nginx.example.net" http://localhost:8000): https://gist.github.com/r0bj/b618c74b1bc0db5c11f78db08c34fc15

So it seems that request hits backend but response isn't sent to original sender.

There are many connections in CLOSE_WAIT status: https://gist.github.com/r0bj/c647c76fe65a562ffd2e024e11a260cd

Restart treafik daemon fixes this issue.

It's easy to replicate this issue:

start vagrant host with ubuntu from official ubuntu/trusty64 image with default settings
ulimit -n 1000000
install nginx
start traefik with above config
hit it with wrk benchmark one or more times

One can also replicate this issue with sending wrk requests to non existing backend (resulting 404): wrk -t30 -c400 -d30 http://localhost:8000/

emilevauge commented 8 years ago

@r0bj for high performance, I recommend to set a higher value for MaxIdleConnsPerHost:

# If non-zero, controls the maximum idle (keep-alive) to keep per-host.  If zero, DefaultMaxIdleConnsPerHost is used.
# If you encounter 'too many open files' errors, you can either change this value, or change `ulimit` value.
#
# Optional
# Default: http.DefaultMaxIdleConnsPerHost
#
# MaxIdleConnsPerHost = 200

Can you try with this change (for example 1000)?

r0bj commented 8 years ago

Changing MaxIdleConnsPerHost value to 1000 didn't fix the issue. Daemon stop handling traffic.

r0bj commented 8 years ago

Just to be clear, traefik stays in freeze state even after wrk benchmark is over.

emilevauge commented 8 years ago

Hum, this is really weird... There must be a regression here. I will investigate on this.

emilevauge commented 8 years ago

I cannot reproduce this issue. I used your toml file and launched several times wrk -t30 -c400 -d30 -H "Host:test-nginx.example.net" http://localhost:8000/ but I still don't have any error... But Iused traefik master version. I will try with v.1.0.2 later. Maybe you could try disabling retry in the toml file?

r0bj commented 8 years ago

I've disabled retry but result was the same.

emilevauge commented 8 years ago

@r0bj I still cannot reproduce this issue on my laptop, even using v1.0.2 :confused: Traefik continues to respond after benchmarks. Which architecture are you using? amd64? Are you on bare metal or on a VM? @containous/traefik can anyone try to reproduce this at home?

r0bj commented 8 years ago

I hit this issue on bare metal amd64 host. Then I try to reproduce this on vagrant/virtual box VM (vanilla ubuntu/trusty64) with success.

abhishekamralkar commented 8 years ago

I am having the same issue, I end up writing a script which check Traefik log files and if the log file is older than 3 min it reloads Traefik.

emilevauge commented 8 years ago

@abhishekamralkar Which version are you using? Is it something new? As I still cannot reproduce this behavior, I need as much information as possible to investigate :)

SvenAbels commented 8 years ago

I can confirm this with the latest release of traefik using wrk -t40 -c400 -d30 http://...

using e.g. wrk -t10 -c40 -d30 http://... works still fine.

I was able to reproduce this on 2 physical machines with Ubuntu 16.04 running Traefik in Docker 1.12.1

As soon as the error orrurs Traefik stops reacting - even after wrk is finised. In the logs of traefik the following appears:

...
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"
time="2016-09-19T20:50:14Z" level=warning msg="Error forwarding to http://httpd:80, err: EOF"

http://httpd:80 is the backend that I have configured. Despite the error message, the backend is still there and reacts fine if I reach it manually:

# curl httpd
<html><body><h1>It works!</h1></body></html>

When I shutdown the traefik container I see:

time="2016-09-19T20:51:06Z" level=info msg="I have to go... terminated"
time="2016-09-19T20:51:06Z" level=info msg="Stopping server"

When I restart traefik everything workd fine again (I don't need to restart the httpd backend server)

Do you need any additional information?

abhishekamralkar commented 8 years ago

traefik version Version: v1.0.2 Codename: reblochon Go version: go1.6.2 Built: 2016-08-02_05:29:50PM OS/Arch: linux/amd64

This is the version I am using. Not sure whats going on but it just hanged and nothing works

emilevauge commented 8 years ago

I think I found the issue. This seems due to a race condition in https://github.com/thoas/stats. It produces when accessing to the /health endpoint of traefik's web ui, and at the same time, making requests to traefik reverse proxy.

Could you confirm that you are accessing /health during your tests (with a healthcheck or if you have the web ui opened) ?

A workaround is to avoid accessing webui during tests and change your healthcheck to /api endpoint.

I'm investigating if the issue is still present in the master branch.

SvenAbels commented 8 years ago

@emilevauge Yes, indeed that has been the case for me. I had the health UI check opened during the tests all the time.

r0bj commented 8 years ago

@emilevauge Yes, It seems that that was the case. I was accessing /health during the wrk test - this was actually a way of determinig if traefik is still working. And also I pointed marathon healtcheck to traefik /health.

ryanleary commented 8 years ago

We are experiencing similar behavior. We see this despite our healthcheck using the /api endpoint.

emilevauge commented 8 years ago

@ryanleary then could you give as much details as possible?

SvenAbels commented 8 years ago

@emilevauge I did some load tests and I can confirm that the health UI was indeed the problem - at least in my case. I can reproduce the problem many times when the health UI is opened but when it's closed, I cannot reproduce the problem anymore - no matter what I try.

SvenAbels commented 8 years ago

@emilevauge Thanks a lot. Do you know when the next traefik release will be out, which has this fix included?

emilevauge commented 8 years ago

Already released: https://github.com/containous/traefik/releases/tag/v1.0.3 :) Just waiting for docker to merge: https://github.com/docker-library/official-images/pull/2169. You can use containous/traefik:v1.0.3 in the meantime.

SvenAbels commented 8 years ago

Perfect, thanks a lot. :-)

r0bj commented 8 years ago

Thanks. I can confirm that this issue is no longer valid with v1.0.3.

traefik / traefik

traefik hangs - stops handling requests #662