
High Response time with ELB + docker-flow-proxy deployed on a swarm cluster #172

Closed ankitarya10 closed 7 years ago

ankitarya10 commented 7 years ago

I have a swarm cluster of 6 nodes (AWS instances) deployed in a VPC. While testing the webserver I noticed highly varying response times. My setup is as follows: Route53 -> ELB -> Docker-Flow-Proxy -> CherryPy webserver. To isolate the source of the problem, I created four tests:

My conclusion from the above tests is that the problem can be in one of two places: a) the connection between ELB and docker-flow-proxy, or b) the routing from docker-flow-proxy to the webserver.

ankitarya10 commented 7 years ago

I also used the proxy's debug feature from the recent release to look at the logs.

2017/03/09 21:53:47 HAPRoxy: services framework_naip-be5000/framework_naip 0/0/0/2/2 200 163 - - ---- 2/2/0/1/0 0/0 {-,"",""} "GET /naip/api/1/naip/healthcheck HTTP/1.1"
2017/03/09 21:53:49 HAPRoxy: services services/ -1/-1/-1/-1/5001 408 1263 - - cR-- 1/1/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:53:54 HAPRoxy: services services/ -1/-1/-1/-1/5001 408 1263 - - cR-- 1/1/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:53:59 HAPRoxy: services services/ -1/-1/-1/-1/5002 408 1263 - - cR-- 1/1/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:54:04 HAPRoxy: services services/ -1/-1/-1/-1/5001 408 1263 - - cR-- 0/0/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:54:09 HAPRoxy: services services/ -1/-1/-1/-1/5002 408 1263 - - cR-- 0/0/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:54:14 HAPRoxy: services services/ -1/-1/-1/-1/5002 408 1263 - - cR-- 0/0/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:54:19 HAPRoxy: services services/ -1/-1/-1/-1/5002 408 1263 - - cR-- 0/0/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:54:24 HAPRoxy: services services/ -1/-1/-1/-1/5002 408 1263 - - cR-- 0/0/0/0/0 0/0 {-,"",""} ""
2017/03/09 21:54:29 HAPRoxy: services services/ -1/-1/-1/-1/5001 408 1263 - - cR-- 0/0/0/0/0 0/0 {-,"",""} ""

The first one is the healthcheck from the ELB; I'm not sure what the others are.

ankitarya10 commented 7 years ago

To replicate the issue, I created a 3-node swarm cluster.

docker network create -d overlay proxy

curl -o proxy.yml \
    https://raw.githubusercontent.com/vfarcic/docker-flow-proxy/master/docker-compose-stack.yml

docker stack deploy -c proxy.yml proxy

docker pull ankitarya/webserver

# Get the YAML file from here: http://pastebin.com/2jjXxLAU
docker stack deploy -c docker-compose-webserver.yml app

I ran two tests:

Both tests ran fine, with a 0.28% error rate on the proxy, which I think is acceptable. However, now I am really confused about what is actually wrong. Maybe I should deploy a test cluster inside the VPC and evaluate further.
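A quick latency spot-check against this setup can be done with something like ApacheBench (the host is a placeholder; this is just a sketch, not the exact test I ran):

# Hypothetical sketch: 500 requests, 10 concurrent, against the proxy's healthcheck path.
ab -n 500 -c 10 http://<proxy-host>/naip/api/1/naip/healthcheck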

vfarcic commented 7 years ago

Can you run a test of "web service deployed in swarm + reverse proxy" (without ELB)? The result should be about the same as with it, since the ELB (according to the results) adds almost no overhead.
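For example, something along these lines, pointed at one of the swarm nodes directly instead of the ELB hostname (the node IP is a placeholder; the path is taken from your logs):

# Hit a swarm node directly, bypassing the ELB (replace <node-public-ip>).
# Comparing time_connect with time_total helps separate network latency from backend latency.
for i in $(seq 1 50); do
  curl -s -o /dev/null -w '%{time_connect} %{time_total}\n' \
    http://<node-public-ip>/naip/api/1/naip/healthcheck
done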

As a side note, I don't think I'll be able to dive into this before the weekend. I hope that's OK.

RaviPi-Kore commented 7 years ago

@ankitarya10 find the reports below. These are the reports generated by JMeter.

With Proxy (no ELB): avg is 286 ms
[JMeter report: with proxy no elb]

With Proxy + ELB: avg is 306 ms
[JMeter report: with elb proxy]

With Proxy + ELB + DNS: avg is 307 ms
[JMeter report: with_elb_proxy_dns]

vfarcic commented 7 years ago

Closing due to inactivity. Feel free to reopen if the problem persists.

shabbirkagalwala commented 7 years ago

Hello,

I am also facing this same issue. I have 2 manager nodes attached to the ELB, both running the swarm-listener and proxy services. But the response time is way too high; just removing one manager from the ELB brings up the page within seconds instead of the page timing out.

Any help would be appreciated.

Thanks !

vfarcic commented 7 years ago

If removing a manager speeds it up, it seems that the issue is not related to DFP. As an additional test, you can open a port directly on your service and then compare response times with and without DFP. Also, it would be useful to collect metrics both from the ELB and from DFP and compare them.
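For example (the service name, port, and node IP below are placeholders; adapt them to your setup):

# Publish the service's port directly on the ingress network (hypothetical name/port).
docker service update --publish-add 8081:8081 my-service

# Compare response times: directly to the service vs. through DFP.
curl -s -o /dev/null -w 'direct:  %{time_total}\n' http://<node-ip>:8081/
curl -s -o /dev/null -w 'via DFP: %{time_total}\n' http://<node-ip>/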

Do you use scripts (e.g. Terraform, CloudFormation) to set up your cluster? If you do, I could reproduce it on my account and try to figure out what's wrong.

shabbirkagalwala commented 7 years ago

I don't have a CloudFormation template, but it's a very simple configuration: 2 managers and 2 workers. The 2 managers are behind the ELB, running Docker Flow Proxy.

It seems like an issue with the classic ELB to me as well. The load balancer works fine with just one manager node, but as soon as I add the 2nd manager and it comes InService, the application just stops responding.

I am running a few tests to get the response times, and as soon as I am done I will post them here.

Thank you for your prompt reply.

shabbirkagalwala commented 7 years ago

I think I finally figured it out: when both managers are in the same AZ it works like a charm and latency is very low, but as soon as I add a manager in another AZ behind the ELB (Manager1 in AZ1 and Manager2 in AZ2) it stops working. Maybe you can help me out here: is it that DFP is not able to communicate with the containers in the 2nd AZ because some ports are not open? Or could there be some other reason?

Thank you for all your help! I appreciate it!

vfarcic commented 7 years ago

In that case, the problem is almost certainly not related to DFP but to Docker networking between AZs. DFP or, to be more precise, HAProxy, only forwards requests to one of the services. That forwarding is done through Docker networking, which handles everything else (load balancing, service discovery, and so on).
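As a quick sanity check (a sketch, not specific to your setup): Swarm needs TCP 2377 (cluster management), TCP and UDP 7946 (node-to-node communication), and UDP 4789 (overlay/VXLAN traffic) open between all nodes, so those ports must be allowed across both AZs in your security groups.

# From a node in AZ1, check that a node in AZ2 is reachable (IP is a placeholder).
nc -zv <az2-node-ip> 2377   # Swarm cluster management (TCP)
nc -zv <az2-node-ip> 7946   # node-to-node communication (TCP; UDP 7946 must be open as well)
# UDP 4789 (VXLAN) can't be probed reliably with nc; verify it in the security group rules.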

How did you create your cluster? Do you have Terraform or CloudFormation configs that I could use to reproduce it in my account?

shabbirkagalwala commented 7 years ago

No, actually I don't have a CloudFormation template yet; I'm still running tests to make sure everything works. I should have the template created by the end of this week. I will definitely post it here for you to check out.

vfarcic commented 7 years ago

Great. That way I can replicate the setup and try to pinpoint the cause of the problem. In the meantime, can you repeat the tests with some public service and send me the commands you executed (both to create the services and to test them)? That should be quick, and I can rerun the same on my cluster. If the results are different, we'll know for certain that there's something wrong with the way you set up your cluster.
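For example, the go-demo service from the DFP tutorial would work as a public test service (a sketch based on that tutorial; adjust the names and labels if needed):

# Deploy a public demo service behind the proxy.
docker network create -d overlay go-demo

docker service create --name go-demo-db \
    --network go-demo \
    mongo

docker service create --name go-demo \
    -e DB=go-demo-db \
    --network go-demo \
    --network proxy \
    --label com.df.notify=true \
    --label com.df.distribute=true \
    --label com.df.servicePath=/demo \
    --label com.df.port=8080 \
    vfarcic/go-demo

# Test through the proxy (node IP is a placeholder).
curl -i http://<any-node-ip>/demo/hello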

shabbirkagalwala commented 7 years ago

I followed these steps:

docker network create --driver overlay proxy
docker network create --driver overlay myapp

Created the proxy service with replicas=5 on all nodes (2 managers and 3 workers):

docker service create --name proxy \
    -p 80:80 \
    -p 443:443 \
    --network proxy \
    --replicas=5 \
    -e MODE=swarm \
    -e LISTENER_ADDRESS=swarm-listener \
    vfarcic/docker-flow-proxy

(Not sure if 5 replicas are required.)

Created one swarm-listener:

docker service create --name swarm-listener \
    --network proxy \
    --mount "type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock" \
    -e DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure \
    -e DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove \
    --constraint 'node.role==manager' \
    vfarcic/docker-flow-swarm-listener

Started my service using:

docker service create --name myapp \
    -e DB=go-demo-db \
    --network myapp \
    --network proxy \
    --label com.df.notify=true \
    --label com.df.distribute=true \
    --label com.df.servicePath=/demo \
    --label com.df.port=8081 \
    solarwinds/whd-embedded:latest

That's all I am doing.

While doing this, try having one of the workers or managers in a different AZ; the app times out and the webpage doesn't load.
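For reference, the kind of check that shows where the tasks ended up (a sketch):

# Where did the proxy and app tasks land?
docker service ps proxy
docker service ps myapp

# List the nodes and inspect the one in the other AZ (node ID is a placeholder).
docker node ls
docker node inspect <node-id> --pretty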

vfarcic commented 7 years ago

Can you please let me know what path your service should be accessible from? I set it up on my cluster, but I'm not sure how to open it. I guess it's not /demo. You can see it at http://dockeredg-external-1uzwfo1thaojq-1958892365.us-east-1.elb.amazonaws.com/demo .

shabbirkagalwala commented 7 years ago

Oh sorry, it should be --label com.df.servicePath=/ without the /demo. Sorry!

vfarcic commented 7 years ago

Can you confirm whether the following log is "normal" or whether something failed?

-------------------------------------------
Running Entrypoint : true
-------------------------------------------
2017-06-14 19:46:18,334 CRIT Supervisor running as root (no user in config file)
2017-06-14 19:46:18,343 INFO RPC interface 'supervisor' initialized
2017-06-14 19:46:18,343 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2017-06-14 19:46:18,343 INFO supervisord started with pid 5
2017-06-14 19:46:19,345 INFO spawned: 'whd' with pid 8
2017-06-14 19:46:19,351 INFO success: whd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2017-06-14 19:46:40,039 INFO exited: whd (exit status 0; expected)

shabbirkagalwala commented 7 years ago

This is correct! Nothing failed.

I appreciate you doing this, thank you so much!

vfarcic commented 7 years ago

I created the same service in my cluster (set up with "Docker for AWS"). It consists of three nodes, each in a different AZ. I could not find any problem. The page is loading decently fast.

I'll leave it up and running on http://dockeredg-external-1uzwfo1thaojq-1958892365.us-east-1.elb.amazonaws.com/ .

I'm not sure what the problem is, but it doesn't seem to be related to DFP. My best guess is that there's something wrong with your cluster setup. Maybe you can try setting it up with "Docker for AWS" and check whether the problem persists. If it does, you probably have some restriction on your AWS account. If it doesn't, you'll know that it is related to your current setup.

shabbirkagalwala commented 7 years ago

Thank you so much for going through and testing this out. I am trying it with "Docker for AWS" now to see if it works. It could be my security groups or NACLs, but I tried tweaking those to basically allow all inbound and outbound traffic and it still wouldn't work. If "Docker for AWS" gives me the same issue, I will contact AWS to check if there are any restrictions on my account.

Last question: do I need to have the proxy running on all nodes, and how many replicas of the swarm listener should I have if I have 3 managers?

vfarcic commented 7 years ago

There's no need to run DFP on all nodes. Docker's ingress network will forward requests from any node to DFP. Normally, I run two or three instances of DFP only for high availability. One would be enough, but in case it fails, you want to have one more until Swarm brings it back up again.

As for DFSL, run only one replica. There's no reason to have more. Actually, having more than one DFSL would only do you harm since you'd have duplicated requests to the proxy.
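In other words, something like this (service names taken from the commands above):

# A couple of proxy replicas for high availability, and a single swarm-listener.
docker service scale proxy=2
docker service scale swarm-listener=1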

I'll close this issue since it does not seem to be related to DFP. Feel free to reopen it if you disagree or if you come up with new info.