gratipay / gratipay.com

Here lieth a pioneer in open source sustainability. RIP
https://gratipay.news/the-end-cbfba8f50981

Something is really broken in our http serving #1617

Closed: zbynekwinkler closed this issue 11 years ago

zbynekwinkler commented 11 years ago

Something is, IMO, horribly broken in gittip's http serving. The file 'reset.css' is under 1 kB in size. When testing the site with 'ab', it is easy to see that no matter the number of threads, with just 20 parallel connections we cannot get more than 5 requests per second. That is about 4 kB/s transfer speed for a plain small file fully cached in memory. BTW: that transfer rate is only about twice the rate at which we write our logs.

Just for perspective: I believe each browser uses up to 6 connections per domain. The homepage makes 8 requests to the main site. When someone on twitter with 100 followers sends out a tweet and 1 out of 10 followers clicks the link, we get 10 users, 60 parallel connections, and 80 requests... and we slow down to a crawl :(

And I am not even mentioning that when hitting the homepage, the most db-optimized page on gittip, we get not 5 but 1 request per second. So, if the site does not die right away (that is the better case), we go to 10-15 s per request.

bruceadams commented 11 years ago

I see essentially the same results. The maximum transfer speed I'm seeing is about 6 Kbytes/sec; I get roughly that for any concurrency level (ab -c#) from 2 up through 20. With a single ab thread, I get 5 Kbytes/sec.

zbynekwinkler commented 11 years ago

The best thing we can do right now is consolidate the css and js files and put all static assets on a CDN. That would bring the request count per page down to 1. That alone would be an eightfold improvement for users with a cold cache (the storm of new visitors we are expecting on Friday when Chad's interview comes out).

zbynekwinkler commented 11 years ago

I guess what I would really like is to get performance like this (it is a fairly high-profile site in CZ, like the Czech google):

C:\home\zbynek\www.gittip.com>ab.exe -c20 -n100 http://www.seznam.cz/st/3.33.21/css/medium.css?3.33.21
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking www.seznam.cz (be patient).....done

Server Software:        nginx/1.4.2
Server Hostname:        www.seznam.cz
Server Port:            80

Document Path:          /st/3.33.21/css/medium.css?3.33.21
Document Length:        4167 bytes

Concurrency Level:      20
Time taken for tests:   0.773 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      450000 bytes
HTML transferred:       416700 bytes
Requests per second:    129.35 [#/sec] (mean)
Time per request:       154.616 [ms] (mean)
Time per request:       7.731 [ms] (mean, across all concurrent requests)
Transfer rate:          568.45 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        2    7   7.8      4      37
Processing:    13   98  53.2     85     347
Waiting:        5   62  60.0     50     347
Total:         22  105  59.7     89     379

Percentage of the requests served within a certain time (ms)
  50%     89
  66%     97
  75%    104
  80%    110
  90%    166
  95%    268
  98%    354
  99%    379
 100%    379 (longest request)

but even something like this would be ok :).

C:\home\zbynek\www.gittip.com>ab.exe -c20 -n100 http://assets-gittipllc.netdna-ssl.com/10.1.36/hero-bg.png
This is ApacheBench, Version 2.3 <$Revision: 1430300 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking assets-gittipllc.netdna-ssl.com (be patient).....done

Server Software:        NetDNA-cache/2.2
Server Hostname:        assets-gittipllc.netdna-ssl.com
Server Port:            80

Document Path:          /10.1.36/hero-bg.png
Document Length:        9746 bytes

Concurrency Level:      20
Time taken for tests:   4.849 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      1002800 bytes
HTML transferred:       974600 bytes
Requests per second:    20.62 [#/sec] (mean)
Time per request:       969.897 [ms] (mean)
Time per request:       48.495 [ms] (mean, across all concurrent requests)
Transfer rate:          201.94 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       39   47  11.7     41      87
Processing:    69  786 203.6    781    1187
Waiting:       65  458 258.1    448    1187
Total:        130  834 203.4    822    1252

Percentage of the requests served within a certain time (ms)
  50%    822
  66%    840
  75%    898
  80%    956
  90%   1102
  95%   1188
  98%   1235
  99%   1252
 100%   1252 (longest request)

bruceadams commented 11 years ago

This issue appears to be fundamental to Aspen. Running gittip locally (which is difficult far too often; see #1619), I get worse throughput with more requesting threads.

$ ab -n100 -c1 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    14.73 [#/sec] (mean)
Time per request:       67.910 [ms] (mean)
$ ab -n100 -c2 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    16.75 [#/sec] (mean)
Time per request:       119.411 [ms] (mean)
$ ab -n100 -c4 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    5.83 [#/sec] (mean)
Time per request:       686.101 [ms] (mean)
$ ab -n100 -c8 http://127.0.0.1:8537/assets/reset.css | grep '(mean)'
Requests per second:    4.86 [#/sec] (mean)
Time per request:       1646.592 [ms] (mean)

The CPU on my 4-core laptop never went anywhere near 100%. My theory is that Aspen is, in effect, single-threaded.

This behavior is much like Rails not too long ago. Heroku has made a lot of money because they allow one-request-at-a-time Rails apps to scale up easily across many dynos.

I assume the quickest (though not cheap) fix is to add more Gittip Aspen servers. This requires some code fixes to move user session handling (for Twitter and Bitbucket) out of memory.

zbynekwinkler commented 11 years ago

What is your value of MIN_THREADS in local.env? In production we are now running with 40.

Under the hood, Aspen uses the http server from cherrypy (cheroot). I've never looked at it any closer, so I don't know anything about it. When I get home I plan to try a simple wsgi hello world app under different wsgi containers/servers (like gunicorn or uwsgi) locally, to see what difference the server makes.
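Something like this minimal wsgi app should do - essentially the hello world from the gunicorn docs, reproduced here from memory:

def app(environ, start_response):
    # Plain WSGI "hello world": the same callable can be hosted by
    # cheroot, gunicorn, or uwsgi, so only the server varies.
    data = b"Hello, World!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(data))),
    ])
    return iter([data])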

bruceadams commented 11 years ago

I hadn't even thought about MIN_THREADS. Looking now, it's set to the default of MIN_THREADS=10.

clone1018 commented 11 years ago

I've always said aspen is a little slow. I wanted to do some tests like the ones @zwn will be doing, but I never found the time. Good luck, let us know what happens!

bruceadams commented 11 years ago

Testing a single static page on a vanilla Aspen server gives much better performance. It is fast enough that I'm not sure I can tell how well it processes requests in parallel. (These tests were run on my 4-core i5 laptop.)

ba:~$ ab -n1000 -c1 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    793.02 [#/sec] (mean)
Time per request:       1.261 [ms] (mean)
ba:~$ ab -n1000 -c2 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    441.07 [#/sec] (mean)
Time per request:       4.534 [ms] (mean)
ba:~$ ab -n1000 -c4 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    383.14 [#/sec] (mean)
Time per request:       10.440 [ms] (mean)
ba:~$ ab -n1000 -c8 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    447.95 [#/sec] (mean)
Time per request:       17.859 [ms] (mean)
ba:~$ ab -n1000 -c16 http://127.0.0.1:8080/ 2>/dev/null | grep '(mean)'
Requests per second:    490.06 [#/sec] (mean)
Time per request:       32.649 [ms] (mean)
clone1018 commented 11 years ago

@bruceadams was your test using postgression for your local Gittip?

bruceadams commented 11 years ago

@clone1018 yes, I used Postgression. I don't expect fetching a static page to exercise the database. I could capture database network traffic to be sure.

clone1018 commented 11 years ago

Since we serve the static page via aspen as a webserver, is there a chance we're opening a database connection?

chadwhitacre commented 11 years ago

We should also look at the effect of logging on performance. I have a tingly sense that lots of stdout chatter could slow us down.

bruceadams commented 11 years ago

Logging can have a huge impact on performance. The best logging frameworks jump through all kinds of hoops to avoid synchronization, excessive object creation, and other overhead.

chadwhitacre commented 11 years ago

Since we serve the static page via aspen as a webserver, is there a chance we're opening a database connection?

If we are it'd be in a hook, and the places to check would be in configure-aspen.py and gittip/wireup.py.

zbynekwinkler commented 11 years ago

@bruceadams I am not getting the same results as you.

$ ab -c1 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    166.02 [#/sec] (mean)
Time per request:       6.024 [ms] (mean)
$ ab -c2 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    117.18 [#/sec] (mean)
Time per request:       17.068 [ms] (mean)
$ ab -c4 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    87.36 [#/sec] (mean)
Time per request:       45.790 [ms] (mean)
$ ab -c8 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    78.20 [#/sec] (mean)
Time per request:       102.305 [ms] (mean)
$ ab -c16 -n100 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    77.45 [#/sec] (mean)
Time per request:       206.598 [ms] (mean)

But the pattern is clear: the more threads, the worse. I also have a 4-core i5 laptop. I am able to get to 50% total utilization.

bruceadams commented 11 years ago

@zwn Interesting. Do you have a local database? If yes, your faster results suggest the database is being hit when serving this static asset. (I haven't had a chance to check.) (Also, I could not use "localhost" because my machine treats "localhost" as IPv6 and something misbehaved.)

zbynekwinkler commented 11 years ago

@bruceadams I have a local db. I am not sure how much db work we do per request (I have looked, but we need some instrumentation to be sure).

zbynekwinkler commented 11 years ago

Running gunicorn -w 10 with the default hello world app from http://docs.gunicorn.org/en/latest/run.html:

$ ab -c1 -n100 http://localhost:8000/ | grep '(mean)'
Requests per second:    1291.87 [#/sec] (mean)
Time per request:       0.774 [ms] (mean)
$ ab -c10 -n100 http://localhost:8000/ | grep '(mean)'
Requests per second:    5396.07 [#/sec] (mean)
Time per request:       1.853 [ms] (mean)
$ ab -c100 -n100 http://localhost:8000/ | grep '(mean)'
Requests per second:    3472.34 [#/sec] (mean)
Time per request:       28.799 [ms] (mean)

That is a different ballpark.

chadwhitacre commented 11 years ago

It is possible to mount Aspen as a WSGI app on gunicorn:

http://aspen.io/deployment/
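Per those docs, the mount should be roughly the following (I'm quoting the module path from memory, so it may differ by Aspen version):

$ pip install gunicorn
$ gunicorn aspen.wsgi:website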

bruceadams commented 11 years ago

Watching network traffic to the database: yes, there is a round trip to the database for every fetch of assets/reset.css. I'll see if I can figure out where that is in the code (by stumbling around in the dark).
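For the record, I watched with something along these lines, 5432 being the Postgres port:

$ sudo tcpdump port 5432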

chadwhitacre commented 11 years ago

@bruceadams It's probably in a hook somewhere. Actually, it's probably related to authentication.

chadwhitacre commented 11 years ago

I'm in the process of setting up a load testing environment locally. I have the FreeBSD machine I set up while my laptop was out of commission. I'm installing ab there to hit Gittip on my laptop over the LAN.

chadwhitacre commented 11 years ago

We removed the extra db hit per request in #1626. However, I'm still seeing ~3 req/sec for reset.css in production, as we were the other day.

chadwhitacre commented 11 years ago

For kicks, I stood up an instance of Gittip at Digital Ocean (smallest droplet, NY1). It's reading from the production database and is configured with MIN_THREADS=40 and DATABASE_MAXCONN=100, as in production. I edited configure-aspen.py by hand to turn off the call to utils.update_homepage_tables_once.

In general, I'm seeing ~115 r/s for /assets/reset.css and ~9 r/s for /. From our Heroku dyno we're getting ~6 and ~3, respectively. In other words, we're faster from the DO droplet than we are from the Heroku dyno, even though both are reading from the same Heroku Postgres database.

Two possible ways to account for the discrepancy are SSL and a baseline load of ~2 req/sec, both of which are present on the Heroku dyno but not the DO droplet.

chadwhitacre commented 11 years ago

Testing locally I am seeing ~177 r/s for /assets/reset.css and ~19 r/s for /. That's with a recent backup in Postgres on a different machine on the same LAN.

chadwhitacre commented 11 years ago

I ran two ab commands at once, one from the laptop and another from the FreeBSD machine. The FreeBSD machine is serving Gittip, connected to Postgres on the laptop. I set up a slow -n300 -c1 trickle against / from the laptop, then fired off a -n100 -c20 against /assets/reset.css from the FreeBSD machine. (I get lots of failed requests in such circumstances, but some runs complete.) I'm seeing 130 r/s for /assets/reset.css even while the server is under load for /, and meanwhile the / run sees 11 r/s.
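Concretely, something like this (the hostname is a placeholder for the FreeBSD box serving Gittip):

$ ab -n300 -c1 http://freebsd-box:8537/                    # from the laptop
$ ab -n100 -c20 http://127.0.0.1:8537/assets/reset.css     # from the FreeBSD machine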

This result seems to weigh against the "baseline load" explanation for the slowness of reset.css at Heroku.

chadwhitacre commented 11 years ago

Using Gunicorn locally, I am seeing similar results as with Cheroot: ~170 r/s for /assets/reset.css and ~20 r/s for /.

Edit: I may very well not have been running Gunicorn with multiple processes, cf. https://github.com/gittip/www.gittip.com/issues/1617#issuecomment-27149658.

chadwhitacre commented 11 years ago

Which, I suppose, is all to say that, yes, something is really broken in our http serving.

:broken_heart:

chadwhitacre commented 11 years ago

BTW, here are the steps I followed to run Gittip with Gunicorn on port 8537:
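In outline: install gunicorn, load the config from local.env, and mount Aspen's website on port 8537 (exact commands from memory; the way the config vars get exported is an assumption):

$ pip install gunicorn
$ set -a; . ./local.env; set +a        # export Gittip's config vars
$ gunicorn --bind=0.0.0.0:8537 aspen.wsgi:website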

chadwhitacre commented 11 years ago

In general I've been using -n300 -c20 as we did the other day.

chadwhitacre commented 11 years ago

I believe that this ticket is about the two-orders-of-magnitude difference in serving /assets/reset.css between Heroku and the other environments we've looked at. The difference for / between Heroku and elsewhere is not as pronounced. My hunch is that the smaller difference has the same root cause as the larger one, but that in the dynamic case the cause is swamped by time spent querying the database and other factors.

chadwhitacre commented 11 years ago

I've opened a support ticket with Heroku:

We're stuck trying to tune our application's performance. Could we get some eyeballs?

https://github.com/gittip/www.gittip.com/issues/1617

tl;dr In a couple other environments (local, vps), our app runs between 100 and 200 requests per second for a static page, and 10 and 20 requests per second for a dynamic page. On our Heroku dyno we're only seeing between 1 and 10 requests per second for both static and dynamic pages. Why are we off an order of magnitude for dynamic pages and two orders of magnitude for static pages?

P.S. Could we cc: bruce.adams@acm.org and zbynek.winkler@gmail.com on this ticket, please? Thanks! :-)

https://help.heroku.com/tickets/101694

zbynekwinkler commented 11 years ago

Which, I suppose, is all to say that, yes, something is really broken in our http serving.

:broken_heart:

I was kind of afraid of that. BTW: !m @whit537. I am excited to see that you kept debugging this even after I went to bed yesterday :smile:

I believe that this ticket is about the order of magnitude difference in serving /assets/reset.css between Heroku and the other environments we've looked at.

I agree. That is what prompted me to open this ticket in the first place. When a static file is served at 4 kB/s, something must be horribly broken. :cry:

chadwhitacre commented 11 years ago

I just benchmarked http://motivate.im/, which is running at Heroku using this static hosting setup. I'm seeing 40 r/s, which is an order of magnitude more than we're seeing, and one or two orders of magnitude less than I would expect.

clone1018 commented 11 years ago

Wait, so motivate.im is running as just index.html or something? And you're getting 40 r/s? Seriously?

chadwhitacre commented 11 years ago

@clone1018 Yes. :-(

clone1018 commented 11 years ago

So after running these tests against a base nginx install, here are the results (for the static page):

whit's personal home internet (13 hops): 95.68 rps
DO VPS next door: 4137.70 rps
dedi 12 hops away: 324.90 rps

zbynekwinkler commented 11 years ago

I have tested locally our default development setup (cheroot with 10 threads) against gunicorn running with 10 workers, both on a 4-core laptop. The main difference is threads versus processes.

Threads (cheroot):

$ ab -c20 -n1000 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    63.99 [#/sec] (mean)
Time per request:       312.534 [ms] (mean)

Processes (gunicorn):

$ ab -c20 -n1000 http://localhost:8537/assets/reset.css | grep '(mean)'
Requests per second:    742.54 [#/sec] (mean)
Time per request:       26.935 [ms] (mean)

Doing the same for the current homepage.

Threads:

$ ab -c20 -n1000 http://localhost:8537/ | grep '(mean)'
Requests per second:    33.79 [#/sec] (mean)
Time per request:       591.962 [ms] (mean)

Processes:

$ ab -c20 -n1000 http://localhost:8537/ | grep '(mean)'
Requests per second:    165.96 [#/sec] (mean)
Time per request:       120.510 [ms] (mean)

zbynekwinkler commented 11 years ago

I have created a new heroku app. Given the choice, I picked the Europe region, since that is closer to me :). I have not added anything there (it returns the default 'get started' page) and repeated the load test.

$ ab -c20 -n1000 http://sheltered-lowlands-9512.herokuapp.com/ | grep '(mean)'
Requests per second:    104.96 [#/sec] (mean)
Time per request:       190.554 [ms] (mean)

I have not been able to find out how many hops away the site is; all I know is that it is more than 13. I have also created another app in the default U.S. region.

$ ab -c20 -n100  http://calm-temple-2891.herokuapp.com/ | grep '(mean)'
Requests per second:    62.91 [#/sec] (mean)
Time per request:       317.930 [ms] (mean)

I've also run a redirect load test on the main gittip site:

$ ab -c20 -n100  http://gittip.herokuapp.com/ | grep '(mean)'
Requests per second:    45.44 [#/sec] (mean)
Time per request:       440.115 [ms] (mean)

That should rule out any SSL-related slowdowns. Another reference point is redoing the static file benchmark:

$ ab -c20 -n100  http://motivate.im/ | grep '(mean)'
Requests per second:    45.22 [#/sec] (mean)
Time per request:       442.240 [ms] (mean)

I did this to get some baseline to compare against. If we get to these numbers on the main site, it means the rest is on heroku's side (?)

zbynekwinkler commented 11 years ago

I have put a hello world up on both test apps - one under gunicorn (lowlands), the other under uwsgi (calm-temple). The rps and times remained similar to the default setup (uwsgi seems to be responding better - the longest request time on gunicorn is larger than on uwsgi, despite the fact that uwsgi/calm-temple-2891 is further from me).

zbynekwinkler commented 11 years ago

Could it be some form of anti-DoS measure? We could do some coordinated ab testing of one of the test apps I've set up. If our results are the same whether we run all at once or separately, that could mean heroku can handle the load but is somehow limiting requests by IP.

I've run a simple timeit test, and it seems that the single-threaded performance is only a bit worse than a 2.67 GHz i5, which should be plenty.
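Something along these lines, run on the dyno and on my laptop - the exact expression doesn't matter much, it is just a CPU-bound one-liner to compare the two machines:

$ python -m timeit "sum(range(1000000))"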

chadwhitacre commented 11 years ago

Hi,

Are you able to add the New Relic add-on here so we can dig into this a bit?

Neil

chadwhitacre commented 11 years ago

Thanks Neil. We looked at adding New Relic a while ago, but it was pretty complicated, so we ended up logging metrics to Librato instead (#1508). Do you have an idea of what you're looking for? Here's our dashboard.

Also, did we get [Bruce's email] and [Zbynek's email] cc'd on this ticket?

Thanks! :-)

chad

zbynekwinkler commented 11 years ago

I don't have access to the ticket.

zbynekwinkler commented 11 years ago

I still do not have access to that ticket. And I do not think we need New Relic: the two simple apps I've set up display similar behavior, and they basically do nothing (hello world). Had I access to that ticket, I'd have told Neil myself. Does Neil have a github nick so we can ping him from here? There is lots of info here that is worth reading.

chadwhitacre commented 11 years ago

cc: @neilmiddleton

neilmiddleton commented 11 years ago

Hey @zwn, you should get updates to that ticket, but you won't have access to it. If we keep updates in the ticket then whoever is taking a look at it has everything they need.

chadwhitacre commented 11 years ago

Thanks @neilmiddleton. We were told over on https://help.heroku.com/tickets/100019 that we could request a manual CC on a ticket, and that we'd be able to have multiple people access one ticket. This works for https://help.heroku.com/tickets/100017, which @zwn opened and which I have access to.

neilmiddleton commented 11 years ago

Interesting - you're all added as cc's, so it should be the same. Let me know if it keeps happening.

kennethreitz commented 11 years ago

I haven't taken the time to read through all the comments yet, but if there's anything that I can do to help, I'm at your service.

Make sure that you aren't running ab on OS X. The one that Apple ships has some odd limitations that yield some oddly underperforming results.