hypothesis / via

Proxies third-party PDF files and HTML pages with the Hypothesis client embedded, so you can annotate them
https://via.hypothes.is/
BSD 2-Clause "Simplified" License

Benchmark py-proxy compared to legacy via #9

Closed: hmstepanek closed this issue 4 years ago

hmstepanek commented 4 years ago

Problem

We need to verify that the new py-proxy service is actually better than legacy via; otherwise this is all for naught.

Benchmarks to run

hmstepanek commented 4 years ago

Python app / endpoint (routes the request to either legacy via or /pdf based on the Content-Type header):

Requests      [total, rate, throughput]  400, 80.24, 13.63
Duration      [total, attack, wait]      27.742332438s, 4.985101829s, 22.757230609s
Latencies     [mean, 50, 95, 99, max]    12.47475824s, 11.77142206s, 23.24519033s, 23.752083977s, 25.577029181s
Bytes In      [total, mean]              8453592, 21133.98
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    94.50%
Status Codes  [code:count]               0:22  200:378  

Python app /pdf endpoint:

Requests      [total, rate, throughput]  400, 80.20, 9.56
Duration      [total, attack, wait]      25.639132592s, 4.987523773s, 20.651608819s
Latencies     [mean, 50, 95, 99, max]    8.150416293s, 6.242453982s, 22.185068105s, 22.570647339s, 22.625626935s
Bytes In      [total, mean]              5479180, 13697.95
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    61.25%
Status Codes  [code:count]               0:155  200:245 

Legacy via:

Requests      [total, rate, throughput]  400, 80.22, 14.92
Duration      [total, attack, wait]      19.778355379s, 4.986168984s, 14.792186395s
Latencies     [mean, 50, 95, 99, max]    7.016142172s, 7.346745307s, 16.245971912s, 16.518444676s, 16.903189682s
Bytes In      [total, mean]              6588235, 16470.59
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    73.75%
Status Codes  [code:count]               0:105  200:295 

What does this mean?

Note that legacy via has a single endpoint that both determines the content type and serves the PDF with the client embedded, whereas the py-proxy service has one endpoint that determines the content type and a separate endpoint that serves the PDF. So when benchmarking we are hammering legacy via's single endpoint versus hammering both py-proxy's / and /pdf endpoints. A sketch of py-proxy's routing step follows.
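
For context, here is roughly what that routing step does. This is a minimal illustrative sketch, not py-proxy's actual code: the route helper, the URL handling, and the LEGACY_VIA_URL constant are all assumptions made for the example.

# Illustrative sketch only -- not py-proxy's real implementation.
import requests

LEGACY_VIA_URL = "http://localhost:9080"  # assumed legacy via base URL

def route(target_url: str) -> str:
    """Return the location to redirect the browser to for target_url."""
    # Fetch the third-party document's headers; stream=True means the
    # body isn't downloaded just to read Content-Type.
    with requests.get(target_url, stream=True) as response:
        content_type = response.headers.get("Content-Type", "")

    if "application/pdf" in content_type:
        # PDFs are served by py-proxy's own /pdf endpoint, so the browser
        # ends up making a second request (the redirect discussed below).
        return f"/pdf/{target_url}"

    # Everything else (HTML etc.) is handed off to legacy via.
    return f"{LEGACY_VIA_URL}/{target_url}"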

Legacy via has a success rate of ~74% of requests, whereas py-proxy has a combined success rate of ~58%: 94.5% of the / content-type requests succeed, and 61.25% of the follow-up /pdf requests succeed, giving 0.945 × 0.6125 ≈ 0.58. The wait time on the requests was about 20s in py-proxy compared to about 14s in legacy via. The average latency within py-proxy for receiving a PDF, taking the time it takes to determine the content type plus the time it takes to return the HTML with the PDF embedded (note this doesn't include network time, which there would be in the real world, since the / endpoint returns a redirect and the browser then issues a second request to the /pdf endpoint), comes to roughly 18s (the sum of the two endpoints' median latencies), compared to only ~7s for legacy via.
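
For transparency, those combined figures fall straight out of the two py-proxy reports. A quick check in plain Python, with the values copied from the 80 requests/s reports above:

# Values copied from the 80 req/s reports above; this is just arithmetic.
root_success, pdf_success = 0.9450, 0.6125  # "/" and "/pdf" success ratios
root_median, pdf_median = 11.77, 6.24       # median latencies, seconds

# A PDF only loads if the "/" request AND the follow-up "/pdf" request
# both succeed, so the success ratios multiply...
print(f"combined success rate: {root_success * pdf_success:.1%}")    # 57.9%

# ...and the latencies add (a real browser's redirect round trip would
# add further network time on top).
print(f"combined median latency: {root_median + pdf_median:.1f}s")   # 18.0s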

In conclusion I'd say we have not achieved our goal of improving legacy via and we would be better off using legacy via than this new solution.

hmstepanek commented 4 years ago

When benchmarking at 20 requests/s, things look better for py-proxy.

Python app / endpoint (routes the request to either legacy via or /pdf based on the Content-Type header):

Requests      [total, rate, throughput]  100, 20.20, 14.58
Duration      [total, attack, wait]      6.860028998s, 4.950608508s, 1.90942049s
Latencies     [mean, 50, 95, 99, max]    2.237923357s, 2.345697315s, 3.201397584s, 3.319143256s, 3.320226942s
Bytes In      [total, mean]              2236400, 22364.00
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:100

Python app /pdf endpoint:

Requests      [total, rate, throughput]  100, 20.20, 18.10
Duration      [total, attack, wait]      5.52462065s, 4.950061468s, 574.559182ms
Latencies     [mean, 50, 95, 99, max]    1.187881775s, 1.221590978s, 1.68770485s, 1.750387192s, 1.773651346s
Bytes In      [total, mean]              2236400, 22364.00
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:100 

Legacy via:

Requests      [total, rate, throughput]  100, 20.19, 10.93
Duration      [total, attack, wait]      9.145717236s, 4.952518924s, 4.193198312s
Latencies     [mean, 50, 95, 99, max]    2.671858783s, 2.82267383s, 4.023254697s, 4.192488682s, 4.193198312s
Bytes In      [total, mean]              2233300, 22333.00
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:100 

What does this mean?

The average PDF request in py-proxy takes ~3.5s versus ~2.8s in legacy via; this indicates that the user experience of loading a PDF will be noticeably slower than with legacy via. On the other hand, the combined wait time of requests in py-proxy was ~2.5s compared to ~4s in legacy via. So at 20 requests/s, which is around the current production PDF request rate on lms, the py-proxy service looks like it will hold up (the arithmetic is checked in the snippet below). What's interesting to note about these comparisons (I ran another benchmark at 40 requests/s) is that while py-proxy has a smaller total wait time at lower request rates, it degrades much faster than legacy via as the request rate increases. For example, at 40 requests/s legacy via is still able to handle all requests successfully while py-proxy has a success rate of 94%.
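
As before, these combined figures can be checked directly against the 20 requests/s reports (values copied from the reports; the / and /pdf numbers add because a PDF load needs both requests):

# Median latencies and wait times from the 20 req/s reports, in seconds.
root_median, pdf_median, via_median = 2.346, 1.222, 2.823
root_wait, pdf_wait, via_wait = 1.909, 0.575, 4.193

print(f"PDF latency: py-proxy {root_median + pdf_median:.2f}s vs via {via_median:.2f}s")
print(f"wait time:   py-proxy {root_wait + pdf_wait:.2f}s vs via {via_wait:.2f}s")
# -> 3.57s vs 2.82s latency; 2.48s vs 4.19s wait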

Additional thoughts:

It's also worth noting that while py-proxy is able to handle the current traffic load of lms, we are continuing to grow our lms user base, which means the request rate will increase. So while this may work in production now, it may not work in the future, and then we will be in much the same place we are currently in with legacy via. It also may not work at all if we want to use this to replace existing via for non-lms users. Of course you can always throw more resources at the problem (spin up more app instances, etc.), but we set out to make something that is more performant, more readable, and scales better than legacy via, and what these results show is that this new service actually degrades more quickly than legacy via, meaning legacy via can handle a higher request rate than this new service.

hmstepanek commented 4 years ago

This was tested using vegeta, which sends requests to a particular URL at a particular request rate, against make dev with 4 workers (so talking directly to gunicorn without passing through nginx) for 5 seconds. As this service runs in production, there would be some additional request latency from having to go through nginx to reach gunicorn, but note this additional latency is not accounted for in this benchmarking. The same is true for legacy via: nginx runs in production but not locally in this benchmarking. The same example PDF URL shown below was used in all tests. This was compared against legacy via running locally in Docker.

py-proxy benchmark commands:

The following commands were run concurrently:

echo "GET http://localhost:9082/http://pdf995.com/samples/pdf.pdf?via.open_sidebar=1" | vegeta attack -rate=20/1s -duration=5s | tee results.bin | vegeta report

echo "GET http://localhost:9082/pdf/http://pdf995.com/samples/pdf.pdf?via.open_sidebar=1" | vegeta attack -rate=20/1s -duration=5s | tee results.bin | vegeta report

legacy via benchmark commands:

echo "GET http://localhost:9080/http://pdf995.com/samples/pdf.pdf?via.open_sidebar=1" | vegeta attack -rate=20/1s -duration=5s | tee results.bin | vegeta report

See the vegeta report docs for how to read the report results.

Note: I didn't include the error messages of the failing requests in the reports above, but here they are in case anyone is interested:

ajpeddakotla commented 4 years ago

After talking with @hmstepanek, we're going to hold off on further analysis of this issue until the rewrites for both PDF and HTML are completed. Once they are, we can figure out what work is needed to optimize py-proxy.

seanh commented 4 years ago

Closing this in favour of two separate cards for future load testing and performance testing: https://github.com/hypothesis/py-proxy/issues/12, https://github.com/hypothesis/py-proxy/issues/13