codefromthecrypt opened this issue 5 years ago
I'm going to run a comparison before (2.15.0) and after (2.16.1) to see if the memory usage is different. It is probably not a huge deal to OOM if things are timing out anyway, but better to hunt now before we announce, in case it is something small.
here are some charts for a 100s run of brave-webmvc-example against the latest docker image with 2.16.1 and ES 7 (ES_HTTP_LOGGING=basic)
here's the same run; the only change is the zipkin image, to 2.15.0, leaving the ES image etc. exactly the same.
I will now do the same without ES_HTTP_LOGGING=basic on, just in case something about basic logging is using a lot of memory.
this is 2.16.1 with complete defaults, no ES logging.
@anuraaga @trustin can I leave this one to you to look into? I don't want us to do a release announcement saying we did all this stuff to make things efficient when the results seem much worse; it wouldn't be a good reflection on armeria. Probably there's something we are doing that ends up less efficient and slower than before. Since we were using a very mature version of okhttp, they had a long time to find all the bugs, while armeria, and especially our use of it, is quite new. There's probably some opportunity to figure out what's up.
and btw, in both cases we were using the armeria server; the main thing we did was change the elasticsearch client to use armeria vs okhttp. It ended up being a lot of work because we also switched things like the JSON codec away from moshi/okio (which shares buffers with okhttp underneath), and also changed a lot of the async behavior. I don't think the health check has anything to do with it, but if necessary I could run something with it turned off.
Going to take a look at this. Assuming the graphs are from the normal wrk command against brave-springmvc?
Also just wondering, do you start up both a Prometheus and a Grafana to get those graphs? Wish it was easier to get graphs when just running locally.
Looked a bit; haven't seen any OOM on my computer yet (probably because it has more memory than the container).
One finding: when running 2.15.0 with throttling disabled for comparison (i.e. setting ES_MAX_REQUESTS=1000000000), I could reproduce all requests becoming timeouts right away, with a similar spike in latency (viewing /internal/metrics instead of Grafana, which would have been more precise but needs setup steps; I may be misreading the numbers but don't think so).
Will try in the coming days with less extreme loads to see what happens.
Actually, I did the opposite: mocking out elasticsearch with an Armeria mock server that just returns a fixed 200 response, okhttp is definitely much faster.
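For context, a fake backend along these lines can be stood up in a few lines of Armeria. This is only a minimal sketch of the idea, assuming a current Armeria API (`Server.builder`, `serviceUnder`); it is not the actual mock used in the test, and a real Elasticsearch bulk response carries a JSON body that this stub omits:

```java
import com.linecorp.armeria.common.HttpResponse;
import com.linecorp.armeria.common.HttpStatus;
import com.linecorp.armeria.server.Server;

// Minimal stand-in for Elasticsearch: every request, on any path, gets an
// empty 200 so only the Zipkin-side HTTP client is being measured.
public final class FakeElasticsearch {
  public static void main(String[] args) {
    Server server = Server.builder()
        .http(9200)
        .serviceUnder("/", (ctx, req) -> HttpResponse.of(HttpStatus.OK))
        .build();
    server.closeOnJvmShutdown();
    server.start().join();
  }
}
```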
okhttp

```
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001",} 67.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001048576",} 68.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001398101",} 74.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001747626",} 79.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002097151",} 86.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002446676",} 89.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002796201",} 90.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003145726",} 95.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003495251",} 95.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003844776",} 96.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.004194304",} 96.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.005592405",} 100.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.006990506",} 103.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.008388607",} 104.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.009786708",} 104.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.011184809",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.01258291",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.013981011",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.015379112",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.016777216",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.022369621",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.027962026",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.033554431",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.039146836",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.044739241",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.050331646",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.055924051",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.061516456",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.067108864",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.089478485",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.111848106",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.134217727",} 107.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.156587348",} 107.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.178956969",} 107.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.20132659",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.223696211",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.246065832",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.268435456",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.357913941",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.447392426",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.536870911",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.626349396",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.715827881",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.805306366",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.894784851",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.984263336",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.073741824",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.431655765",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.789569706",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.147483647",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.505397588",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.863311529",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.22122547",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.579139411",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.937053352",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="4.294967296",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="5.726623061",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="7.158278826",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="8.589934591",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="10.021590356",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="11.453246121",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="12.884901886",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="14.316557651",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="15.748213416",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="17.179869184",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="22.906492245",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="28.633115306",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="30.0",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="+Inf",} 108.0
http_server_requests_seconds_count{method="POST",status="202",uri="/api/v2/spans",} 108.0
http_server_requests_seconds_sum{method="POST",status="202",uri="/api/v2/spans",} 0.4687505
```

armeria

```
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001048576",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001398101",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001747626",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002097151",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002446676",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002796201",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003145726",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003495251",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003844776",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.004194304",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.005592405",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.006990506",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.008388607",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.009786708",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.011184809",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.01258291",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.013981011",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.015379112",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.016777216",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.022369621",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.027962026",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.033554431",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.039146836",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.044739241",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.050331646",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.055924051",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.061516456",} 1.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.067108864",} 1.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.089478485",} 2.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.111848106",} 4.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.134217727",} 5.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.156587348",} 5.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.178956969",} 5.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.20132659",} 9.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.223696211",} 13.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.246065832",} 21.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.268435456",} 36.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.357913941",} 90.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.447392426",} 99.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.536870911",} 100.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.626349396",} 100.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.715827881",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.805306366",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.894784851",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.984263336",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.073741824",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.431655765",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.789569706",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.147483647",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.505397588",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.863311529",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.22122547",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.579139411",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.937053352",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="4.294967296",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="5.726623061",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="7.158278826",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="8.589934591",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="10.021590356",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="11.453246121",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="12.884901886",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="14.316557651",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="15.748213416",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="17.179869184",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="22.906492245",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="28.633115306",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="30.0",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="+Inf",} 101.0
http_server_requests_seconds_count{method="POST",status="202",uri="/api/v2/spans",} 101.0
http_server_requests_seconds_sum{method="POST",status="202",uri="/api/v2/spans",} 29.0299599
```
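For reference on the mock-backend numbers above: the okhttp histogram sums to roughly 0.469 s over 108 requests (about 4.3 ms mean per POST), while the armeria one sums to roughly 29.03 s over 101 requests (about 287 ms mean), roughly a 66x difference.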
Thanks @adriancole and @anuraaga. I will also take a look once I'm back from Taiwan mid next week.
Quick update from my investigation: it seems that handleAsync(..., ctx.contextAwareExecutor()) adds tens of milliseconds, and changing back to handle made the numbers look much more sane.
This does not make that much sense to me based on my expectations of how handleAsync works, but perhaps we need to tweak it to run callbacks inline when already on the event loop thread.
Will verify self-tracing before sending out a change to switch to the synchronous handle.
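To make the before/after concrete, here is a minimal, self-contained sketch of the two completion styles being compared. The executor and the pre-completed future are only placeholders for ctx.contextAwareExecutor() and the Elasticsearch client call; this is not the actual Zipkin storage code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class HandleVsHandleAsync {
  public static void main(String[] args) {
    // Placeholder for ctx.contextAwareExecutor() in the real code.
    ExecutorService contextAwareExecutor = Executors.newSingleThreadExecutor();
    // Placeholder for the future returned by the Elasticsearch client call.
    CompletableFuture<String> esResponse = CompletableFuture.completedFuture("200 OK");

    // Before: handleAsync re-dispatches the completion callback onto the
    // context-aware executor. If the future completes on an event loop thread,
    // that extra hop can add scheduling latency under load.
    esResponse.handleAsync((response, error) -> response, contextAwareExecutor);

    // After: handle runs the callback inline on whichever thread completed the
    // future, avoiding the hop (context propagation then needs other handling).
    esResponse.handle((response, error) -> response);

    contextAwareExecutor.shutdown();
  }
}
```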
ok thanks, this seems like great progress. also that's right: before, we had a semaphore which would have dropped immediately.
wrt the callbacks, if we need to temporarily remove self-tracing that's not a big deal, as few use it. since the cat is already out of the bag (we've already released, even though we haven't announced it with notes etc.), any low-hanging fruit to make sure people have a good experience is much more important than self-tracing.
Thanks for the progress, @anuraaga. cc also @trustin. Here's current master vs 2.15.0 on the same 100s burst, in the same docker setup with everything else identical. It is much, much better than before, as it no longer ends up blocking clients for seconds, which could really mess things up.
We should admit that where we are now still means significantly more heap usage, more and longer GCs, and an order of magnitude higher p90 response time vs 2.15.0 (OkHttp-based) for the same load. This doesn't mean we have to enter a pit of work, but I do think we need to at least acknowledge it.
Using zipkin 2.15.0
Using the latest commit (verified the actual commit ID via the /info endpoint)
TL;DR: we should probably continue this investigation, or else not mention efficiency as a rationale for upgrading in the release notes. There are other wins, such as load balancing and platform coherency, that we can still mention, and I know many will be happy to have those features as they were directly requested.
I created an issue to make troubleshooting easier, though it isn't hard to start the frontend/backend and smack it with wrk (agreed the current approach doesn't help in the cloud, but we aren't using the cloud here either): https://github.com/openzipkin/sleuth-webmvc-example/issues/27
I would maybe expect this when throttling is enabled, but I was surprised to see an OOM when it wasn't. This happened when sending a bunch of requests to Elasticsearch, with both the zipkin server and Elasticsearch in the same container. I've never seen an OOM on the zipkin side before, even with ES_HTTP_LOGGING=basic.