openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0
17.01k stars 3.09k forks source link

Efficiency investigation #2757

Open codefromthecrypt opened 5 years ago

codefromthecrypt commented 5 years ago

I would maybe expect this when throttling is enabled, but I was surprised to see an OOM when it wasn't enabled. This happened when sending a bunch of requests to elasticsearch where both the zipkin server and elasticsearch were in the same container. I've never seen an OOM zipkin side, even with ES_HTTP_LOGGING=basic prior

zipkin                      | 2019-08-17 12:39:26.914  WARN 1 --- [orker-epoll-2-6] c.l.a.c.l.LoggingClient                  : Response: {startTime=2019-08-17T12:39:23.939Z(1566045563939802), length=0B, duration=0ns, cause=com.linecorp.armeria.client.ResponseTimeoutException, headers=[]}
zipkin                      | 
zipkin                      | com.linecorp.armeria.client.ResponseTimeoutException: null
zipkin                      | 
zipkin                      | Exception in thread "armeria-common-blocking-tasks-1-31" java.lang.OutOfMemoryError: Java heap space
zipkin                      |   at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:264)
zipkin                      |   at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:679)
zipkin                      |   at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:689)
zipkin                      |   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:245)
zipkin                      |   at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
zipkin                      |   at io.netty.buffer.PoolArena.allocate(PoolArena.java:147)
zipkin                      |   at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:325)
zipkin                      |   at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:168)
zipkin                      |   at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:159)
zipkin                      |   at zipkin2.elasticsearch.internal.BulkCallBuilder.write(BulkCallBuilder.java:154)
zipkin                      |   at zipkin2.elasticsearch.internal.BulkCallBuilder.build(BulkCallBuilder.java:134)
zipkin                      |   at zipkin2.elasticsearch.ElasticsearchSpanConsumer$BulkSpanIndexer.newCall(ElasticsearchSpanConsumer.java:122)
zipkin                      |   at zipkin2.elasticsearch.ElasticsearchSpanConsumer.accept(ElasticsearchSpanConsumer.java:62)
zipkin                      |   at zipkin2.collector.Collector.store(Collector.java:200)
zipkin                      |   at zipkin2.collector.Collector$StoreSpans.run(Collector.java:229)
zipkin                      |   at com.linecorp.armeria.common.AbstractRequestContext.lambda$makeContextAware$1(AbstractRequestContext.java:69)
zipkin                      |   at com.linecorp.armeria.common.AbstractRequestContext$$Lambda$889/0x00000001006a7440.run(Unknown Source)
zipkin                      |   at com.linecorp.armeria.common.AbstractRequestContext.lambda$makeContextAware$1(AbstractRequestContext.java:69)
zipkin                      |   at com.linecorp.armeria.common.AbstractRequestContext$$Lambda$889/0x00000001006a7440.run(Unknown Source)
zipkin                      |   at io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:160)
zipkin                      |   at io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:144)
zipkin                      |   at io.micrometer.core.instrument.Timer$$Lambda$897/0x00000001006a3840.run(Unknown Source)
zipkin                      |   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
zipkin                      |   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
zipkin                      |   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
zipkin                      |   at java.base/java.lang.Thread.run(Thread.java:834)
codefromthecrypt commented 5 years ago

I'm going to run a comparison before (2.15.0) and after (2.16.1) to see if the memory usage is different. It is probably not a huge deal to OOM if things are timing out anyway, but better to hunt now before we announce, in case it is something small.

codefromthecrypt commented 5 years ago

here's some charts for a 100s run of brave-webmvc-example against latest docker with 2.16.1 and ES 7 (ES_HTTP_LOGGING=basic)

Screenshot 2019-08-17 at 8 59 26 PM Screenshot 2019-08-17 at 8 59 44 PM Screenshot 2019-08-17 at 8 59 58 PM Screenshot 2019-08-17 at 9 00 15 PM

codefromthecrypt commented 5 years ago

here's the same run, I only changed the image of zipkin to 2.15.0 leaving the ES image etc exactly the same.

Screenshot 2019-08-17 at 9 06 31 PM Screenshot 2019-08-17 at 9 06 56 PM Screenshot 2019-08-17 at 9 07 05 PM Screenshot 2019-08-17 at 9 07 20 PM

codefromthecrypt commented 5 years ago

I will now do the same without basic on, just in case something about basic logging is using a lot of memory

codefromthecrypt commented 5 years ago

this is 2.16.1 with complete defaults, no ES logging.

Screenshot 2019-08-17 at 9 15 49 PM Screenshot 2019-08-17 at 9 16 07 PM Screenshot 2019-08-17 at 9 16 20 PM Screenshot 2019-08-17 at 9 16 56 PM

codefromthecrypt commented 5 years ago

@anuraaga @trustin can I leave this one to you to look into? I don't want us to do a release announcement saying we did all this stuff to make things efficient with results that seem much worse. It wouldn't be a good reflection on armeria. Probably there's something we are doing that ends up less efficient and slower than before. Since we were using a very late version of okhttp, they had a long time to find all the bugs. Armeria and especially our use of it is quite new. Probably some opportunity to figure out what's up.

codefromthecrypt commented 5 years ago

and btw in both cases we were using armeria server. the main thing we did was change the elasticsearch client to use armeria vs okhttp. It ended up a lot of work because we also switched things like json codecs from using moshi/okio which shares buffers with okhttp underneath, and also changed a lot of the async behavior. I don't think health check has anything to do with it, but if necessary I could run something with it turned off.

anuraaga commented 5 years ago

Going to take a look at this. Assuming the graphs are from the normal wrk command against brave-springmvc?

Also just wondering, do you startup both a prometheus and a grafana to get those graphs? Wish it was easier to get graphs when just running locally šŸ˜…

anuraaga commented 5 years ago

Looked a bit, haven't seen any OOM on my computer yet (probably not as limited memory as the container).

One finding is when running 2.15.0 with throttling disabled for comparison, meaning setting ES_MAX_REQUESTS=1000000000, I could reproduce all requests becoming timeouts right away, with a similar spike in latency (viewing /internal/metrics instead of using Grafana to avoid the steps which would be more precise, I may be misreading the numbers but don't think so).

Will try in the coming days with less extreme loads to see what happens.

anuraaga commented 5 years ago

Actually, did the opposite - mocking out elasticsearch with an Armeria mock server that just returns a fixed 200 response, okhttp is definitely much faster

okhttp

# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001",} 67.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001048576",} 68.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001398101",} 74.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001747626",} 79.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002097151",} 86.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002446676",} 89.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002796201",} 90.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003145726",} 95.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003495251",} 95.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003844776",} 96.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.004194304",} 96.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.005592405",} 100.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.006990506",} 103.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.008388607",} 104.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.009786708",} 104.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.011184809",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.01258291",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.013981011",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.015379112",} 105.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.016777216",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.022369621",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.027962026",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.033554431",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.039146836",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.044739241",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.050331646",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.055924051",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.061516456",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.067108864",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.089478485",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.111848106",} 106.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.134217727",} 107.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.156587348",} 107.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.178956969",} 107.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.20132659",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.223696211",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.246065832",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.268435456",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.357913941",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.447392426",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.536870911",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.626349396",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.715827881",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.805306366",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.894784851",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.984263336",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.073741824",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.431655765",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.789569706",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.147483647",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.505397588",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.863311529",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.22122547",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.579139411",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.937053352",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="4.294967296",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="5.726623061",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="7.158278826",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="8.589934591",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="10.021590356",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="11.453246121",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="12.884901886",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="14.316557651",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="15.748213416",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="17.179869184",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="22.906492245",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="28.633115306",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="30.0",} 108.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="+Inf",} 108.0
http_server_requests_seconds_count{method="POST",status="202",uri="/api/v2/spans",} 108.0
http_server_requests_seconds_sum{method="POST",status="202",uri="/api/v2/spans",} 0.4687505

armeria

http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001048576",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001398101",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.001747626",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002097151",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002446676",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.002796201",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003145726",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003495251",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.003844776",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.004194304",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.005592405",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.006990506",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.008388607",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.009786708",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.011184809",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.01258291",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.013981011",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.015379112",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.016777216",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.022369621",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.027962026",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.033554431",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.039146836",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.044739241",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.050331646",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.055924051",} 0.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.061516456",} 1.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.067108864",} 1.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.089478485",} 2.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.111848106",} 4.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.134217727",} 5.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.156587348",} 5.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.178956969",} 5.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.20132659",} 9.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.223696211",} 13.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.246065832",} 21.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.268435456",} 36.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.357913941",} 90.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.447392426",} 99.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.536870911",} 100.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.626349396",} 100.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.715827881",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.805306366",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.894784851",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="0.984263336",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.073741824",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.431655765",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="1.789569706",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.147483647",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.505397588",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="2.863311529",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.22122547",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.579139411",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="3.937053352",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="4.294967296",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="5.726623061",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="7.158278826",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="8.589934591",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="10.021590356",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="11.453246121",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="12.884901886",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="14.316557651",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="15.748213416",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="17.179869184",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="22.906492245",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="28.633115306",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="30.0",} 101.0
http_server_requests_seconds_bucket{method="POST",status="202",uri="/api/v2/spans",le="+Inf",} 101.0
http_server_requests_seconds_count{method="POST",status="202",uri="/api/v2/spans",} 101.0
http_server_requests_seconds_sum{method="POST",status="202",uri="/api/v2/spans",} 29.0299599
trustin commented 5 years ago

Thanks @adriancole and @anuraaga. I will also take a look once I'm back from Taiwan mid next week.

anuraaga commented 5 years ago

Quick update from my investigation, it seems that handleAsync(..., ctx.contextAwareExecutor()) adds tens of milliseconds. changing back to handle made the numbers look much more sane.

https://github.com/openzipkin/zipkin/blob/master/zipkin-server/src/main/java/zipkin2/server/internal/ZipkinHttpCollector.java#L116

This does not make that much sense to me based on my expectations of how handleAsync would work, but perhaps we need to tweak this to run callbacks inline when on the event loop thread

https://github.com/line/armeria/blob/master/core/src/main/java/com/linecorp/armeria/common/RequestContextAwareEventLoop.java

Will verify self tracing before sending out a change to sync handle.

codefromthecrypt commented 5 years ago

ok thanks. this seems like great progress. also that's right about before we had a semaphore which would have dropped immediately.

wrt the callbacks, if we need to temporarily remove self-tracing that's not a big deal as few use it. since the cat is already out-of-the bag (we've already released eventhough not notified with notes etc), any low-hanging fruit to make sure people have a good experience is much more important than self-tracing.

On Sun, Aug 18, 2019 at 7:28 AM Anuraag Agrawal notifications@github.com wrote:

Quick update from my investigation, it seems that handleAsync(..., ctx.contextAwareExecutor()) adds tens of milliseconds. changing back to handle made the numbers look much more sane.

https://github.com/openzipkin/zipkin/blob/master/zipkin-server/src/main/java/zipkin2/server/internal/ZipkinHttpCollector.java#L116

This does not make that much sense to me based on my expectations of how handleAsync would work, but perhaps we need to tweak this to run callbacks inline when on the event loop thread

https://github.com/line/armeria/blob/master/core/src/main/java/com/linecorp/armeria/common/RequestContextAwareEventLoop.java

Will verify self tracing before sending out a change

ā€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openzipkin/zipkin/issues/2757?email_source=notifications&email_token=AAAPVVYLCQ7ZIIVONMPCHITQFCCQLA5CNFSM4IMPKAJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4QVIZY#issuecomment-522277991, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAPVV3AVWTLPNRK3PCE3BTQFCCQLANCNFSM4IMPKAJQ .

codefromthecrypt commented 5 years ago

Thanks for the progress, @anuraaga. cc also @trustin Here's the current master vs 2.15.0 on the same burst of 100s in the same docker etc everything. It is much much better than before, as it doesn't end up blocking clients for seconds which could really mess up things.

We should admit our current evolution is still significantly more heap usage, more and longer gcs, and an order magnitude higher p90 response time now vs 2.15 (okhttp based) for the same load. This doesn't mean we enter a pit of work, but I do think we need to at least acknowledge this.

Using zipkin 2.15.0

Screenshot 2019-08-19 at 10 54 27 AM Screenshot 2019-08-19 at 10 54 46 AM

Using the last commit (verified the actual commit ID from the /info endpoint) Screenshot 2019-08-19 at 10 02 00 AM Screenshot 2019-08-19 at 10 02 35 AM

TL;DR; we should probably continue this investigation, or not mention efficiency as a rationale for upgrading during release notes. There are other wins such as load balancing or platform coherency we can still mention, and I know many will be happy to have these features as they were directly requested.

codefromthecrypt commented 5 years ago

I created an issue to make troubleshooting easier though it isn't hard to start frontend/backend and smack it with wrk (agreed current approach doesn't help in cloud but we aren't using cloud here either) https://github.com/openzipkin/sleuth-webmvc-example/issues/27