openzipkin-attic / docker-zipkin

Docker images for OpenZipkin
Apache License 2.0
687 stars 329 forks

Adds AWS Elasticsearch Service image so that it is easier to troubleshoot #222

Closed codefromthecrypt closed 5 years ago

codefromthecrypt commented 5 years ago

Note: a small instance will fall over after about 30 seconds of load :)

Though a small instance seems to stay alive against brave-webmvc-example:

wrk -t4 -c64 -d20s http://localhost:8081 --latency

When ES fails, it doesn't fail nicely. It often results in errors like the following instead of an HTTP response.

These warnings will fill the logs at the rate of traffic, so I've asked Armeria what we should do about it:

zipkin                      | 2019-08-17 10:35:32.840  WARN 1 --- [orker-epoll-2-4] c.l.a.c.HttpResponseDecoder              : Unexpected exception:
zipkin                      |
zipkin                      | com.linecorp.armeria.client.ResponseTimeoutException: null

Also, there seems to be a glitch somewhere: when I turn on throttling in an attempt to mitigate the above, it seems to amplify failures somehow, with more drop counts than actual spans. https://github.com/openzipkin/zipkin/issues/2755

Anyway, I think this isn't worse than before, but there's still a good bit of work to make it easy to use Elasticsearch even when it is anemic.

cc @anuraaga @devinsba @llinder @jcarres-mdsol @Logic-32

codefromthecrypt commented 5 years ago

PS: all these notes are about how we handle a very overloaded AWS Elasticsearch that is getting a spike of HTTP requests. I don't want to be alarmist, as I'm aware a lot of our AWS users either don't have a lot of traffic or use SQS instead of HTTP.

anuraaga commented 5 years ago

Just to confirm the setup: were you running Zipkin locally and hitting AWS? I wonder how many MB/s you were uploading. I remember sending at least 5 MB/s on a cafe connection, which quickly led to timeouts. It might have been ES clunking out, but I also figured it was just saturation of the network.

If that's the case, at some point it may be interesting to compare against running the load test in AWS itself. We can probably wrap it all up in a Terraform config that starts everything and then destroys it. I can look into this.

codefromthecrypt commented 5 years ago

I guess you mean to ask if my local network is saturated? Maybe? But it does survive 20s of the same amount of load. Anyway, I agree that bundling the webmvc example as a Docker container, so that you can use ECS or something to keep everything network-local, makes sense.

OTOH there's still a good question to ask ourselves, which is how we want to handle any occurrence like this: should we rate-limit that log? And what about the linked issues?

What I mean is that it would probably be easy to overload even a local Elasticsearch, eliminating the AWS part. In that case we would still have to consider how, or whether, to handle it. It is just that this setup is one that overloads with very little work.

anuraaga commented 5 years ago

Yeah, it makes sense to rate-limit the logs. I haven't used it before, but I think we could apply BurstFilter either globally (if we wanted to establish a global default of not logging too much) or only to specific spammy loggers:

https://logging.apache.org/log4j/2.0/manual/filters.html

anuraaga commented 5 years ago

Actually, scratch the global option, since I guess it treats all logs equally; it probably makes more sense to apply BurstFilter to individual loggers if we need to.
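
For reference, a minimal sketch of what the per-logger approach might look like in a log4j2 config. The logger name, appender name, and rate/burst numbers here are guesses for illustration, not a tested setup:

<!-- hypothetical log4j2.xml fragment: rate-limit only the noisy Armeria logger -->
<Loggers>
  <Logger name="com.linecorp.armeria.client.HttpResponseDecoder" level="warn" additivity="false">
    <!-- allow a burst of up to 100 messages, then at most ~10 per second; WARN and below are throttled -->
    <BurstFilter level="WARN" rate="10" maxBurst="100"/>
    <!-- appender name assumed; must match an appender defined elsewhere in the config -->
    <AppenderRef ref="CONSOLE"/>
  </Logger>
</Loggers>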

codefromthecrypt commented 5 years ago

Actually, it is easier to get the timeouts: just use Elasticsearch normally.

docker-compose -f docker-compose.yml -f docker-compose-elasticsearch.yml up

Then, after about 20 seconds of load, it will start falling over.

codefromthecrypt commented 5 years ago

Also, the error is pretty useless: it has no context except the thread name, and only if you are logging that. I really don't think this is a good use of the WARN level, and for us it is probably better to just disable it if this won't be addressed otherwise.

zipkin                      | 2019-08-17 12:16:38.308  WARN 1 --- [orker-epoll-2-4] c.l.a.c.HttpResponseDecoder              : Unexpected exception:
zipkin                      | 
zipkin                      | com.linecorp.armeria.client.ResponseTimeoutException: null
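
If we end up just silencing it, a minimal sketch of one way to do that, assuming the stock Spring Boot logging overrides apply to the zipkin image, is an environment override in docker-compose. Note this would also hide other WARNs from the armeria client package:

# hypothetical docker-compose override; relies on Spring Boot's relaxed binding
# of LOGGING_LEVEL_* environment variables being honored by the zipkin server
services:
  zipkin:
    environment:
      - LOGGING_LEVEL_COM_LINECORP_ARMERIA_CLIENT=ERROR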

codefromthecrypt commented 5 years ago

I think most of my comments have nothing to do with AWS. Maybe it's better to move this to another ES-related issue rather than spamming this one, like I often ask others not to do :P

Logic-32 commented 5 years ago

FWIW, has anyone looked at the Prometheus/Graphite metrics coming out of the throttling store to make sure it was actually throttling? At least at implementation time, timeouts weren't RejectedExecutionExceptions and were likely ignored, so you were probably still trying to report at full steam.

Do we know who configures the timeout for reporting spans to ES? It's hard to tell from the log message you got, but I'm assuming it is client-side? For reference, do we know how much time needs to elapse for that to trigger?
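
A quick way to eyeball those counters, assuming the server's Prometheus-format endpoint is enabled on the default port, is something like the following; the grep pattern is only a guess at the throttling/drop metric names:

# hypothetical check: dump Zipkin's Prometheus-format metrics and filter for throttle/drop counters
curl -s http://localhost:9411/prometheus | grep -iE 'throttl|drop'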

codefromthecrypt commented 5 years ago

@Logic-32 when I turn on throttling it does reduce the log lines, which is the most obvious sign of it working. I did notice error counters incremented more than accepts, though that doesn't mean I looked at the other throttling metrics manually. I think we need another dashboard dedicated to ES, as we have so many problems with it.

I don't know how people make modular Grafana charts, but I think we should. cc @openzipkin/devops-tooling because there are also some other stats about the ES client which we could show; but since not everyone uses ES or throttling, and our dashboard already has a ton of things on it, it seems like something we would want to be able to choose more modularly. Good point on looking at things.

codefromthecrypt commented 5 years ago

Anyway, I'm going to merge the simple part of this, which is a working docker-compose setup!

Investigations can ensue more easily now.