If the error is fatal, then this is a problem!
If the error is non-fatal (i.e. just cluttering the logs), it is less of a problem. I think race conditions will always exist, and we should handle them gracefully rather than trying to solve them with orchestration. That said, a loud stack trace in the logs probably isn't graceful :)
The error is fatal. Once in this condition, if I go to (docker ip):8080 and hit find traces, nothing is returned, or I'll occasionally get an error "Endpoint zipkin-query is marked down". To get zipkin working again, I have to run "docker-compose run query" in a separate process, then zipkin-web will return traces. This situation continues through multiple restarts of docker-zipkin, unless I remove all related docker images from my system and run docker-compose up again, starting with freshly downloaded images. At that point, the query component will start and I can find new traces that get created after that, but for all subsequent runs of docker-compose after the initial, zipkin-query will fail to start, preventing zipkin-web from finding traces.
I encountered the same error today; does anyone have a solution?
This is related to how Finagle deals with repeated errors. It might be worth opening an issue on OpenZipkin/zipkin regarding the "zipkin-query" error.
Oh that's great.
E.g. there's some advice here which may need to be applied to that process.
Finagle isn't used in the zipkin-java project, so you wouldn't see that error there, although zipkin-java only supports MySQL at the moment.
Thanks adriancole.
I'm also getting this error when restarting. I've read through the comments here, but I'm still not sure I understand the root cause or the solution. Is there some way this can be handled in Docker Compose?
I'm starring this to follow through. Please nag me on gitter OpenZipkin/zipkin if there's no remedy by Monday
I am also experiencing this exact issue. Running "docker-compose run query" (as @gadams00 did) did not fix the issue for me. Doing a "docker restart {containerid}" starts the query server, but I see the following logs from the query server:
19:29:49.269 [cluster1-reconnection-0] ERROR c.d.driver.core.ControlConnection - [Control connection] Cannot connect to any host, scheduling retry in 1000 milliseconds
This looks like an error connecting to Cassandra, vs the other one, which came from Finagle. Same class of errors, I admit. I can't look at this until Monday, but maybe someone else can. Try pinging Gitter OpenZipkin/zipkin? cc @kristofa in case you know any quick fixes offhand.
PS: these destinations are passed by IP address and port. If restarting changes either, they wouldn't connect for that reason.
Here is the top of the stack trace that I see before that error; it does seem to be trying to connect via IP and port:
Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.17.0.51:9042 (com.datastax.driver.core.TransportException: [/172.17.0.51:9042] Cannot connect))
query_1 | at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:240)
query_1 | at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:86)
query_1 | at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1429)
query_1 | at com.datastax.driver.core.Cluster.init(Cluster.java:162)
query_1 | at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:341)
query_1 | at com.datastax.driver.core.Cluster.connectAsync(Cluster.java:314)
query_1 | at com.datastax.driver.core.Cluster.connect(Cluster.java:252)
query_1 | at org.twitter.zipkin.storage.cassandra.Repository$Schema.ensureExists(Repository.java:919)
query_1 | at org.twitter.zipkin.storage.cassandra.Repository.<init>(Repository.java:111)
So it would be interesting to see if you are able to connect to that port when this occurs, or if Cassandra is listening elsewhere. If the latter, then it is a more general concern, which is to not use a statically configured IP:port.
E.g. the Docker setup in the base image includes dot files for assigning the destinations for services. These could use host resolution or something less brittle instead.
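To illustrate the "less brittle" idea, a startup script could resolve the storage host by name instead of relying on a baked-in IP address. This is just a sketch: the service name "cassandra", the CASSANDRA_HOST variable, and the availability of getent in the image are all assumptions for illustration.
# sketch: resolve the storage host by name at container start instead of a fixed IP
CASSANDRA_HOST="${CASSANDRA_HOST:-cassandra}"   # e.g. the compose service/link name
CASSANDRA_ADDR="$(getent hosts "$CASSANDRA_HOST" | awk '{print $1}')"
echo "Cassandra resolved to ${CASSANDRA_ADDR:-<unresolved>} via name $CASSANDRA_HOST"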
One last thing that I hope can get us to the bottom of this. Can you verify whether, when you get an error like this, you can or cannot hit the socket directly? E.g. telnet 172.17.0.51 9042
Exception in thread "main" com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /172.17.0.51:9042 (com.datastax.driver.core.TransportException: [/172.17.0.51:9042] Cannot connect))
I'm going to spend some time on this now, at least to see if I can reproduce it.
OK, playing around: when zipkin-query starts before Cassandra is listening, the process dies.
OK, there are possibly two issues here: one between zipkin-web -> zipkin-query, and another between zipkin-query -> cassandra.
I can't reproduce the zipkin-web -> zipkin-query one ("Endpoint zipkin-query is marked down"). I've tried various failures, and I think it is probably best not to sink too much time into this, since zipkin-web is moving off Finagle in favor of pure JavaScript.
I can reproduce zipkin-query -> cassandra, and opened up an issue here: https://github.com/openzipkin/zipkin/issues/1007
@adriancole I am able to telnet to the Cassandra instance after the zipkin-query container dies. I can get the query instance back up by running docker-compose run query; however, the UI says it cannot connect to zipkin-query, and I am not able to query data.
To update on this: I was able to get everything working again by running the following commands after zipkin starts up and the query container dies.
docker-compose restart query
docker-compose restart web
Restarting query allows it to talk to Cassandra, and restarting web fixes the issue where web cannot talk to the query container.
It seems like this is just a startup order problem.
What if we added a runit config (like we do with Kafka)? That would restart the process if it died for any reason, including the race condition on Cassandra.
https://github.com/openzipkin/docker-zipkin/blob/master/kafka/install.sh#L19
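For anyone unfamiliar with runit: a run script is just an exec'ing shell script that runit re-runs whenever the supervised process exits, so a crash on the Cassandra race would simply trigger another start attempt. A minimal sketch for the query container might look like the following (the service path and launch command are placeholders, not necessarily what the Kafka image does):
#!/bin/sh
# hypothetical /etc/service/query/run -- runit restarts this whenever the process exits
exec 2>&1
cd /zipkin           # assumed install directory
exec /zipkin/run.sh  # placeholder for the existing query launch command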
Another solution I found was to rebuild the query and web images with a 10 second delay before starting the java process (in run.sh). Ugly, but seems to work.
One thing we chatted about on gitter was using runit to watch the process (like we do in the Kafka image)
This seems very repeatable with recent versions of Docker.
As I see it, the problem is that the cassandra container is seen as "up" once this happens:
cassandra | INFO 11:17:56 Node /172.18.0.2 state jump to normal
cassandra | INFO 11:17:56 Waiting for gossip to settle before accepting client requests...
but, cassandra is not ready for connections until this is logged:
cassandra | INFO 11:18:04 No gossip backlog; proceeding
cassandra | INFO 11:18:04 Netty using native Epoll event loop
cassandra | INFO 11:18:04 Using Netty Version: [netty-buffer=netty-buffer-4.0.23.Final.208198c, netty-codec=netty-codec-4.0.23.Final.208198c, netty-codec-http=netty-codec-http-4.0.23.Final.208198c, netty-codec-socks=netty-codec-socks-4.0.23.Final.208198c, netty-common=netty-common-4.0.23.Final.208198c, netty-handler=netty-handler-4.0.23.Final.208198c, netty-transport=netty-transport-4.0.23.Final.208198c, netty-transport-rxtx=netty-transport-rxtx-4.0.23.Final.208198c, netty-transport-sctp=netty-transport-sctp-4.0.23.Final.208198c, netty-transport-udt=netty-transport-udt-4.0.23.Final.208198c]
cassandra | INFO 11:18:04 Starting listening for CQL clients on /0.0.0.0:9042...
cassandra | INFO 11:18:04 Binding thrift service to /0.0.0.0:9160
cassandra | INFO 11:18:04 Listening for thrift clients...
Which happens after the query container fails to connect (obviously)
This can be handled from the Docker side of things. See https://github.com/docker/compose/issues/374 for an infinitely long discussion, started years ago and still going, on how. One way could be adding a wait_for_storage bash function to each storage profile and calling it from query/run.sh. For Cassandra, it would look something like this:
tries=60
host=???   # the Cassandra host the query container already connects to
port=???   # usually 9042 for CQL
echo "Waiting at most $tries seconds for Cassandra to start accepting CQL clients on port $port..."
while [ $tries -ge 0 ] && ! nc -z "$host" "$port"; do
  echo "Waiting $tries more seconds for port $host:$port to open up..."
  tries=$(expr $tries - 1)
  sleep 1
done
if [ $tries -eq -1 ]; then
  echo "$host:$port is still not open, bailing out"
  exit 1
fi
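For completeness, here is roughly how that loop could be wrapped into the wait_for_storage function mentioned above and called from query/run.sh; the CASSANDRA_HOST variable is just a stand-in for wherever the query container already gets its Cassandra address:
# hypothetical wait_for_storage for the cassandra storage profile
wait_for_storage() {
  host="$1"; port="$2"; tries=60
  while [ $tries -ge 0 ] && ! nc -z "$host" "$port"; do
    echo "Waiting $tries more seconds for $host:$port to open up..."
    tries=$(expr $tries - 1)
    sleep 1
  done
  [ $tries -ge 0 ]   # fails if we ran out of retries
}
# in query/run.sh, before starting the java process:
wait_for_storage "$CASSANDRA_HOST" 9042 || { echo "Cassandra never came up, bailing out"; exit 1; }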
True. Sorry I have been criminally negligent on this one. I will fix the root issue now.
The root problem was really silly: this crash was only present when tracing bootstrap! The change below makes self-tracing bootstrap failures log instead of crash.
Fixed in 1.39.2.
Heh. Thanks for the fix. (I wouldn't say you were criminally negligent though)
Running docker-zipkin the first time works for me, but if I stop via CTRL-C and then run docker-compose up again, there seems to be a race condition between the zipkin-query and cassandra containers. I'm running:
greg@greg-elitebook ~/git/docker-zipkin $ docker-compose --version
docker-compose version: 1.5.1
greg@greg-elitebook ~/git/docker-zipkin $ docker --version
Docker version 1.9.1, build a34a1d5