openzipkin / zipkin-support

repository for support questions raised as issues

Cassandra NoHostAvailableException connection time out #5

Open alseddnm opened 6 years ago

alseddnm commented 6 years ago

We are using Mesos/Marathon to manage our Docker containers. Zipkin ran fine for 15 minutes or less, then the health check started failing. We found a bunch of errors in our service log:

cannot load service names: Request processing failed; nested exception is com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /xxxxx:9042 (com.datastax.driver.core.exceptions.TransportException: [/1xxxx:9042] Connection has been closed),/(com.datastax.driver.core.exceptions.TransportException: [xyz/10.124.8.97:9042] Connection has been closed))

at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84) ~[cassandra-driver-core-3.5.0-shaded.jar!/:?]
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37) ~[cassandra-driver-core-3.5.0-shaded.jar!/:?]
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37) ~[cassandra-driver-core-3.5.0-shaded.jar!/:?]
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3.5.0-shaded.jar!/:?]
at zipkin2.storage.cassandra.internal.call.ResultSetFutureCall.getUninterruptibly(ResultSetFutureCall.java:74) ~[zipkin-storage-cassandra-2.9.1.jar!/:?]
at zipkin2.storage.cassandra.internal.call.ResultSetFutureCall.getUninterruptibly(ResultSetFutureCall.java:74) ~[zipkin-storage-cassandra-2.9.1.jar!/:?]
at zipkin2.storage.cassandra.internal.call.ResultSetFutureCall$1CallbackListener.run(ResultSetFutureCall.java:50) [zipkin-storage-cassandra-2.9.1.jar!/:?]
at zipkin2.storage.cassandra.internal.call.DirectExecutor.execute(DirectExecutor.java:23) [zipkin-storage-cassandra-2.9.1.jar!/:?]

I thought it better to open an issue; we are investigating on our side as well.

I also noticed that Zipkin's Cassandra storage uses SASI indexes. Per the DataStax docs, SASI indexes in DSE are experimental, and DataStax does not support SASI indexes for production: https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useSASIIndex.html
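For anyone hitting a similar `NoHostAvailableException` with "Connection has been closed", one quick thing to rule out is plain network reachability of the Cassandra native port from the Zipkin container. The following is only a minimal sketch: `can_connect` is a hypothetical helper (not part of Zipkin or the DataStax driver), and port 9042 is taken from the error message above. A successful TCP connect does not prove Cassandra is healthy, only that the port accepts connections.

```python
import socket


def can_connect(host: str, port: int = 9042, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it checks reachability only and
    does not speak the Cassandra native protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `can_connect(host)` for each configured contact point, from inside the container that runs Zipkin, shows whether the failure is at the network level or further up in Cassandra itself.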

codefromthecrypt commented 6 years ago

How many service names do you have? How many spans do you have per day?

This might help someone answer.

alseddnm commented 6 years ago

@adriancole Just saw your message. We are not even able to access our Cassandra nodes this morning. I see a bunch of errors in the Cassandra logs:

Can't open index file at /cassandra/data/zipkin2/span-15bb5b006e7111e8a8d2af46ca93ec1b/mc-2673-big-SI_span_l_service_idx.db, skipping.
org.apache.cassandra.io.FSReadError: java.io.EOFException
at org.apache.cassandra.index.sasi.disk.OnDiskIndex.<init>(OnDiskIndex.java:164) ~[apache-cassandra-3.9.0.jar:3.9.0]
at org.apache.cassandra.index.sasi.SSTableIndex.<init>(SSTableIndex.java:68) ~[apache-cassandra-

ERROR [Reference-Reaper:1] 2018-06-14 14:46:11,044 Ref.java:224 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74bce35c) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@2054886331:/cassandra/data/zipkin2/span-15bb5b006e7111e8a8d2af46ca93ec1b/mc-2673-big was not released before the reference was garbage collected
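When there are many such errors, it can help to list which index files Cassandra reports as unreadable. Below is a rough sketch that scans log text for the "Can't open index file at ..., skipping" message quoted above. The function name `unreadable_index_files` is hypothetical, and the pattern is based only on the single log line shown here, so other Cassandra versions may word it differently.

```python
import re
from typing import List

# Pattern derived from the one log line quoted in this thread (an assumption).
PATTERN = re.compile(r"Can't open index file at (\S+), skipping")


def unreadable_index_files(log_text: str) -> List[str]:
    """Return the distinct index file paths reported as unreadable, in order."""
    seen: List[str] = []
    for path in PATTERN.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen
```

Passing the contents of the Cassandra system log to `unreadable_index_files` gives a deduplicated list of affected files, which is easier to include in a report than the raw log.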

alseddnm commented 6 years ago

As of now we don't have many service names: I see only 292 distinct services in the table, and the total record count is 404118.

select count(*) from span_by_service;

count

404118

codefromthecrypt commented 6 years ago

I'm no expert, but it looks like you are using Cassandra 3.9, which is likely not to work well. We use 3.11, and in fact 3.11.3 will give the best results when it is released (which should be shortly).
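Note that 3.9 is older than 3.11 even though it looks larger as a decimal: version components compare as integers, not as a fraction. A small sketch of that comparison follows, with the version pulled out of a jar name like the one in the log above. `parse_version` and `is_recent_enough` are hypothetical helpers, and the `(3, 11)` minimum simply reflects the version mentioned in this comment.

```python
import re
from typing import Tuple

MIN_VERSION: Tuple[int, ...] = (3, 11)  # the version mentioned in this thread


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first dotted numeric version (e.g. '3.9.0') as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_recent_enough(text: str, minimum: Tuple[int, ...] = MIN_VERSION) -> bool:
    """Compare component-wise, so (3, 9, 0) sorts before (3, 11)."""
    return parse_version(text) >= minimum
```

With this, `is_recent_enough("apache-cassandra-3.9.0.jar")` is false, while `is_recent_enough("3.11.3")` is true.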

shakuzen commented 6 years ago

@alseddnm are you able to try with more recent versions to see if things work better?

ukreddy-erwin commented 4 years ago

Same issue, even with the latest official Cassandra Docker image.

codefromthecrypt commented 4 years ago

Protip: adding comments to old issues about a troubleshooting scenario usually doesn't lead to an outcome. Try asking on https://gitter.im/openzipkin/zipkin, or include the actual error message and especially what does work, for example whether the /health endpoint works (if it doesn't, that is a more fundamental problem).
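As a rough illustration of the /health check suggested above, here is a minimal sketch using only the Python standard library. `fetch_health` is a hypothetical helper; the default base URL assumes Zipkin listening on its usual port 9411 on localhost, and the JSON parsing assumes the endpoint returns a JSON body. Adjust both for your deployment.

```python
import json
import urllib.error
import urllib.request


def fetch_health(base_url: str = "http://localhost:9411", timeout: float = 5.0):
    """Return (http_status, body) from the server's /health endpoint.

    Connection-level failures (urllib.error.URLError) are left to propagate,
    since "cannot connect at all" is itself useful information.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a body worth including in a report.
        status, raw = err.code, err.read()
    try:
        body = json.loads(raw)
    except ValueError:
        # Fall back to plain text if the body is not valid JSON.
        body = raw.decode("utf-8", errors="replace")
    return status, body
```

Including the returned status code and body alongside the actual error message makes it much easier for someone to tell whether the problem is in storage or somewhere more fundamental.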