pyr / cyanite

cyanite stores your metrics
http://cyanite.io

Hard cyanite recovery after failure #285

Open dancb10 opened 6 years ago

dancb10 commented 6 years ago

We stress-tested our Cyanite infrastructure to find the point at which it fails. Once we stop the test, Cyanite appears to stay blocked, and only a restart gets it running again. We split the Cyanite nodes into read and write machines and observed the following:

WRITE nodes: Even with 12 GB of heap, Cyanite sometimes throws out-of-memory errors. I'm not sure whether this is a leak:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.reflect.Method.copy(Method.java:147)
    at java.lang.reflect.ReflectAccess.copyMethod(ReflectAccess.java:140)
    at sun.reflect.ReflectionFactory.copyMethod(ReflectionFactory.java:302)
    at java.lang.Class.searchMethods(Class.java:3005)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3040)
    at java.lang.Class.getMethod0(Class.java:3010)
    at java.lang.Class.getMethod(Class.java:1776)
    at clojure.lang.Reflector.getMethods(Reflector.java:385)
    at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:27)
    at io.cyanite.engine.MetricResolution.ingest_BANG_(engine.clj:54)
    at io.cyanite.engine.Engine.ingest_BANG_(engine.clj:105)
    at io.cyanite.engine$fn__12854$G__12850__12857.invoke(engine.clj:19)
    at io.cyanite.engine$fn__12854$G__12849__12861.invoke(engine.clj:19)
    at clojure.core$partial$fn__6855.invoke(core.clj:2597)
    at io.cyanite.engine.queue.EngineQueue$fn__12124$fn__12125.invoke(queue.clj:57)
    at io.cyanite.engine.queue.EngineQueue$fn__12124.invoke(queue.clj:53)
    at clojure.lang.AFn.call(AFn.java:18)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

READ nodes: There were many read timeouts on the Cyanite instances:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
    at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:88)
    at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:25)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
    at qbits.alia$execute.invokeStatic(alia.clj:384)
    at qbits.alia$execute.invoke(alia.clj:326)
    at qbits.alia$execute.invokeStatic(alia.clj:392)
    at io.cyanite.index.cassandra$native_sasi_index.invokeStatic(cassandra.clj:79)
    at io.cyanite.index.cassandra$native_sasi_index.invoke(cassandra.clj:62)
    at io.cyanite.index.cassandra$load_prefixes_fn.invokeStatic(cassandra.clj:100)
    at io.cyanite.index.cassandra.CassandraIndex$reify__16230.load(cassandra.clj:129)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:3070)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache$$Lambda$4/1846944624.apply(Unknown Source)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:1895)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$5/1816062018.apply(Unknown Source)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1893)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1876)
    at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
    at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:67)
    at io.cyanite.index.cassandra.CassandraIndex.prefixes(cassandra.clj:152)
    at io.cyanite.api$fn__15748.invokeStatic(api.clj:119)
    at io.cyanite.api$fn__15748.invoke(api.clj:109)
    at clojure.lang.MultiFn.invoke(MultiFn.java:229)
    at io.cyanite.api$process.invokeStatic(api.clj:89)
    at io.cyanite.api$make_handler$fn__15791.invoke(api.clj:162)
    at io.cyanite.http$request_handler$fn__15485.invoke(http.clj:110)
    at io.cyanite.http$netty_handler$fn__15493.invoke(http.clj:125)
    at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:435)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:1018)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:299)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
    at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:115)
    at com.datastax.driver.core.Responses$Error.asException(Responses.java:124)
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:506)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1070)
    at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:993)
    at com.datastax.shaded.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at com.datastax.shaded.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at com.datastax.shaded.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
    at com.datastax.shaded.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:321)
    at com.datastax.shaded.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1280)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:342)
    at com.datastax.shaded.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:328)
    at com.datastax.shaded.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:890)
    at com.datastax.shaded.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:505)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:419)
    at com.datastax.shaded.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:391)
    at com.datastax.shaded.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
    at com.datastax.shaded.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
    ... 1 common frames omitted
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:62)
    at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:37)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
    at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
    at com.datastax.shaded.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    ... 20 common frames omitted
ERROR [2017-12-11 11:06:13,589] epollEventLoopGroup-3-1 - io.cyanite.api could not process request
com.datastax.driver.core.exceptions.OperationTimedOutException: [va6-qe-pcs-pcs3gw-3/172.27.39.225:9042] Timed out waiting for server response
    at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
    at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.prepareNextRow(ArrayBackedResultSet.java:313)
    at com.datastax.driver.core.ArrayBackedResultSet$MultiPage.one(ArrayBackedResultSet.java:275)
    at qbits.alia.codec$lazy_result_set_.invokeStatic(codec.clj:27)
    at qbits.alia.codec$lazy_result_set_.invoke(codec.clj:25)
    at qbits.alia.codec$lazy_result_set_$fn__13109.invoke(codec.clj:28)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$map$fn__6881.invoke(core.clj:2719)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$map$fn__6881.invoke(core.clj:2719)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$map$fn__6881.invoke(core.clj:2719)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$map$fn__6881.invoke(core.clj:2719)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.Cons.next(Cons.java:39)
    at clojure.lang.RT.next(RT.java:703)
    at clojure.core$next__6406.invokeStatic(core.clj:64)
    at clojure.core$concat$cat__6515$fn__6516.invoke(core.clj:734)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$map$fn__6881.invoke(core.clj:2719)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.Cons.next(Cons.java:39)
    at clojure.lang.RT.next(RT.java:703)
    at clojure.core$next__6406.invokeStatic(core.clj:64)
    at clojure.core$concat$cat__6515$fn__6516.invoke(core.clj:734)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:56)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$filter$fn__6908.invoke(core.clj:2782)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.RT.seq(RT.java:525)
    at clojure.core$seq__6422.invokeStatic(core.clj:137)
    at clojure.core$map$fn__6881.invoke(core.clj:2719)
    at clojure.lang.LazySeq.sval(LazySeq.java:40)
    at clojure.lang.LazySeq.seq(LazySeq.java:49)
    at clojure.lang.Cons.next(Cons.java:39)
    at clojure.lang.RT.next(RT.java:703)
    at clojure.core$next__6406.invokeStatic(core.clj:64)
    at clojure.core$reduce1.invokeStatic(core.clj:936)
    at clojure.core$set.invokeStatic(core.clj:4065)
    at globber.glob$filter_compound_ast.invokeStatic(glob.clj:313)
    at globber.glob$glob.invokeStatic(glob.clj:355)
    at io.cyanite.index.cassandra$load_prefixes_fn.invokeStatic(cassandra.clj:100)
    at io.cyanite.index.cassandra.CassandraIndex$reify__16230.load(cassandra.clj:129)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:3070)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache$$Lambda$4/1846944624.apply(Unknown Source)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:1895)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$5/1816062018.apply(Unknown Source)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1893)
    at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1876)
    at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
    at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:67)
    at io.cyanite.index.cassandra.CassandraIndex.prefixes(cassandra.clj:152)
    at io.cyanite.api$fn__15748.invokeStatic(api.clj:119)
    at io.cyanite.api$fn__15748.invoke(api.clj:109)
    at clojure.lang.MultiFn.invoke(MultiFn.java:229)
    at io.cyanite.api$process.invokeStatic(api.clj:89)
    at io.cyanite.api$make_handler$fn__15791.invoke(api.clj:162)
    at io.cyanite.http$request_handler$fn__15485.invoke(http.clj:110)
    at io.cyanite.http$netty_handler$fn__15493.invoke(http.clj:125)
    at io.cyanite.http.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:435)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:250)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:1018)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:394)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:299)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [va6-qe-pcs-pcs3gw-3/172.27.39.225:9042] Timed out waiting for server response
    at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:772)
    at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1374)
    at com.datastax.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
    at com.datastax.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:655)
    at com.datastax.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
    at com.datastax.shaded.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:145)
    ... 1 common frames omitted

CASSANDRA nodes: We've seen index search issues:

org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: null
        at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:158) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Expression.checkpoint(Expression.java:320) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.disk.OnDiskIndex.searchPoint(OnDiskIndex.java:392) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.disk.OnDiskIndex.searchRange(OnDiskIndex.java:296) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.disk.OnDiskIndex.search(OnDiskIndex.java:254) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SSTableIndex.search(SSTableIndex.java:103) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.TermIterator.lambda$build$0(TermIterator.java:130) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.TermIterator$$Lambda$316/1683221136.run(Unknown Source) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_45]
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299) [guava-18.0.jar:na]
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112) [na:1.8.0_45]
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:50) [guava-18.0.jar:na]
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:37) [guava-18.0.jar:na]
        at org.apache.cassandra.index.sasi.TermIterator.build(TermIterator.java:125) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryController.getIndexes(QueryController.java:145) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Operation$Builder.complete(Operation.java:433) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.analyze(QueryPlan.java:57) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.execute(QueryPlan.java:68) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex.lambda$searcherFor$2(SASIIndex.java:290) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex$$Lambda$297/2047499964.search(Unknown Source) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:418) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2594) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_45]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
DEBUG [Native-Transport-Requests-1] 2017-12-11 11:07:31,022 ReadCallback.java:132 - Timed out; received 0 of 1 responses
INFO  [Service Thread] 2017-12-11 11:07:31,023 StatusLogger.java:47 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
WARN  [ReadStage-2] 2017-12-11 11:07:31,025 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[ReadStage-2,5,main]: {}
java.lang.RuntimeException: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2598) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_45]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.11.1.jar:3.11.1]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
Caused by: org.apache.cassandra.index.sasi.exceptions.TimeQuotaExceededException: null
        at org.apache.cassandra.index.sasi.plan.QueryController.checkpoint(QueryController.java:158) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Expression.checkpoint(Expression.java:320) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.TermIterator.build(TermIterator.java:157) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryController.getIndexes(QueryController.java:145) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.Operation$Builder.complete(Operation.java:433) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.analyze(QueryPlan.java:57) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.plan.QueryPlan.execute(QueryPlan.java:68) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex.lambda$searcherFor$2(SASIIndex.java:290) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.index.sasi.SASIIndex$$Lambda$297/2047499964.search(Unknown Source) ~[na:na]
        at org.apache.cassandra.db.ReadCommand.executeLocally(ReadCommand.java:418) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1884) ~[apache-cassandra-3.11.1.jar:3.11.1]
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2594) ~[apache-cassandra-3.11.1.jar:3.11.1]
        ... 5 common frames omitted

Besides this, Cyanite appears to lose requests under load. The load test ran at 58680 metrics per second against 4 Cyanite nodes (2 for reads, 2 for writes), each with 8 cores and 16 GB of RAM, and a three-node Cassandra cluster (8 cores, 64 GB of RAM).

We ran multiple tests at different rates and saw a pattern in how Cyanite behaves after it fails. At 29340 requests/s everything was OK and it performed well. Since the load on our instances was really low, we tried 58680 requests/s, which made Cyanite fail. We then stopped the load test and tried again at 29340 requests/s, but Cyanite never recovered and this test failed as well. So I'm not sure whether there is a queue limit, a write bottleneck, or a bug in the Clojure code. Unfortunately I don't know Clojure, which makes debugging hard. The only way we can get Cyanite working again is to restart the process on all instances.

Cassandra doesn't seem to be the problem here: it handled the load well, and it accepted data again without any problems once Cyanite became healthy. Note that in all the load tests the machines had enough resources left and the load was not heavy.

Do you have any numbers or "best practices" for Cyanite usage and hardware specs? We don't know whether scaling horizontally or vertically can fix our problems or whether there's a bug.
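To make the "queue limit" question more concrete, here is a toy model of a single ingest queue. It is not Cyanite's code and it does not explain why recovery never happened; it only sketches how a backlog built up during overload can outlast the overload when the queue is unbounded, and how a bounded queue trades that backlog for dropped items. The two rates come from the tests above. The `simulate` helper, the `CAPACITY` of 40000 items/s, and the `max_queue` of 100000 are assumptions picked purely for illustration.

```python
def simulate(arrivals, capacity, max_queue=None):
    """Toy single-queue model, stepped once per second.

    arrivals  -- items arriving in each one-second step
    capacity  -- items the consumer can process per second (hypothetical)
    max_queue -- None for an unbounded queue, else the backlog limit
    Returns (backlog after each step, total items dropped).
    """
    backlog = 0
    dropped = 0
    history = []
    for arriving in arrivals:
        backlog += arriving
        # The consumer drains up to `capacity` items per step.
        backlog -= min(backlog, capacity)
        # A bounded queue sheds whatever does not fit.
        if max_queue is not None and backlog > max_queue:
            dropped += backlog - max_queue
            backlog = max_queue
        history.append(backlog)
    return history, dropped


# Rates taken from the report above.
LOW, HIGH = 29340, 58680
# Assumption: capacity sits between the rate that worked and the rate that failed.
CAPACITY = 40000

# 60 s at the low rate, 60 s at the high rate, then 60 s at the low rate again.
load = [LOW] * 60 + [HIGH] * 60 + [LOW] * 60

unbounded, _ = simulate(load, CAPACITY)
bounded, dropped = simulate(load, CAPACITY, max_queue=100000)

print("unbounded backlog at end of overload:", unbounded[119])
print("unbounded backlog 60 s after overload:", unbounded[-1])
print("bounded backlog 60 s after overload:", bounded[-1], "dropped:", dropped)
```

Under these assumed numbers the unbounded queue still holds a large backlog a minute after the load drops back to 29340/s, while the bounded queue is empty again but has dropped data along the way. Whether Cyanite's ingest queue behaves like either variant is exactly what we don't know.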