pyr / cyanite

cyanite stores your metrics
http://cyanite.io

Queue full error when trying to send data to Cyanite #284

Open dancb10 opened 6 years ago

dancb10 commented 6 years ago

Hello, we are trying Cyanite for the first time in our testing environment, with two Cyanite + Graphite API instances (16G RAM, 8 cores, SSD) and a three-node Cassandra cluster (64G RAM, 8 cores, SSD), but it seems it cannot handle the load. We see a lot of "Queue full" errors in the Cyanite log as soon as we start our load test. We are using the graphite-stresser tool to load test the environment:

java8 -jar build/libs/graphite-stresser-0.1.jar loadbalancer 2003 100 975 10 true

We have set the heap size of Cyanite to 12G

java -Xms12g -Xmx12g -jar cyanite-0.5.1-standalone-fix.jar --path cyanite.yaml

The cyanite.yaml file also contains the following ingest and write queue settings:

queues:
  defaults:
    ingestq:
      pool-size: 100
      queue-capacity: 2000000
    writeq:
      pool-size: 100
      queue-capacity: 2000000
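
Judging from the stack trace further down (io.cyanite.engine.queue.EngineQueue calling java.util.AbstractQueue.add), these settings appear to describe a bounded in-memory queue drained by a pool of worker threads. The following is only a rough Java analogue of that shape, to illustrate what pool-size and queue-capacity would bound; it is an assumption about the semantics, not Cyanite's actual code:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IngestQueueSketch {
    public static void main(String[] args) {
        int poolSize = 100;            // "pool-size": worker threads draining the queue (assumed meaning)
        int queueCapacity = 2_000_000; // "queue-capacity": maximum number of buffered events (assumed meaning)

        BlockingQueue<String> ingestQ = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService workers = Executors.newFixedThreadPool(poolSize);

        // Each worker takes events off the queue and hands them downstream
        // (in Cyanite's case, ultimately Cassandra writes).
        for (int i = 0; i < poolSize; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String event = ingestQ.take();
                        // write(event) would go here
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }

        // Producers that enqueue with add() fail fast once the buffer is full:
        // ingestQ.add(event) -> IllegalStateException("Queue full")
    }
}

If the downstream writers drain more slowly than carbon pushes metrics in, the queue eventually fills no matter how large queue-capacity is; a bigger capacity only delays the error.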

The load test has barely started when the following errors begin to appear continuously. NOTE that even after we stop the load test, the log just keeps emitting this type of error non-stop:

WARN [2017-12-08 09:46:13,337] nioEventLoopGroup-2-15 - io.netty.channel.DefaultChannelPipeline An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.IllegalStateException: Queue full
    at java.util.AbstractQueue.add(AbstractQueue.java:98)
    at io.cyanite.engine.queue.EngineQueue.engine_event_BANG_(queue.clj:44)
    at io.cyanite.engine.Engine.enqueue_BANG_(engine.clj:108)
    at io.cyanite.input.carbon$pipeline$fn__16369.invoke(carbon.clj:40)
    at io.cyanite.input.carbon.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:287)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:565)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:479)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
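
The exception itself comes from the JDK's bounded-queue contract rather than anything Cyanite-specific: AbstractQueue.add delegates to offer and throws IllegalStateException("Queue full") when the element is rejected. A minimal standalone reproduction of just that behaviour (plain JDK code, not Cyanite's):

import java.util.concurrent.ArrayBlockingQueue;

public class QueueFullDemo {
    public static void main(String[] args) {
        // Bounded queue with room for a single element.
        ArrayBlockingQueue<String> q = new ArrayBlockingQueue<>(1);

        q.add("first");                        // fits
        System.out.println(q.offer("second")); // prints false: offer() reports "full" without throwing

        // add() is specified to throw when the queue cannot accept the element,
        // which is the "IllegalStateException: Queue full" seen in the log above.
        q.add("third");
    }
}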

Do you have any ideas how to fix this, @pyr @ifesdjeen? I see that this issue was addressed here previously, but has it been fixed completely? Note that we have compiled Cyanite up to this commit because the latest version does not return multiple metrics in a single query; I have submitted an issue for that bug as well.

dancb10 commented 6 years ago

We have tried load testing with the following settings:

hosts | timers | interval | result
10 | 64*15 | 10 | 9600 per 10 seconds: OK
10 | 975*15 | 10 | 146250 per 10 seconds: OK
10 | 1956*15 | 10 | 293400 per 10 seconds: FAILED

This was done with two Cyanite nodes (16G RAM, 8 cores, SSD). The Cyanite daemon fails while we still have plenty of resources available.
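
For what it's worth, the totals in the table above are consistent with each timer publishing 15 sub-metrics per flush (a typical Codahale timer snapshot), i.e. hosts x timers x 15 per reporting interval. A quick sanity check of that arithmetic; the factor of 15 is inferred from the numbers above, not taken from the graphite-stresser documentation:

public class RateCheck {
    public static void main(String[] args) {
        int metricsPerTimer = 15; // assumed Codahale-style timer snapshot size
        int[][] runs = {
            {10, 64},   // hosts, timers -> expected 9600 per 10s
            {10, 975},  // -> 146250 per 10s
            {10, 1956}, // -> 293400 per 10s
        };
        for (int[] run : runs) {
            int hosts = run[0], timers = run[1];
            int perInterval = hosts * timers * metricsPerTimer;
            System.out.printf("%d hosts x %d timers -> %d metrics / 10s (%d per second)%n",
                    hosts, timers, perInterval, perInterval / 10);
        }
    }
}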

ifesdjeen commented 6 years ago

@dancb10 you can increase queue capacity.

What's your ingestion rate? How many events per second does Cyanite get approximately?

dancb10 commented 6 years ago

The numbers of events are given in my previous comment: there are 10 instances, each sending 1956*15 metrics every 10 seconds, so 293400 metrics are sent every 10 seconds, which is 29340 per second. Note that we are using pretty powerful instances, and we have also split the writes and reads across multiple instances: two instances that write and two that read, each behind its own ELB. The instances are c3.2xlarge (8 vCPUs, 15 GiB RAM, 2 x 80 GB SSD). We are using a 2 million queue size, so @ifesdjeen you are saying to increase this number? I will try with 20 million.
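
At roughly 29340 metrics per second, a bounded queue only buys time proportional to its capacity once the Cassandra writers fall behind. A back-of-the-envelope estimate of how long 2M and 20M slots would last, under the pessimistic assumption that the drain rate drops to near zero:

public class QueueFillEstimate {
    public static void main(String[] args) {
        double ingestPerSecond = 29_340;          // rate reported above
        long[] capacities = {2_000_000L, 20_000_000L};

        for (long capacity : capacities) {
            // Worst case: writers stalled, so the queue fills at the full ingest rate.
            double secondsUntilFull = capacity / ingestPerSecond;
            System.out.printf("capacity %,d -> full after ~%.0f seconds if writes stall%n",
                    capacity, secondsUntilFull);
        }
    }
}

Under that assumption, going from 2M to 20M pushes the failure out from roughly one minute to roughly ten minutes of sustained backlog, but does not remove it.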

ifesdjeen commented 6 years ago

> We are using 2 million queue size so @ifesdjeen you are saying to increase this number?

Hm, no, actually 2M should usually be OK...