pyr / cyanite

cyanite stores your metrics
http://cyanite.io
Other
446 stars 79 forks source link

Cyanite gets stuck #252

Closed pawellesniewski closed 7 years ago

pawellesniewski commented 7 years ago

We have a problem running Cyanite for more than ~ one day. It gets stuck, i mean it receives metrics:

cyanite-metrics

around 1.3k metrics per second (different metrics, there could be a lot of new metrics every hour). But datapoints are no longer saved to cassandra, no new datapoints are available through api:

{

    "from": 1477039680,
    "to": 1477040280,
    "step": 60,
    "series": {
        "xxx.some.metric.index.GET.2xx.count": [
            5139613.0,
            5139613.0,
            5139613.0,
            5139613.0,
            null,
            null,
            null,
            null,
            null,
            null,
            null
        ]
    }

}

tcpdump and graphite (replicated traffic) confirms that metrics are send.

Cyanite storage writer decreases to 0:

cyanite-metrics2

PS-Old-Gen.used looks similar to case describe in #251

cyanite-metric3

Cyanite: version 0.5.1, master branch, latest commit: #e5da4e8 Java version:

java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Run command:

JAR="$EXTRA_CLASSPATH:/usr/lib/cyanite/cyanite.jar"
CONFIG="/etc/cyanite/cyanite.yaml"
exec java \
    -server \
    -ea \
    -Xms4g \
    -Xmx12g \
    -Xloggc:/var/log/cyanite/gc.log \
    -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps \
    -XX:+PrintHeapAtGC \
    -XX:+PrintTenuringDistribution \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintPromotionFailure \
    -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=10 \
    -XX:GCLogFileSize=100M \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=9999 \
    -Dcom.sun.management.jmxremote.rmi.port=9999 \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -cp "$JAR" io.cyanite \
    -f "$CONFIG"

Configuration:

queues:
  defaults:
    ingestq:
      pool-size: 16
      queue-capacity: 2000000
    writeq:
      pool-size: 16
      queue-capacity: 2000000
reporter:
  metrics:
    reporters:
      graphite:
        interval: 60
        opts:
          host: graphite.local
          port: 2003
          prefix: myprefix
engine:
  rules:
    # "web.*\\.cpu": [ "5s:1h", "30s:1d" ]
    default: [ "60s:3d", "10m:7d" ]
api:
  host: 0.0.0.0
  port: 8080
  disabled: false
input:
  - type: carbon
    host: 0.0.0.0
    port: 2003
index:
  type: cassandra
  cluster:
    - cassanra1
    - cassanra2
    - cassanra3
    - cassanra4
  keyspace: 'metric'
drift:
  type: agent
store:
  cluster:
    - casssandra1
    - casssandra2
    - casssandra3
    - casssandra4
  keyspace: 'metric'
logging:
  level: info
  console: true
  files:
    - "/var/log/cyanite/cyanite.log"

Some GC logs (GC (Allocation Failure)):

2016-10-21T10:51:04.461+0200: 69744.098: Total time for which application threads were stopped: 0.0004041 seconds, Stopping threads took: 0.0000699 seconds
{Heap before GC invocations=6720 (full 4):
 PSYoungGen      total 4176896K, used 4170656K [0x00000006c0000000, 0x00000007c0000000, 0x00000007c0000000)
  eden space 4160512K, 100% used [0x00000006c0000000,0x00000007bdf00000,0x00000007bdf00000)
  from space 16384K, 61% used [0x00000007bdf00000,0x00000007be8e8000,0x00000007bef00000)
  to   space 16896K, 0% used [0x00000007bef80000,0x00000007bef80000,0x00000007c0000000)
 ParOldGen       total 5066240K, used 2459755K [0x00000004c0000000, 0x00000005f5380000, 0x00000006c0000000)
  object space 5066240K, 48% used [0x00000004c0000000,0x000000055621ac60,0x00000005f5380000)
 Metaspace       used 45340K, capacity 45839K, committed 46208K, reserved 1087488K
  class space    used 7872K, capacity 7988K, committed 8064K, reserved 1048576K
2016-10-21T10:51:09.301+0200: 69748.937: [GC (Allocation Failure)
Desired survivor size 16252928 bytes, new threshold 1 (max 15)
[PSYoungGen: 4170656K->6656K(4176384K)] 6630411K->2469027K(9242624K), 0.1634400 secs] [Times: user=1.25 sys=0.00, real=0.16 secs]
Heap after GC invocations=6720 (full 4):
 PSYoungGen      total 4176384K, used 6656K [0x00000006c0000000, 0x00000007bfe80000, 0x00000007c0000000)
  eden space 4161024K, 0% used [0x00000006c0000000,0x00000006c0000000,0x00000007bdf80000)
  from space 15360K, 43% used [0x00000007bef80000,0x00000007bf600000,0x00000007bfe80000)
  to   space 15872K, 0% used [0x00000007bdf80000,0x00000007bdf80000,0x00000007bef00000)
 ParOldGen       total 5066240K, used 2462371K [0x00000004c0000000, 0x00000005f5380000, 0x00000006c0000000)
  object space 5066240K, 48% used [0x00000004c0000000,0x00000005564a8c60,0x00000005f5380000)
 Metaspace       used 45340K, capacity 45839K, committed 46208K, reserved 1087488K
  class space    used 7872K, capacity 7988K, committed 8064K, reserved 1048576K
}

We are getting lots of "Queue full":

WARN [2016-10-21 13:55:58,368] nioEventLoopGroup-2-2 - io.netty.channel.DefaultChannelPipeline An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.IllegalStateException: Queue full
    at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.8.0_111]
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source) ~[na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_111]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_111]
    at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[cyanite.jar:na]
    at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[cyanite.jar:na]
    at io.cyanite.engine.queue.EngineQueue.engine_event_BANG_(queue.clj:40) ~[cyanite.jar:na]
    at io.cyanite.engine.Engine.enqueue_BANG_(engine.clj:108) ~[cyanite.jar:na]
    at io.cyanite.input.carbon$pipeline$fn__6635.invoke(carbon.clj:40) ~[cyanite.jar:na]
    at io.cyanite.input.carbon.proxy$io.netty.channel.ChannelInboundHandlerAdapter$ff19274a.channelRead(Unknown Source) ~[cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite.jar:na]
    at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:152) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite.jar:na]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:276) [cyanite.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:354) [cyanite.jar:na]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:307) [cyanite.jar:na]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:293) [cyanite.jar:na]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:840) [cyanite.jar:na]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [cyanite.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [cyanite.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [cyanite.jar:na]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [cyanite.jar:na]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [cyanite.jar:na]
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) [cyanite.jar:na]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) [cyanite.jar:na]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]

if you need any additional data/logs please write :)

ifesdjeen commented 7 years ago

"Queue full" messages on startup mostly mean that messages were dropped in order to avoid node crash, which only means you have to increase your queue size in config:

queue:
  queue-capacity: 1048576

I'll change the default (which is way to low) today. Sorry about that.

pawellesniewski commented 7 years ago

Hi @ifesdjeen , thanks for Your reply.

I have queue-capacity set to 2.000.000, i get this messages but Cyanite hangs after several hours. Is this related with queue size?

ifesdjeen commented 7 years ago

No, this is not normal at all. Which commit / RSA are you on? This sounds as if you're not running latest version, as we've already fixed the problem you describe.

ifesdjeen commented 7 years ago

Also, please note that config structure for queue has changed, it's not split into write and ingest queues anymore (if you're running latest master)

pawellesniewski commented 7 years ago
Cyanite: version 0.5.1, master branch, latest commit: #e5da4e8

build deb with lein tool.

ifesdjeen commented 7 years ago

Yes, this is an outdated version.

pawellesniewski commented 7 years ago

I am a little bit confused, so what is the current version (commit)?

ifesdjeen commented 7 years ago

https://github.com/pyr/cyanite/commit/9d49e8abc6db746577687ae0d865b6351f7c056e (current master).

We're working on publishing artefacts.

pawellesniewski commented 7 years ago

Sory, i pasted wrong one (Merge commit id): I build from:

commit 9d49e8abc6db746577687ae0d865b6351f7c056e
Merge: e5da4e8 bf1e05c
Author: αλεx π <alexp@coffeenco.de>
Date:   Sat Oct 1 22:51:42 2016 +0200
ifesdjeen commented 7 years ago

In this case, your configuration is for the old scheduling system. You should use the new format:

queue:
  queue-capacity: 1048576

As there is just one queue now...

pawellesniewski commented 7 years ago

Thanks a lot, i will change the configuration and see what will happen during the weekend.

ifesdjeen commented 7 years ago

👍 thanks!