pyr / cyanite

cyanite stores your metrics
http://cyanite.io

Old Gen constantly increasing #251

Closed sokratisg closed 7 years ago

sokratisg commented 7 years ago

We are currently evaluating Cyanite as our main carbon backend for production use.

We tried running the latest release, which seems to deprecate Elasticsearch indexing in favor of Cassandra; that actually seems pretty cool (one less dependency). All soakTest runs have been done using the latest (master) release + Cassandra 3.0.7 (Datastax package).

Unfortunately it seems that we're in the same situation as described in this issue comment: a constant increase of the JVM's Old Gen until Cyanite becomes unresponsive.

We tried different GC scenarios, different soakTest mpm values (10000 up to 30000), different heap sizes, and different queue size settings in conf.yml, but always with the same result.

Is anyone else running Cyanite in production, and if so, what are your metric volume and setup settings?

ifesdjeen commented 7 years ago

We know of installations that serve ~60K metrics per second on the ingest side.

3.0.7 wouldn't really work with Cyanite, as we require a more recent Cassandra for the index (starting with 3.5, which has SASI), and the in-memory index won't really work, precisely due to the promotion problem you have described.

I assume you're running the in-memory index, is that right? Also, it would be nice to see the config if possible, and to know which exact version you're running.

sokratisg commented 7 years ago

Actually, sorry for the typo: I am testing Cyanite with Cassandra 3.7.

Below is the Cyanite configuration I've been using so far:

logging:
  level: debug
  console: true
  files:
    - "/opt/cyanite/log/cyanite.log"
  overrides:
    io.cyanite: "debug"

drift:
  type: no-op

input:
  - type: "carbon"
    port: 2003
    host: 127.0.0.1

api:
  port: 8080
  host: 0.0.0.0

store:
  keyspace: "metric"
  cluster: 127.0.0.1

index:
  type: cassandra
  cluster: 127.0.0.1

queues:
  defaults:
    ingestq:
      pool-size: 100
      queue-capacity: 2000000
    writeq:
      pool-size: 100
      queue-capacity: 2000000

engine:
  rules:
    '^carbon\..*': [ "10s:1d", "5m:7d", "15m:30d" ]
    default: [ "60s:1d" ]

And here is how I start Cyanite:

java \
  -server \
  -ea \
  -Xms2G \
  -Xmx2G \
  -XX:+UseCompressedOops \
  -XX:+UseFastAccessorMethods \
  -Djava.net.preferIPv4Stack=true \
  -Xloggc:/opt/cyanite/log/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintHeapAtGC \
  -XX:+PrintTenuringDistribution \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+PrintPromotionFailure \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=5 \
  -XX:GCLogFileSize=10M \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:ParallelGCThreads=8 \
  -XX:ConcGCThreads=8 \
  -XX:+AlwaysPreTouch \
  -jar /opt/cyanite/bin/cyanite-0.5.1-standalone.jar \
  -f /opt/cyanite/etc/config.yml

ifesdjeen commented 7 years ago

Thank you. Could you also tell me which branch / SHA you built your 0.5.1 jar from?

sokratisg commented 7 years ago

cyanite# git branch
* (detached from 7cef5e7)
  master

I couldn't use the latest snapshot as it caused "Queue full" messages instantly on startup.

ifesdjeen commented 7 years ago

"Queue full" messages on startup mostly mean that messages were dropped in order to avoid node crash, which only means you have to increase your queue size in config:

queue:
  queue-capacity: 1048576

(or any larger power of two).

This would (usually) be enough to handle a load of 10-20K events per second without building up too big a backlog.

I agree the default capacity is unreasonably small. Sorry about that, I'll fix it on master.
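
For illustration only, here is how that suggestion might look when applied to the queues: block you posted above, assuming this build still reads the queues/defaults structure (2097152 = 2^21, the next power of two above the 2000000 you already configured):

queues:
  defaults:
    ingestq:
      pool-size: 100
      queue-capacity: 2097152    # power of two, >= the previous 2000000
    writeq:
      pool-size: 100
      queue-capacity: 2097152    # power of two, >= the previous 2000000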

I ran a long-running stress test and over the course of 4 days did not notice any memory leak: after a major GC, memory goes back to normal.

If you generate a new metric name every time, this may cause Cyanite to OOM, since we do not have a notion of expiring metrics (yet) and do not GC metrics. For a reasonable stress test with realistic Graphite loads you could use graphite-stresser; a fixed-name sketch is shown below.
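
As an illustration only (not the project's soakTest or graphite-stresser), the sketch below pushes a fixed set of metric names over the Carbon plaintext protocol so the set of indexed names stays bounded; the host, port, metric prefix, and interval are assumptions, and nc flags vary between netcat variants:

#!/usr/bin/env bash
# Hypothetical fixed-name load sketch: send the same 100 metric names every
# 10 seconds to the carbon listener configured above (127.0.0.1:2003),
# using the plaintext protocol line format "metric.path value timestamp".
HOST=127.0.0.1
PORT=2003
while true; do
  NOW=$(date +%s)
  for i in $(seq 1 100); do
    echo "stress.fixed.metric$i $RANDOM $NOW"
  done | nc -q 1 "$HOST" "$PORT"   # -q 1: close after stdin EOF (GNU netcat)
  sleep 10
done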

I'm happy to help, but I do not have enough information to debug this memory leak.

sokratisg commented 7 years ago

The stress test was run using soakTest, which I think generates a lot of different metrics (or does it not?). Can you confirm whether the test scenario was wrong for the current code?

ifesdjeen commented 7 years ago

I can't really confirm since I do not really know what the test scenario was...

sokratisg commented 7 years ago

The soakTest script included in your repo, with mpm varying from 10000 to 30000, but the problem remained the same.

ifesdjeen commented 7 years ago

Sorry, I cannot reproduce the problem. We have instances running stably for weeks without an OOM. You can use Java Flight Recorder to profile the heap and learn which objects are causing the problem.
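
A minimal sketch of how such a recording could be captured with jcmd, assuming an Oracle JDK 8 of that era (where JFR sits behind the commercial-features flags); <pid>, the recording name, and the output path are placeholders:

# The JVM must be started with JFR unlocked, e.g. add to the java flags above:
#   -XX:+UnlockCommercialFeatures -XX:+FlightRecorder
# Then record the running Cyanite process and open the result in
# Java Mission Control to see which objects dominate Old Gen:
jcmd <pid> JFR.start name=oldgen duration=600s filename=/tmp/cyanite.jfr
jcmd <pid> JFR.check        # shows whether the recording is still running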

But once again, if you're not running the latest master, I can't really help or debug the diverged code.

ifesdjeen commented 7 years ago

Configuration fixed; I can't reproduce this on master. Closing together with #252.