Closed sokratisg closed 7 years ago
We know installations that serve ~60K metrics per second on ingest side.
Cassandra 3.0.7 wouldn't really work with Cyanite, since the index requires a more recent Cassandra (3.5 or later, for SASI support), and the in-memory index won't work precisely because of the promotion problem you have described.
I assume you're running the in-memory index, is that right? Also, it would be nice to see your config if possible, and to know exactly which version you're running.
Actually, sorry for the typo: I am testing Cyanite with Cassandra 3.7.
Below is the Cyanite configuration I've been using so far:
```yaml
logging:
  level: debug
  console: true
  files:
    - "/opt/cyanite/log/cyanite.log"
  overrides:
    io.cyanite: "debug"
drift:
  type: no-op
input:
  - type: "carbon"
    port: 2003
    host: 127.0.0.1
api:
  port: 8080
  host: 0.0.0.0
store:
  keyspace: "metric"
  cluster: 127.0.0.1
index:
  type: cassandra
  cluster: 127.0.0.1
queues:
  defaults:
    ingestq:
      pool-size: 100
      queue-capacity: 2000000
    writeq:
      pool-size: 100
      queue-capacity: 2000000
engine:
  rules:
    '^carbon\..*': [ "10s:1d", "5m:7d", "15m:30d" ]
    default: [ "60s:1d" ]
```
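Each retention rule pairs a resolution with a retention period, e.g. `"10s:1d"` keeps 10-second points for one day. As a quick sanity check on these rules (a sketch assuming the standard Graphite-style `s`/`m`/`h`/`d` suffixes, not taken from the Cyanite source), the number of points each archive implies can be computed like this:

```python
# Compute how many data points a "resolution:retention" rule implies.
# Assumes Graphite-style duration suffixes: s, m, h, d.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def to_seconds(spec: str) -> int:
    value, unit = int(spec[:-1]), spec[-1]
    return value * UNITS[unit]

def points(rule: str) -> int:
    resolution, retention = rule.split(":")
    return to_seconds(retention) // to_seconds(resolution)

for rule in ["10s:1d", "5m:7d", "15m:30d", "60s:1d"]:
    print(rule, "->", points(rule), "points")
```

For instance, `"10s:1d"` works out to 8640 points per series in the highest-resolution archive.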
And here is how I start Cyanite:
```
java \
  -server \
  -ea \
  -Xms2G \
  -Xmx2G \
  -XX:+UseCompressedOops \
  -XX:+UseFastAccessorMethods \
  -Djava.net.preferIPv4Stack=true \
  -Xloggc:/opt/cyanite/log/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintHeapAtGC \
  -XX:+PrintTenuringDistribution \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+PrintPromotionFailure \
  -XX:+UseGCLogFileRotation \
  -XX:NumberOfGCLogFiles=5 \
  -XX:GCLogFileSize=10M \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.rmi.port=9999 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:ParallelGCThreads=8 \
  -XX:ConcGCThreads=8 \
  -XX:+AlwaysPreTouch \
  -jar /opt/cyanite/bin/cyanite-0.5.1-standalone.jar \
  -f /opt/cyanite/etc/config.yml
```
Thank you. Could you also tell me which branch / SHA you built your 0.5.1 jar from?
```
cyanite# git branch
* (detached from 7cef5e7)
  master
```
I couldn't use the latest snapshot, as it produced "Queue full" messages immediately on startup.
"Queue full" messages on startup mostly mean that messages were dropped in order to avoid crashing the node, which just means you have to increase the queue size in your config:
```yaml
queue:
  queue-capacity: 1048576
```
(or any larger power of two).
This would (usually) be enough to handle a load of 10-20K events per second without building up too large a backlog.
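As a rough sanity check on that sizing (my own back-of-the-envelope numbers, not from the Cyanite docs), a bounded queue of 1,048,576 entries buffers on the order of a minute of traffic at those ingest rates:

```python
# Rough backlog headroom for a bounded ingest queue (illustrative only).
queue_capacity = 1_048_576  # entries (a power of two)

for rate in (10_000, 20_000):  # events per second
    seconds = queue_capacity / rate
    print(f"{rate} ev/s -> ~{seconds:.0f}s of buffering")
```

So at 20K events/s the queue absorbs roughly 52 seconds of stalled writes before dropping begins.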
I agree default capacity is unreasonably small. Sorry about that, I'll fix that on master.
I ran a long-running stress test, and over the course of 4 days I did not notice any memory leak: after a major GC, memory usage returns to normal.
If you generate a new metric name every time, this may cause Cyanite to OOM, since we do not have a notion of expiring metrics (yet) and do not GC metrics. For a reasonable stress test with realistic Graphite loads you could use graphite-stresser.
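For illustration, here is a minimal load-generator sketch of my own (not graphite-stresser itself) that speaks the carbon plaintext protocol, where each line is `<path> <value> <timestamp>\n` sent over TCP to port 2003. Note that it reuses a fixed pool of metric names, which avoids the unbounded index growth described above:

```python
import random
import socket
import time

def carbon_line(path: str, value: float, timestamp: int) -> str:
    # Carbon plaintext protocol: "<path> <value> <timestamp>\n"
    return f"{path} {value} {timestamp}\n"

def send_batch(host: str, port: int, n_points: int) -> None:
    # Send n_points samples drawn from a fixed pool of 100 metric names,
    # so the index stays bounded regardless of how long the test runs.
    with socket.create_connection((host, port)) as sock:
        now = int(time.time())
        for i in range(n_points):
            path = f"stress.host{i % 100}.cpu"
            line = carbon_line(path, round(random.random(), 3), now)
            sock.sendall(line.encode("ascii"))

# Example (assumes a carbon listener on localhost:2003):
# send_batch("127.0.0.1", 2003, 10_000)
```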
I'm happy to help, but I do not have enough information to debug this memory leak.
The stress test was run using soakTest, which I think generates a lot of different metrics (or not?). Can you confirm whether that test scenario is wrong compared to the current code?
I can't really confirm since I do not really know what the test scenario was...
It was the soakTest script included in your repo, with mpm varying from 10000 to 30000, but the problem remained the same.
Sorry, I cannot reproduce the problem. We have instances that have been running stably for weeks without OOM. You can use Flight Recorder to profile the heap and learn which objects are causing the problem.
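For example (assuming an Oracle JDK 8, where Flight Recorder was still a commercial feature gated behind an unlock flag), a heap recording could be captured by adding flags along these lines to the launch command shown earlier; the filename and duration are illustrative:

```
java \
  -XX:+UnlockCommercialFeatures \
  -XX:+FlightRecorder \
  -XX:StartFlightRecording=duration=120s,filename=/opt/cyanite/log/cyanite.jfr \
  -jar /opt/cyanite/bin/cyanite-0.5.1-standalone.jar \
  -f /opt/cyanite/etc/config.yml
```

The resulting `.jfr` file can then be opened in Java Mission Control to see which object types dominate the old generation.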
But once again, if you're not running the latest master, I can't really help and/or debug on the diverged code.
Configuration fixed; can't reproduce on master. Closing together with #252.
We are currently evaluating Cyanite as our main carbon backend for production use.
We tried running the latest release, which seems to deprecate Elasticsearch indexing in favor of Cassandra; that actually seems pretty cool (one less dependency). All soakTest runs were done using the latest (master) build plus Cassandra 3.0.7 (Datastax package).
Unfortunately it seems that we're in the same situation as described in this issue's comments: a constant increase of the JVM's old generation until Cyanite becomes unresponsive.
We tried different GC scenarios, different soakTest mpm values (10000 up to 30000), different heap sizes, and different queue size settings in conf.yml, but always with the same result.
Is anyone else running Cyanite in production, and if so, what is your metric volume and what are your setup settings?