go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[Q] Slow performance and OOM restarts #504

lupuletic commented 1 year ago

Problem Description

We are running go-carbon 0.15.6 as a backend for go-carbonapi 0.14.0, and as the number of metrics has kept increasing (~1.6 TB of *.wsp files), performance has kept degrading. We're seeing OOM restarts of the Docker containers, as well as errors in the logs saying "Could not Expand Globs - Context Cancelled".

For the go-carbon servers, the hardware is 8 CPU cores, 48 GB RAM, and really fast storage/disks. The current config:

[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "tcp://%%METRIC_DESTINATION%%:2003"
max-cpu = 6
metric-interval = "1m0s"

[whisper]
data-dir = "/opt/go-carbon/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = ""
workers = 8
max-updates-per-second = 400
sparse-create = true
flock = true
enabled = true
hash-filenames = true
remove-empty-file = true

[cache]
max-size = 400000000
write-strategy = "noop"

[udp]
enabled = false

[tcp]
listen = ":2003"
enabled = true
buffer-size = 2000000
compression = ""

[pickle]
enabled = false

[carbonlink]
enabled = false

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
listen = ":80"
enabled = true
query-cache-enabled = true
query-cache-size-mb = 4096
find-cache-enabled = true
buckets = 10
max-globs = 10000
fail-on-max-globs = false
metrics-as-counters = false
trie-index = true
concurrent-index = true
realtime-index = 400000000
trigram-index = false
cache-scan = false
graphite-web-10-strict-mode = true
internal-stats-dir = "/opt/go-carbon/carbonserver"
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "10m0s"
stats-percentiles = [99, 95, 75, 50]

[dump]
enabled = false

[pprof]
enabled = false

[[logging]]
logger = ""
file = "stdout"
level = "error"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

For the go-carbonapi servers, the hardware is 14 CPU cores and 16 GB RAM, and the config file is:

listen: "0.0.0.0:80"
concurency: 1000
cache:
   type: "mem"
   size_mb: 1024
   defaultTimeoutSec: 5
cpus: 0
tz: ""
sendGlobsAsIs: true
maxBatchSize: 5000
idleConnections: 100
pidFile: ""
expireDelaySec: 10
logger:
    - logger: ""
      file: "stdout"
      level: "error"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
upstreams:
    tldCacheDisabled: true
    doMultipleRequestsIfSplit: true
    buckets: 10
    timeouts:
        global: "12s"
        afterStarted: "10s"
        connect: "200ms"
    concurrencyLimit: 0
    keepAliveInterval: "10s"
    maxIdleConnsPerHost: 100
    maxGlobs: 5000
    maxBatchSize: 5000
    backends:
      - "example1"
      - "example2"

One problem we are aware of is metrics from K8s applications, which create a lot of "dead" folders every time a pod is re-spun and therefore named differently. We're trying to move the K8s apps to a different metrics solution, but in the meantime we have set up a cronjob to clean up stale data/folders.
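For reference, the cleanup cronjob we run looks roughly like the sketch below (the root path, the 30-day retention window, and the bottom-up empty-directory prune are our own choices, not anything prescribed by go-carbon):

```python
#!/usr/bin/env python3
"""Sketch of a stale-whisper cleanup job; paths and retention are assumptions."""
import os
import time

def cleanup_stale(root: str, max_age_days: int = 30) -> int:
    """Delete *.wsp files untouched for max_age_days, then prune empty dirs.

    Returns the number of files removed.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    # Walk bottom-up so directories emptied by deletions can be pruned
    # in the same pass.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".wsp") and os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return removed

if __name__ == "__main__":
    # /opt/go-carbon/whisper matches data-dir in the config above.
    print(cleanup_stale("/opt/go-carbon/whisper", max_age_days=30))
```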

Can you please give us a hand with anything that looks out of the ordinary in our config? We're also considering increasing the hardware spec of the go-carbon storage nodes to 12 CPU cores and 64 GB RAM, but we believe some bits of our configuration could also be improved.

Many, many thanks!

deniszh commented 1 year ago

Hi @lupuletic

Sorry for the late reply, but from our experience your hardware is too limited for such a load. Data size is not that important, but the number of metrics is. The cleanup cronjob is a good thing, but it's no miracle. In our clusters we have millions of metrics (i.e. files) per node, and for that you need memory for 1) the go-carbon cache, 2) the go-carbon index, 3) the file system cache, and 4) the file system itself. So 64 GB RAM doesn't sound like a bad idea.
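As a back-of-envelope illustration of why the cache alone can dominate: with `max-size = 400000000` from the config above, a nearly full cache can eat a large fraction of 48 GB before the index and page cache get anything. The per-point and per-metric byte costs below are rough assumptions for illustration, not measured go-carbon internals:

```python
# All constants here are assumptions used for a rough sizing estimate.
CACHE_MAX_POINTS = 400_000_000       # [cache] max-size from the config above
BYTES_PER_POINT = 50                 # assumed per-point cost incl. key overhead
METRIC_FILES = 5_000_000             # hypothetical *.wsp count per node
INDEX_BYTES_PER_METRIC = 200         # assumed index cost per metric name

cache_gb = CACHE_MAX_POINTS * BYTES_PER_POINT / 1e9    # worst-case cache RAM
index_gb = METRIC_FILES * INDEX_BYTES_PER_METRIC / 1e9  # index RAM

print(f"worst-case cache: ~{cache_gb:.0f} GB, index: ~{index_gb:.0f} GB")
# → worst-case cache: ~20 GB, index: ~1 GB
```

Under these assumptions a full cache plus index leaves well under half the 48 GB node for the file system cache, which is exactly where whisper read performance comes from.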