[Q] Slow performance and OOM restarts

Problem Description

We are running go-carbon 0.15.6 as a backend for go-carbonapi 0.14.0 and as the number of metrics has kept increasing (~1.6TB of *.wsp files), performance kept degrading. We're getting the OOM restarts of the docker containers and also errors in the logs saying "Could not Expand Globs - Context Cancelled"

For the go-carbon servers, the hardware is 8 CPU Cores, 48 GB RAM and a really fast storage / disk. The current config:

[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "tcp://%%METRIC_DESTINATION%%:2003"
max-cpu = 6
metric-interval = "1m0s"

[whisper]
data-dir = "/opt/go-carbon/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = ""
workers = 8
max-updates-per-second = 400
sparse-create = true
flock = true
enabled = true
hash-filenames = true
remove-empty-file = true

[cache]
max-size = 400000000
write-strategy = "noop"

[udp]
enabled = false

[tcp]
listen = ":2003"
enabled = true
buffer-size = 2000000
compression = ""

[pickle]
enabled = false

[carbonlink]
enabled = false

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
listen = ":80"
enabled = true
query-cache-enabled = true
query-cache-size-mb = 4096
find-cache-enabled = true
buckets = 10
max-globs = 10000
fail-on-max-globs = false
metrics-as-counters = false
trie-index = true
concurrent-index = true
realtime-index = 400000000
trigram-index = false
cache-scan = false
graphite-web-10-strict-mode = true
internal-stats-dir = "/opt/go-carbon/carbonserver"
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "10m0s"
stats-percentiles = [99, 95, 75, 50]

[dump]
enabled = false

[pprof]
enabled = false

[[logging]]
logger = ""
file = "stdout"
level = "error"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

For go-carbonapi servers, the hardware is 14 CPU Cores and 16GB RAM and the config file is:

listen: "0.0.0.0:80"
concurency: 1000
cache:
   type: "mem"
   size_mb: 1024
   defaultTimeoutSec: 5
cpus: 0
tz: ""
sendGlobsAsIs: true
maxBatchSize: 5000
idleConnections: 100
pidFile: ""
expireDelaySec: 10
logger:
    - logger: ""
      file: "stdout"
      level: "error"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
upstreams:
    tldCacheDisabled: true
    doMultipleRequestsIfSplit: true
    buckets: 10
    timeouts:
        global: "12s"
        afterStarted: "10s"
        connect: "200ms"
    concurrencyLimit: 0
    keepAliveInterval: "10s"
    maxIdleConnsPerHost: 100
    maxGlobs: 5000
    maxBatchSize: 5000
    backends:
      - "example1"
      - "example2"

One problem we are aware of are metrics from K8s applications, where they create a lot of "dead" folders every-time a pod is being re-spun and therefore named differently. We're trying to move K8s apps to a different metrics solution, but in the meantime we have setup a cronjob to cleanup stale data / folders.

Can you please give us a hand with anything looking out of the ordinary within our config? We're also considering increasing the hardware spec for the go-carbon storage nodes to 12 CPU Cores and 64GB RAM, but we also believe some bits could also be improved within our configuration.

Many, many thanks!

go-graphite / go-carbon

[Q] Slow performance and OOM restarts #504