graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0
5.89k stars 1.26k forks

Tags DB disappears for no apparent reason [Q] #2670

Closed turbopape closed 3 years ago

turbopape commented 3 years ago

Hey Guys,

I am operating a Kubernetes setup.

Graphite is sitting behind carbon-relay-ng and uses its stock whisper storage engine. My storage-schemas.conf is as follows:

    [carbon]
    pattern = ^carbon\.
    retentions = 10s:6h,1m:90d

    [default]
    pattern = .*
    retentions = 2m:6h,10m:7d,30m:30d,60m:360d

And my storage-aggregation.conf (note: the aggregation method must be one of `average`, `sum`, `min`, `max`, or `last` — `avg` is not valid):

    [default]
    pattern = .*
    xFilesFactor = 0.0
    aggregationMethod = average
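As a sanity check on the retentions above: each `interval:ttl` pair in storage-schemas.conf maps to one whisper archive holding `ttl / interval` points. A small sketch of that arithmetic (the parser and its names are mine, not part of Graphite):

```python
# Parse a storage-schemas.conf retention string such as
# "2m:6h,10m:7d,30m:30d,60m:360d" into (seconds_per_point, points) pairs.
UNITS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400, 'y': 31536000}

def to_seconds(spec: str) -> int:
    # "10s" -> 10, "6h" -> 21600, etc.
    return int(spec[:-1]) * UNITS[spec[-1]]

def parse_retentions(retentions: str) -> list[tuple[int, int]]:
    archives = []
    for pair in retentions.split(','):
        interval, ttl = pair.split(':')
        step = to_seconds(interval)
        # number of points = how many intervals fit in the retention window
        archives.append((step, to_seconds(ttl) // step))
    return archives

print(parse_retentions("2m:6h,10m:7d,30m:30d,60m:360d"))
# [(120, 180), (600, 1008), (1800, 1440), (3600, 8640)]
```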

Carbon-relay-ng is consuming tag-enabled messages from a RabbitMQ queue. Here is its config:

## Global settings ##
    # instance id's distinguish stats of multiple relays.
    # do not run multiple relays with the same instance id.
    # supported variables:
    #  ${HOST} : hostname
    instance = "${HOST}"

    ## System ##
    # this setting can be used to override the default GOMAXPROCS logic
    # it is ignored if the GOMAXPROCS environment variable is set
    max_procs = 2
    pid_file = "carbon-relay-ng.pid"
    # directory for spool files
    spool_dir = "spool"

    ## Logging ##
    # one of trace debug info warn error fatal panic
    # see docs/logging.md for level descriptions
    # note: if you used to use "notice", you should now use "info".
    log_level = "info"

    ## Admin ##
    admin_addr = "0.0.0.0:2004"
    http_addr = "0.0.0.0:8081"

    ## Inputs ##
    ### plaintext Carbon ###
    listen_addr = "0.0.0.0:2003"
    # close inbound plaintext connections if they've been idle for this long ("0s" to disable)
    plain_read_timeout = "0s"
    ### Pickle Carbon ###
    pickle_addr = "0.0.0.0:2013"
    # close inbound pickle connections if they've been idle for this long ("0s" to disable)
    pickle_read_timeout = "0s"

    ## Validation of inputs ##
    # you can also validate that each series has increasing timestamps
    validate_order = false

    # How long to keep track of invalid metrics seen
    # Useful time units are "s", "m", "h"
    bad_metrics_max_age = "24h"

    [[route]]
    key = 'carbon-default'
    type = 'sendAllMatch'
    # prefix = ''
    # notPrefix = ''
    # sub = ''
    # notSub = ''
    # regex = '.*'
    # notRegex = ''
    destinations = [
      'graphite-statsd.graphite.svc.cluster.local:2003 spool=true pickle=false'
    ]

    ### AMQP ###
    [amqp]
    amqp_enabled = true
    amqp_host = "aRabbitHost"
    amqp_port = 5672
    amqp_user = "SomeUser"
    amqp_password = "SomePassword"
    amqp_vhost = "/"
    amqp_exchange = "messages"
    amqp_queue = ""
    amqp_key = "metrics"
    amqp_durable = false
    amqp_exclusive = true

    ## Instrumentation ##
    [instrumentation]
    # in addition to serving internal metrics via expvar, you can send them to graphite/carbon
    # IMPORTANT: setting this to "" will disable flushing, and metrics will pile up and lead to OOM
    # see https://github.com/graphite-ng/carbon-relay-ng/issues/50
    # so for now you MUST send them somewhere. sorry.
    # (Also, the interval here must correspond to your setting in storage-schemas.conf if you use Grafana Cloud)
    graphite_addr = "graphite-statsd.graphite.svc.cluster.local:2003"
    graphite_interval = 10000  # in ms
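Since the relay is forwarding tag-enabled series, it may help to recall Graphite's tagged carbon format, where tags are appended to the metric name with semicolons (`name;tag1=val1;tag2=val2`). A minimal parser sketch (the helper name is mine, for illustration only):

```python
# Split a Graphite tagged series name, e.g. as sent over the plaintext
# protocol in "disk.used;datacenter=dc1;rack=a1 42 1633024800",
# into the base metric name and a dict of tags.
def parse_tagged_name(series: str) -> tuple[str, dict[str, str]]:
    name, *tag_specs = series.split(';')
    tags = dict(spec.split('=', 1) for spec in tag_specs)
    return name, tags

name, tags = parse_tagged_name("disk.used;datacenter=dc1;rack=a1")
# name == "disk.used", tags == {"datacenter": "dc1", "rack": "a1"}
```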

For no apparent reason, from time to time, I lose all my tag-related information. All gone.

I thought this was related to k8s upgrading hosts (and thereby somehow reinitializing volumes), and suspected carbon-relay-ng might be deleting tags when it loses its connection to RabbitMQ, etc., but all the other "normal" series are there and working fine. Sorry if this is a noob-ish question, but I have really explored every possible idea to no avail. Has anyone here experienced the same before? Would it be better if I used a Redis backend anyway? Thank you so much :)

deniszh commented 3 years ago

@turbopape : Redis is indeed the recommended production setup for the TagDB.
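For reference, switching graphite-web from the default SQLite-backed TagDB to Redis is a `local_settings.py` change. A minimal sketch, assuming a reachable Redis service (the hostname below is a placeholder, not from this thread):

```python
# graphite-web local_settings.py -- use the Redis tag database instead
# of the default LocalDatabaseTagDB, so tag data survives independently
# of the graphite-web pod's local storage.
TAGDB = 'graphite.tags.redis.RedisTagDB'

# Connection details for the Redis instance (placeholder values).
TAGDB_REDIS_HOST = 'redis.graphite.svc.cluster.local'
TAGDB_REDIS_PORT = 6379
TAGDB_REDIS_DB = 0
```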

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.