go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License
802 stars 123 forks

Not able to pull metrics for more than 3 months #409

Open rickyari opened 3 years ago

rickyari commented 3 years ago

We have a go-carbon setup in our org and we ingest millions of metrics every minute, but somehow I am not able to visualize metrics older than 3 months for one of our Graphite clusters. The query times out after exactly 60 seconds. I have verified all the configs (go-carbon, carbonapi, c-relay) and increased the timeouts to 5 minutes wherever I could find the setting, but I am still unable to query data older than 3 months. I am happy to share configs. Please suggest what could be the reason for this behavior.

deniszh commented 3 years ago

Sure, @rickyari, please share your configs. Timeouts should be adjustable.

rickyari commented 3 years ago

@deniszh This is the go-carbon.conf from one of the nodes in a three node cluster.

[common]
user = "root"
graph-prefix = "go-carbon.agents.{host}"

# controls GOMAXPROCS, which itself controls the maximum number
# of actively executing threads; those blocked in syscalls
# are NOT part of this limit
max-cpu = 8
metric-interval = "1m0s"

[whisper]
data-dir = "/mnt/array1/graphite/whisper"
schemas-file = "/etc/go-carbon/whisper-schemas.conf"
aggregation-file = ""
workers = 8
max-updates-per-second = 0
sparse-create = true
enabled = true

[cache]
max-size = 1000000
write-strategy = "noop"

[pickle]
enabled = false

[tcp]
listen = ":2003"
enabled = true

[udp]
enabled = false

[carbonserver]
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "180s" # default 60s
write-timeout = "60s"
query-cache-enabled = true
query-cache-size-mb = 0
find-cache-enabled = true
trigram-index = false
scan-frequency = "5m0s"
max-globs = 100
graphite-web-10-strict-mode = true
internal-stats-dir = ""

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"
query-timeout = "300ms"

[dump]
# Enable dump/restore function on USR2 signal
enabled = true
# Directory for storing dump data. Should be writable by carbon
path = "/mnt/array1"

[pprof]
listen = "localhost:7007"
enabled = false

# Default logger
[[logging]]
# logger name
# available loggers:
# * "" - default logger for all messages without configured special logger
# @TODO
logger = ""
# Log output: filename, "stderr", "stdout", "none", "" (same as "stderr")
file = "/var/log/go-carbon/go-carbon.log"
# Log level: "debug", "info", "warn", "error", "dpanic", "panic", and "fatal"
level = "error"
# Log format: "json", "console", "mixed"
encoding = "mixed"
# Log time format: "millis", "nanos", "epoch", "iso8601"
encoding-time = "iso8601"
# Log duration format: "seconds", "nanos", "string"
encoding-duration = "seconds"

ritmas commented 3 years ago

@rickyari what's the content of file /etc/go-carbon/whisper-schemas.conf?

deniszh commented 3 years ago

Also, please show carbonapi version and config - timeout can happen on front-end side too

rickyari commented 3 years ago

/etc/go-carbon/whisper-schemas.conf has these entries

[default_1min_for_365day]
pattern = ^.
retentions = 60s:365d
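For reference, that retention translates into a fixed whisper file size per metric: 60s:365d is one point per minute for a year, and whisper stores 12 bytes per point (4-byte timestamp + 8-byte float value). A quick back-of-envelope sketch of the numbers (not go-carbon's actual allocation code):

```python
# Back-of-envelope for the 60s:365d retention above.
POINT_SIZE = 12  # bytes per whisper point (4-byte timestamp + 8-byte float)

points = 365 * 24 * 60                     # one point per minute for a year
file_mb = points * POINT_SIZE / 1024 / 1024

print(points)             # 525600 points per metric
print(round(file_mb, 1))  # ~6.0 MB per .wsp file, plus a small header
```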

rickyari commented 3 years ago

/etc/carbonapi/carbonapi.yaml :

# Need to be URL, http or https
# This url specifies the backend or a loadbalancer
#
# If you are using carbonzipper you should set it to
# zipper's url
#
# If you are using plain go-carbon or graphite-clickhouse
# you should set it to URL of go-carbon's carbonserver module
# or graphite-clickhouse's http url.
# Listen address, should always include hostname or ip address and a port.
listen: ":8081"
# Max concurrent requests to CarbonZipper
concurency: 20
cache:
   # Type of caching. Valid: "mem", "memcache", "null"
   type: "mem"
   # Cache limit in megabytes
   size_mb: 10240
   # Default cache timeout value. Identical to DEFAULT_CACHE_DURATION in graphite-web.
   defaultTimeoutSec: 60
   # Only used by memcache type of cache. List of memcache servers.
#   memcachedServers:
#       - "127.0.0.1:1234"
#       - "127.0.0.2:1235"
# Amount of CPUs to use. 0 - unlimited
cpus: 0
# Timezone, default - local
tz: ""
functionsConfig:
    graphiteWeb: /etc/carbonapi/graphiteWeb.yaml
# If 'true', carbonapi will send requests as is, with globs and braces
# Otherwise for each request it will generate /metrics/find and then /render
# individual metrics.
# true --- faster, but will cause carbonzipper to consume much more RAM.
#
# For some backends (e.g. graphite-clickhouse) this *MUST* be set to true in order
# to get reasonable performance
#
# For go-carbon --- it depends on how you use it.
sendGlobsAsIs: true
# If sendGlobsAsIs is set and the resulting response would be larger than maxBatchSize,
# carbonapi will revert to the old behavior. This allows you to use the benefits of passing
# globs as is while keeping memory usage within sane limits.
#
# For go-carbon you might want to keep it within some reasonable limit; 100 is a good "safe" default
#
# For some backends (e.g. graphite-clickhouse) you might want to set it to some insanely high value, like 100000
maxBatchSize: 100
graphite:
    # Host:port where to send internal metrics
    # Empty = disabled
    host: "<Removed for organisation privacy>"
    interval: "60s"
    prefix: "carbon.api"
    # rules on how to construct the metric name. For now only {prefix} and {fqdn} are supported.
    # {prefix} will be replaced with the content of {prefix}
    # {fqdn} will be replaced with the fqdn
    pattern: "{prefix}.{fqdn}"
# Maximum idle connections to carbonzipper
idleConnections: 10
pidFile: ""
# See https://github.com/go-graphite/carbonzipper/blob/master/example.conf#L70-L108 for format explanation
upstreams:
    # Number of 100ms buckets to track request distribution in. Used to build
    # 'carbon.zipper.hostname.requests_in_0ms_to_100ms' metric and friends.
    # Requests beyond the last bucket are logged as slow
    # (default of 10 implies "slow" is >1 second).
    buckets: 10

    timeouts:
        # Maximum backend request time for find requests.
        find: "10s"
        # Maximum backend request time for render requests. This is a total timeout and doesn't take in-flight requests into account
        render: "120s"  ## default 60s
        # Timeout to connect to the server
        connect: "200ms"

    # Number of concurrent requests to any given backend - default is no limit.
    # If set, you likely want >= MaxIdleConnsPerHost
    concurrencyLimitPerServer: 0

    # Configures how often keep alive packets will be sent out
    keepAliveInterval: "30s"

    # Control http.MaxIdleConnsPerHost. Large values can lead to more idle
    # connections on the backend servers which may bump into limits; tune with care.
    maxIdleConnsPerHost: 100

    # "http://host:port" array of instances of carbonserver stores
    # This is the *ONLY* config element in this section that MUST be specified.
    backendsv2:
        backends:
          -
            groupName: "gocarbon"
            protocol: "carbonapi_v3_pb"
            lbMethod: "broadcast"
            maxTries: 3
            maxBatchSize: 100
            servers:
            <Removed for organisation privacy>

#    carbonsearch:
        # Instance of carbonsearch backend
#        backend: "http://127.0.0.1:8070"
        # carbonsearch prefix to reserve/register
#        prefix: "virt.v1.*"

    # Enable compatibility with graphite-web 0.9
    # This will affect graphite-web 1.0+ with multiple cluster_servers
    # Default: disabled
    graphite09compat: false
# If not zero, enabled cache for find requests
# This parameter controls when it will expire (in seconds)
# Default: 600 (10 minutes)
expireDelaySec: 10
graphTemplates: /etc/carbonapi/graphTemplates.yaml
logger:
    - logger: ""
      file: "/var/log/carbonapi.log"
      level: "warn"
      encoding: "json"

rickyari commented 3 years ago

@deniszh Can you please suggest what could be wrong with the config, given that Grafana is not showing data older than 3 months?

bom-d-van commented 3 years ago

Hi @rickyari, how many metrics were you trying to query? Were any errors logged in go-carbon.log and go-carbon-access.log?

deniszh commented 3 years ago

@bom-d-van : Do we have some internal limits for metric count?

deniszh commented 3 years ago

@rickyari : sorry, I somehow lost our conversation notifications. The configs look OK, but could you try moving the timeouts for the backends inside the backends section? IIRC there was a bug with value propagation there:

backendsv2:
        backends:
          -
            groupName: "gocarbon"
            protocol: "carbonapi_v3_pb"
            lbMethod: "broadcast"
            maxTries: 3
            maxBatchSize: 100
            timeouts:
                find: "10s"
                render: "120s"
                connect: "200ms"
            servers:

If it doesn't help I'm afraid you need to increase logger level to info and check logs.

bom-d-van commented 3 years ago

Do we have some internal limits for metric count?

@deniszh Yep, I think so. But it probably isn't the issue that @rickyari is having.

I was just wondering if the query was trying to fetch too many metrics or too much data.

deniszh commented 3 years ago

@bom-d-van : probably yes, but why is it timing out after 60 seconds if all the timeouts in the configs above are set higher than 60 seconds?

rickyari commented 3 years ago

@bom-d-van @deniszh Please let me know if you need any other information from my side for troubleshooting further into this.

deniszh commented 3 years ago

@rickyari : which version of carbonapi do you use? did you try to move timeouts inside backends section as I described above?

And @bom-d-van asked:

how many metrics were you trying to query? Was there any errors logged in go-carbon.log and go-carbon-access.log?

rickyari commented 3 years ago

@deniszh I do not get your comment regarding moving the timeouts to the backends section. Are they not already in the backends section? Pardon my ignorance.

backendsv2:
        backends:
          -
            groupName: "gocarbon"
            protocol: "carbonapi_v3_pb"
            lbMethod: "broadcast"
            maxTries: 3
            maxBatchSize: 100
            timeouts:
                find: "10s"
                render: "120s"
                connect: "200ms"
            servers:

deniszh commented 3 years ago

@rickyari : Indeed. Did you try to put it into backends section as I suggested?

The configs look OK, but could you try moving the timeouts for the backends inside the backends section? IIRC there was a bug with value propagation there:

Civil commented 3 years ago

@rickyari can you elaborate on "not able to pull metrics":

  1. What are the symptoms? Do you get "500" from carbonapi?
  2. What is in the logs of carbonapi for that query?
  3. What is in the logs of go-carbon for that query?
  4. What does the query look like? By that I mean, first of all, how many metrics does it fetch?
  5. How many backends do you have? Do they all have a copy of the data, or do only some of them? And if so, what's the topology?
  6. How many concurrent queries do you run? It would be great if you could check CPU, RAM, and disk I/O usage on backends and frontends during the query.
  7. What hardware do you have? Just FYI, 3 months of per-minute data is 1440 * 90 = 129600 points, which is about 1.5 MB of data to be fetched for a single metric; if your query has globs and matches multiple metrics, that can easily cross the boundary of gigabytes to be sent to the frontend, and it may matter how much RAM you have and what kind of network you have (as well as how far apart, in terms of network latency, those servers are).
  8. Versions of go-carbon and carbonapi
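The arithmetic in point 7 extends to a whole query: multiply the ~1.5 MB per metric by however many metrics the glob matches. A small sketch (12 bytes/point is whisper's on-disk size; the wire size varies by protocol, but the order of magnitude is the same, and the match counts below are hypothetical):

```python
POINT_SIZE = 12  # bytes per whisper point

points = 1440 * 90                         # per-minute points in 90 days
per_metric_mb = points * POINT_SIZE / 1024 / 1024

for n in (1, 100, 1000):                   # hypothetical glob match counts
    print(n, "metrics ->", round(n * per_metric_mb, 1), "MB")
```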

P.S. The actual query and data are not that important; what is important is how many metrics it queries. So if it's sensitive for you, you can replace the actual parts of the metric name with something else, just keep the globs where they are.

P.P.S. @deniszh: how values are propagated actually depends on the carbonapi version. In the latest stable it should work as expected, and it shouldn't matter whether you define a global timeout or a per-backend one. But the version in use here was never specified.