go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[BUG] Values of new metrics not renderable with realtime-index enabled. #395

Closed: timw6n closed this issue 3 years ago

timw6n commented 3 years ago

Describe the bug

Quite a similar situation to that reported in https://github.com/go-graphite/go-carbon/issues/372. We have a large cluster (~3 million metrics) and so make heavy use of the cache, with max-updates-per-second throttled down to keep disk activity within sensible limits.

We recently upgraded to 0.15.5 and at that point enabled realtime-index and concurrent-index. The issue we're now having is that the names of new metrics are visible immediately (and so can be found by Graphite find queries), but their datapoints cannot be retrieved from the cache (render queries) until the whisper files end up on disk. New datapoints on existing metrics pull through from the cache fine.
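
(Not part of the original report, but a minimal reproduction sketch of the sequence described above: send one datapoint for a brand-new metric over the plaintext TCP listener, then immediately issue a find and a render query against carbonserver. The ports match the config below; the metric name, the 2-second pause and format=json are illustrative assumptions.)

package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	now := time.Now().Unix()

	// 1. Send a datapoint for a metric that does not yet exist on disk
	//    (plaintext protocol on the [tcp] listener, port 2003 in the config below).
	conn, err := net.Dial("tcp", "localhost:2003")
	if err != nil {
		panic(err)
	}
	fmt.Fprintf(conn, "test.realtime.newmetric 42 %d\n", now)
	conn.Close()

	time.Sleep(2 * time.Second) // give the cache a moment to accept the point

	// 2. Find query: with realtime-index the new name is visible immediately.
	dump("http://localhost:80/metrics/find/?format=json&query=test.realtime.newmetric")

	// 3. Render query: the equivalent request (format=protobuf via carbonapi)
	//    is the one that returned HTTP 500 in the log below.
	dump(fmt.Sprintf(
		"http://localhost:80/render/?format=json&target=test.realtime.newmetric&from=%d&until=%d",
		now-300, now+60))
}

// dump prints the status and body of a GET request.
func dump(url string) {
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println(url, "->", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(url, "->", resp.Status, string(body))
}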

Logged error seems to be

[2021-01-19T16:58:32.475Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=protobuf&from=1611064712&target=[redacted]&until=1611075512", "peer": "10.11.173.11:44604", "carbonapi_uuid": "04877f92-6d7c-4ae4-b593-86d3123f7f12", "carbonzipper_uuid": "04877f92-6d7c-4ae4-b593-86d3123f7f12", "format": "carbonapi_v2_pb", "targets": ["[redacted]"], "runtime_seconds": 0.001309091, "reason": "panic during serving the request", "stack": "github.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).renderHandler.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:201\nruntime.gopanic\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/panic.go:969\nruntime.panicmem\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/panic.go:212\nruntime.sigpanic\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/signal_unix.go:720\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchFromCache\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/fetchfromcache.go:56\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchSingleMetric\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/fetchsinglemetric.go:141\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchSingleMetricV2\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/fetchsinglemetric.go:162\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchDataPB\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:543\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).prepareDataProto\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:394\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchWithCache\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:311\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).renderHandler\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:209\nnet/http.HandlerFunc.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2042\ngithub.com/go-graphite/go-carbon/carbonserver.TraceHandler.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/trace.go:140\ngithub.com/dgryski/httputil.TimeHandler.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/vendor/github.com/dgryski/httputil/times.go:26\ngithub.com/dgryski/httputil.TrackConnections.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/vendor/github.com/dgryski/httputil/track.go:40\nnet/http.HandlerFunc.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2042\nnet/http.(*ServeMux).ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2417\ngithub.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/vendor/github.com/NYTimes/gziphandler/gzip.go:287\nnet/http.HandlerFunc.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2042\nnet/http.serverHandler.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2843\nnet/http.(*conn).serve\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:1925", "error": "runtime error: invalid memory address or nil pointer dereference", "http_code": 500}

Go-carbon Configuration:

[common]
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "tcp://[redacted]:2003"
max-cpu = 4
metric-interval = "1m0s"

[whisper]
data-dir = "/opt/go-carbon/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = ""
workers = 8
max-updates-per-second = 400
sparse-create = false
flock = true
enabled = true
hash-filenames = true

[cache]
max-size = 500000000
write-strategy = "noop"

[tcp]
listen = ":2003"
enabled = true
buffer-size = 2000000
compression = ""

[carbonlink]
enabled = false

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
listen = ":80"
enabled = true
query-cache-enabled = true
query-cache-size-mb = 4096
find-cache-enabled = true
buckets = 10
max-globs = 1000
fail-on-max-globs = false
metrics-as-counters = false
trie-index = true
concurrent-index = true
realtime-index = 1000000
trigram-index = false
cache-scan = false
graphite-web-10-strict-mode = true
internal-stats-dir = "/opt/go-carbon/carbonserver"
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "10m0s"
stats-percentiles = [99, 95, 75, 50]

[dump]
enabled = false

[pprof]
enabled = false

[[logging]]
logger = ""
file = "stdout"
level = "error"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

Storage schemas

Everything relevant to this is on 1m:30d retention.
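
(For completeness, a minimal storage-schemas.conf stanza matching that retention would look something like the following; the section name and pattern are illustrative only.)

[default]
pattern = .*
retentions = 1m:30d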

bom-d-van commented 3 years ago

hi @timw6n-thg, can you try enabling cache-scan as well, to see if the panic is resolved?

realtime-index and concurrent-index reuse part of its implementation.
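
(For reference, the suggestion amounts to flipping a single flag in the [carbonserver] section of the config above, leaving the index settings as they are:)

[carbonserver]
# ...other settings unchanged from the config above...
trie-index = true
concurrent-index = true
realtime-index = 1000000
cache-scan = true   # was false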

bom-d-van commented 3 years ago

I have also pushed a fix; with the new changes, we don't have to enable cache-scan when using realtime-index.

timw6n commented 3 years ago

Thanks for that fix @bom-d-van. I've applied the patch to my cluster and it seems to have improved the situation.

What seems to be happening now is that the metrics are added to the index immediately (and are renderable at that point 😀) but are removed from it the next time the disk scan runs, presumably because I have cache-scan = false.

I will give enabling cache-scan a go tomorrow. I have previously had difficulty with it on older versions, though, presumably because of the volume of cached metrics (around 130 million datapoints at the moment, to give you an idea of scale).

bom-d-van commented 3 years ago

but are being removed the next time the disk scan runs, presumably as I have cache-scan = false

@timw6n-thg Hmm, this sounds odd. Was concurrent-index still enabled in the test?

timw6n commented 3 years ago

Yes, that was enabled. Config unchanged from above.

bom-d-van commented 3 years ago

Just in case, was there any restart before the disk scan during your test? If so, I might have an explanation.

timw6n commented 3 years ago

No restarts I'm afraid.

timw6n commented 3 years ago

Other than the one when I deployed the new go-carbon version, of course, but that was before I started sending the new metrics.

bom-d-van commented 3 years ago

Oh, I think I have a good explanation now. Yeah, it's a bug in realtime index.

But your scan-frequency is 10m, so that means that even after 10 minutes the new metrics are still not flushed to disk.

timw6n commented 3 years ago

Yeah, we're running a queue write-out time of about 2 hours at the moment, which I'm aware is probably not a normal configuration. Writes are throttled down a lot to protect the underlying storage.
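
(For context, that figure is consistent with the numbers earlier in the thread: roughly 3 million metrics written out at max-updates-per-second = 400, assuming the limit applies to the persister as a whole rather than per worker, gives

3,000,000 metrics / 400 updates per second ≈ 7,500 s ≈ 2 hours

for a full write-out, so most brand-new metrics are still cache-only when the 10-minute disk scan runs.)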

bom-d-van commented 3 years ago

Will see how I can fix the issue. cache-scan is probably not helpful for this case, but you can still give it another shot.


If storage is a big factor for your cluster, have you tried out the compressed feature? There should be some performance gains with it, but you would lose out-of-order updates on the metrics.

https://github.com/go-graphite/go-whisper#compressed-format
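
(For reference, a sketch of what enabling it might look like; to the best of my recollection the toggle lives in the [whisper] section, but the exact option name is an assumption here, so please verify against the go-carbon configuration reference and the go-whisper document above before relying on it.)

[whisper]
# ...existing settings from the config above...
compressed = true   # assumed option name; enables the compressed (cwhisper) format for new files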

bom-d-van commented 3 years ago

@timw6n-thg I was wrong again. Actually, enabling cache-scan would help avoid the issue. But again, I have pushed a fix to make sure that new, un-flushed metrics can be re-indexed after a disk scan: https://github.com/go-graphite/go-carbon/pull/396

(I also handled a bug related to cache-scan where new metrics weren't removed from memory after being persisted; not sure if it's related to the problem you had when testing cache-scan.)

timw6n commented 3 years ago

That last patch seems to have done the trick. New metrics pulling through correctly straight away. Thank you.