Closed — timw6n closed this issue 3 years ago.
hi @timw6n-thg, can you try enabling `cache-scan` as well, to see if the panic is resolved? `realtime-index` and `concurrent-index` reuse part of its implementation.

I have also pushed a fix; with the new changes, we don't have to enable `cache-scan` when using `realtime-index`.
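For reference, these options live in go-carbon's TOML configuration. A minimal sketch of the relevant section — key names as I understand them from the go-carbon docs, and the values here are placeholders, not recommendations:

```toml
[carbonserver]
# How often the on-disk metric tree is rescanned to rebuild the index.
scan-frequency = "10m0s"
# Also index metrics that only exist in the write cache (not yet on disk).
cache-scan = true
# Maintain the index concurrently instead of rebuilding it wholesale.
concurrent-index = true
# Buffer size for indexing newly received metrics in (near) real time;
# 0 disables realtime indexing.
realtime-index = 100000
```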
Thanks for that fix @bom-d-van. I've applied the patch to my cluster and it seems to have improved the situation.
What's happening now seems to be that the metrics are being added to the index immediately (and are now renderable at that time 😀) but are being removed the next time the disk scan runs, presumably as I have cache-scan = false.
I will give enabling cache-scan a go tomorrow. Have previously had difficulty with that though on older versions, presumably because of the volume of cached metrics (130 million data-points atm to give you an idea of scale).
> but are being removed the next time the disk scan runs, presumably as I have `cache-scan = false`
@timw6n-thg Hmm, this sounds odd. Was `concurrent-index` still enabled in the test?
Yes, that was enabled. Config unchanged from above.
Just in case, was there any restart before the disk scan during your test? If so, I might have an explanation.
No restarts I'm afraid.
Other than the one when I deployed the new go-carbon version of course, but that was before we started sending the new metrics.
Oh, I think I have a good explanation now. Yeah, it's a bug in realtime index.
But your `scan-frequency` is 10m, so even after 10 minutes the new metrics are still not flushed to disk.
Yeah, running a queue writeout time of about 2 hours at the moment, which I'm aware is probably not a normal configuration. Writes throttled down a lot to protect the underlying storage.
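To give a rough sense of why the writeout stretches to hours: drain time is essentially the number of dirty metrics in the cache divided by the write throttle. A back-of-the-envelope sketch — every figure here except the ~130M cached data points mentioned above is invented for illustration:

```python
# Hedged sketch: estimating cache writeout time under a write throttle.
# Only cached_points comes from the thread; the other figures are assumptions.
cached_points = 130_000_000        # data points sitting in the cache (from above)
avg_points_per_metric = 60         # assumed: ~1h of 1m-resolution data buffered per metric
max_updates_per_second = 300       # assumed throttle value

# go-carbon flushes roughly one metric (with all its buffered points) per
# update, so the drain rate is metrics per second, not points per second.
dirty_metrics = cached_points / avg_points_per_metric
writeout_seconds = dirty_metrics / max_updates_per_second
print(f"estimated writeout: {writeout_seconds / 3600:.1f} hours")  # → 2.0 hours
```

With these assumed numbers the estimate lands at about 2 hours, consistent with the writeout time reported above.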
Will see how I can fix the issue. `cache-scan` is probably not helpful for this case, but you can still give it another shot.
If storage is a big factor for your cluster, have you tried out the `compressed` feature? There should be some performance gains with it, but you would lose out-of-order updates on the metrics.
@timw6n-thg I was wrong again. Actually, enabling `cache-scan` would help avoid the issue.

But again, I have pushed a fix to make sure that new un-flushed metrics can be reindexed after a disk scan: https://github.com/go-graphite/go-carbon/pull/396 (it also fixes a bug related to `cache-scan`: new metrics weren't being removed from memory after being persisted. Not sure if that's related to the problem you had when testing `cache-scan`.)
That last patch seems to have done the trick. New metrics pulling through correctly straight away. Thank you.
Describe the bug
Quite a similar situation to that reported in https://github.com/go-graphite/go-carbon/issues/372. We have a large cluster (~3 million metrics) and so make heavy use of the cache, with max-updates-per-second throttled down to keep disk activity within sensible limits.
Have recently upgraded to 0.15.5, and at that point enabled realtime-index and concurrent-index. At present the issue we're having is that the names of new metrics are visible immediately (and so can be found by Graphite find queries) but the datapoints cannot be retrieved from the cache (render queries) until the whisper files end up on disk. New datapoints on existing metrics pull through from the cache fine.
Logged error seems to be
Go-carbon Configuration:
Storage schemas
Everything relevant to this on 1m:30d.
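As a side note on what `1m:30d` implies for whisper file size, here is a small sketch of how a Graphite retention string maps to points per metric; the parsing below follows standard Graphite retention notation and is not code from go-carbon:

```python
# Seconds per unit in Graphite retention notation (e.g. "1m:30d").
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse_duration(s: str) -> int:
    """Convert a duration like '1m' or '30d' to seconds."""
    return int(s[:-1]) * UNITS[s[-1]]

def points_for_retention(retention: str) -> int:
    """Number of data points a 'step:ttl' retention keeps per metric."""
    step, ttl = retention.split(":")
    return parse_duration(ttl) // parse_duration(step)

print(points_for_retention("1m:30d"))  # → 43200
```

So each metric on this schema holds 43,200 one-minute points, which across ~3 million metrics explains the heavy reliance on the cache and write throttling described above.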