go-graphite / go-carbon

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister
MIT License

[Q] Disabling scan-frequency for Carbonserver #400

Open loitho opened 3 years ago

loitho commented 3 years ago

Hi there,

Excuse the maybe naive question, but we're using go-carbon and are receiving 650 000 metrics per node per minute across a 4-node cluster (so about 2.5 million metrics per minute). One issue we're facing is that load and IOPS peak tremendously when a scan is triggered to build the index for carbonserver. There are around 3 750 000 metric files per node. I then applied the following configuration for carbonserver:

[carbonserver]
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"

query-cache-enabled = true
query-cache-size-mb = 0
find-cache-enabled = true

trigram-index = false
scan-frequency = "0m0s"
trie-index = true

As you can see, I set scan-frequency to 0, and it's working nicely: my servers no longer get choked for 5 minutes of very high load while trying to read all of the files. (This behaviour was happening even when only the trie index was enabled.)

So I was wondering: is it a problem to run with this configuration? Considering that I still get good performance from my cluster compared to before, is there a reason I should turn this setting back on?

Kind regards,

Thomas

bom-d-van commented 3 years ago

Hi @loitho, if scan-frequency is set to 0, no index is built, and trie-index = true is a no-op.

It's a trade-off in the current system. Without an index, your queries might become slower because carbonserver falls back to filesystem globbing, but things should continue to work.
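To make the trade-off concrete, here is a minimal sketch (my own illustration, not go-carbon's actual code) of what the globbing fallback amounts to: each find query is translated into a filesystem pattern under the whisper data directory and expanded with filepath.Glob, so every request walks directory entries instead of an in-memory index.

package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// findMetrics maps a Graphite-style query such as "servers.*.cpu.user"
// to a glob over whisper files, e.g. "<dataDir>/servers/*/cpu/user.wsp".
// Real carbonserver also matches directories, caches results, etc.;
// this only shows the basic glob-fallback idea.
func findMetrics(dataDir, query string) ([]string, error) {
	rel := strings.ReplaceAll(query, ".", string(filepath.Separator))
	return filepath.Glob(filepath.Join(dataDir, rel+".wsp"))
}

func main() {
	matches, err := findMetrics("/var/lib/graphite/whisper", "servers.*.cpu.user")
	if err != nil {
		fmt.Println("glob error:", err)
		return
	}
	for _, m := range matches {
		fmt.Println(m)
	}
}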

How much memory does your server have? If the memory capacity of the server is big enough, the kernel should be able to cache all the filesystem metadata in memory, and you shouldn't have too many IO issues caused by scanning directories.


Not sure how many people are having issues with filesystem scanning. But now, with concurrent and realtime indexing support in trie-index, we should be able to support indexing without scanning.

loitho commented 3 years ago

Hi @bom-d-van, thank you for your quick reply! I understand, so basically carbonserver behaves like a graphite-web instance and looks at the whisper files directly. Which is honestly still a pretty good thing, as it saves me from having to install graphite-web (nginx + gunicorn) on each of my nodes.

if scan-frequency is set to 0, no index is built, and trie-index = true is a no-op.

Makes sense; I didn't see any "build index time" on the graph, so I assumed as much :)

How much memory does your server have? If the memory capacity of the server is big enough, the kernel should be able to cache all the filesystem metadata in memory and you shouldn't have too many IO issues

Each node has 32 GB of RAM. Is there a way to make sure the kernel keeps the filesystem metadata in cache? Some info on the system: we're running CentOS 7.9 with the standard kernel (3.10), an XFS filesystem with noatime on the mount point, go-carbon 0.15.5, and the memory optimizations suggested in the go-carbon documentation. Just to illustrate what happens when the scan is enabled every 15 minutes, this is the load [screenshot] and this is the IOPS on a 12K IOPS disk [screenshot]. Of course, the more reads it tries to cram in, the fewer writes get through; the fewer writes, the higher the load, the longer it takes, etc.

But now with concurrent and realtime indexing support in trie-index, we should be able to support indexing without scanning.

That would be awesome !

bom-d-van commented 3 years ago

@loitho can you also share the graphs for the memory and disk write metrics? Also, with collectd I think there are merged read/write IOPS as well; can you share those too? Just trying to understand your system resource usage level a bit better.

bom-d-van commented 3 years ago

Each node has 32 GB of RAM, is there a way to make sure the kernel has put the file system metadata in cache ?

I haven't tweaked it myself, but you can google it a bit and find some proper kernel tuning parameters. This one might do the job:

https://unix.stackexchange.com/a/76750/22938

vfs_cache_pressure

Controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.

(@deniszh or @azhiltsov might have better suggestion/knowledge on this area.)

loitho commented 3 years ago

Hi again. Sure, this is the read/write graph: read is green (top), write is yellow (bottom) [screenshot]. I used host 1, but they are all the same.

For the merged IO: [screenshots]. Combined: [screenshot].

I asked the question and started googling as well (should have done it the other way around) and found the same thread. I'm going to poke around with this option and look for more information.

loitho commented 3 years ago

Sorry, I forgot the memory as well: [screenshot]. Interestingly, we see that the cached data has huge variations every time there is a scan.

bom-d-van commented 3 years ago

Hmm, I don't think I understand this memory usage pattern. Lots of memory is freed and then used as cache.

Can you also share the cache.queueWriteoutTime, persister.updateOperations, and persister.committedPoints metrics from graphite? I want to see if there are any connections.

At the same time, you can also try enabling concurrent-index and realtime-index using the config below. With this config, go-carbon only keeps one copy of the index in memory.

scan-frequency = "5m0s"
trie-index = true
concurrent-index = true
realtime-index = 500000

If the above config helps, you can also increase scan-frequency to 30m or more.

Also, it just occurred to me that 32 GB of RAM is big enough to keep all the dentries and inodes in memory. For 650,000 metrics/files, it should just be a few hundred MBs (at most 1 GB). But I'm just speculating.
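(A very rough back-of-envelope, using my own approximate numbers: with something like 200 bytes per cached dentry and roughly 1 KB per in-memory inode, 650,000 files come to about 780 MB of slab cache; even the ~3.75 million files actually on disk would be on the order of 4-5 GB, which still fits comfortably in 32 GB.)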

loitho commented 3 years ago

Thank you for your help. I think it's due to the variation in the number of updates; they drop because of the high number of reads, which squash the number of writes: [screenshot]. When the updates get lower, the number of points per update increases, though the number of committed points gets lower: [screenshot]. And the queue writeout time: [screenshot].

We actually have around 3 750 000 metrics per node, due to a lot of machines being autoscaled, etc. I found more info here too: https://unix.stackexchange.com/questions/30286/can-i-configure-my-linux-system-for-more-aggressive-file-system-caching Thank you for your configuration, I'll try it.

You also have "trigram-index" disabled, correct?

azhiltsov commented 3 years ago

Looking at the IOPS/LA graphs, I conclude you are using a spinning disk or array, not SSDs. Am I right here? Do you run only go-carbon on the box, or is there anything else that could interfere (I do not like the memory free/cached/used graph pattern)? How big is the [cache] max-size? And how much of the memory is allocated by go-carbon itself?

Normally you shouldn't see page caches being evicted, as go-carbon's performance relies heavily on them.

loitho commented 3 years ago

Hi @azhiltsov, thank you for your answer. No, we're actually running GP3 SSD disks from AWS with 16K peak / 12K sustained IOPS. I'm curious, how does the pattern tell you what type of disk we're running? Only go-carbon and buckyd are running on the box; I stopped buckyd to check, and go-carbon is what uses most of the memory :) Max cache size is 10 million. When stopping the carbonserver index, all the graphs get much nicer and flatter, hence why I created this thread. Here is the queue writeout time [screenshot] and the committed points: [screenshot]

PS: I haven't tried the suggestions and configuration above yet, as my day has already ended :)

azhiltsov commented 3 years ago

No, we're actually running GP3 SSD disks from AWS with 16K peak / 12K sustained IOPS. I'm curious, how does the pattern tell you what type of disk we're running?

This is a very common observation of mine from the past (not related to go-carbon): if the disk is saturated, then LA goes up because your cores are waiting for IO. I might be wrong.

I think you are facing two problems:

  1. Throttling from AWS. According to this, you are only allowed up to 16000 IOPS at 16 KiB per IO, and you are probably doing 512 B or 1 KiB IO operations, which get rounded up to the FS block size (see the rough numbers after this list).
  2. Lack of memory to keep your caches around.
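As a rough back-of-envelope (my own numbers, just to illustrate the scale): 650 000 incoming metrics per node per minute is about 10 800 distinct whisper files per second if every point were flushed as soon as it arrived, and each such small write still costs at least one filesystem block of IO plus metadata. Against a ~12 000 sustained IOPS budget, that leaves almost no headroom for reads, which is why go-carbon's cache batches many points per update and why the scan's extra read burst hurts so much.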

Since the index is only needed to speed up queries, it's up to you to decide whether to use it or not. If your queries are fast enough, disable the indexes; this is your first solution.

You need more speed? Get more memory. Start with 64 GB; if that's not enough, bump it further. The whole performance paradigm of go-carbon is built on keeping as much of the disk activity in the page cache as possible, so you need to make sure that your caches stay put in memory and are never evicted. This is your second solution.

Extra performance points: we are running on enterprise-grade SSDs which can go up to 200K IOPS, and we are using 4096-byte blocks on the XFS filesystem (which is the default, but worth checking). These are our mount options (they might cost you extra memory): rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota. We are also running a 4.19 kernel, as it was ~20%? (don't remember) faster, but this shouldn't affect memory consumption, so it's probably irrelevant. Also, we are running a quite old go-carbon 0.14 compiled with go 1.12-something, so it might be that a newer version of either golang or go-carbon does something differently with memory, but I cannot tell.

loitho commented 3 years ago

Hi, it's me again, with some news : @azhiltsov

This is a very common observation of mine from the past (not related to go-carbon): if the disk is saturated, then LA goes up because your cores are waiting for IO. I might be wrong.

Sorry for the dumb question, but what is "LA"? Yeah, basically we see that the limiting factor is IOPS. Fun fact: AWS throttles the instance to 12K IOPS sustained, with 16K peaks guaranteed for half an hour a day.

Now, here is what I tried:

First, I updated to go-carbon 0.15.6 (thank you for the fix and the ARM64 build, it'll serve us well in the future!).

Then I set vm.vfs_cache_pressure = 1 on every even node of our cluster (as opposed to the default of 100), with the following config on the whole cluster:

scan-frequency = "30m0s"
trie-index = true
concurrent-index = true
realtime-index = 500000

Our machines are m5.2xlarge on AWS with GP3 disks (16K IOPS / 200 MBps). Here is the result on file scan time; note that node 1 and node 2 are doing the exact same thing (receiving the same metrics, etc.): [screenshot] We see that the initial crawl for data is slower, but that once it's done, the queries are faster (nearly 2x).

Queue writeout time doesn't have any spikes, pretty good! [screenshot]

Let's check the load: [screenshot] Okay, we see that with vfs_cache_pressure set to 1, the cluster seems to have smaller load spikes on each scan (after a huge load for the initial scan), but the average load is a tad higher. What about IOPS? [screenshot] It seems the higher load is due to more reads. We see the read spike on the machine with the default configuration, but once the spike is gone, there are nearly zero reads.

Interestingly, the memory graph shows that we are indeed storing more of the folder and file tree information in memory: [screenshot]

Updates per second also greatly improve: because there isn't an IO spike anymore, the updates per second don't drop: [screenshot]

So, everything is perfect?

Well ... not really. First of all, after 24 hours of runtime, some of the nodes started having huge load and reading a lot, for seemingly no reason (maybe the kernel removed the index from memory, I don't know).

I think that if you have a lot of memory, disks with more sustained IOPS, and probably a better kernel than the 3.10 on our machines, it might make sense for you to try the setting. As one would say, "100% of winners have tried their luck!"

I then reconfigured everything back to default and tried running the index scan just once every 6 hours: [screenshot]

Looks pretty good! And it suits my disks better, as they're meant to have a 30-minute burst period every 24 hours.

Conclusion :

First of all, thank you all again for your help. vfs_cache_pressure is an interesting setting to play with, definitely check it out.

I had a final question: since realtime-index updates the index, well ... in real time, is there any point in regularly running the scan? Is the scan only there to pick up files that are deleted from the disk? (If that's true, then I could push the scan interval even higher, as my cluster is cleaned up only once a day.)

Kind regards,

bom-d-van commented 3 years ago

It's a nice and detailed report. So most of our reasoning appears to be correct.

after 24 hours of runtime, for some reason, some of the nodes started having huge load and reading a lot, for seemingly no reason

Does that coincide with the clean-up on the clusters?

I had a final question: since realtime-index updates the index, well ... in real time, is there any point in regularly running the scan? Is the scan only there to pick up files that are deleted from the disk? (If that's true, then I could push the scan interval even higher, as my cluster is cleaned up only once a day.)

Yes, it's for deletions. Eventually we can add a delete API in go-carbon; together with realtime-index and concurrent-index, we could then stop disk scanning completely.
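In the meantime, with realtime-index picking up new metrics, the periodic scan only needs to run often enough to purge deleted files, so something along these lines should work (just an illustration of the idea; pick an interval that matches how often files are actually removed):

scan-frequency = "6h0m0s"
trie-index = true
concurrent-index = true
realtime-index = 500000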

All the new logic that we introduced is incremental and slowly evolving, so it might seem odd looking at the implementation now. go-carbon started without in-memory indexing, then it got trigram-index, then trie-index, and now concurrent-index and realtime-index.


One last tip, since you prefer to reduce the disk IO caused by indexing: you might also want to try this feature out. file-list-cache caches the disk scan result at the specified filepath. This means that after a restart, go-carbon doesn't re-scan the whole disk immediately, but instead tries to read the whole file list from the cached file.

# Cache file list scan data in the specified path. This option speeds
# up index building after reboot by reading the last scan result in file
# system instead of scanning the whole data dir, which could take up
# most of the indexing time if it contains a high number of metrics (10
# - 40 millions). go-carbon only reads the cached file list once after
# reboot and the cache result is updated after every scan. (EXPERIMENTAL)
file-list-cache = ""
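For example (the path below is just a hypothetical placeholder; any writable location works):

file-list-cache = "/var/lib/go-carbon/file-list-cache"
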
loitho commented 3 years ago

Yeah, you were pretty much spot on! (Not that I doubted it, but if anyone stumbles on this thread, they'll have some good information backed by graphs :) )

Does that coincide with the clean-up on the clusters?

Sadly no, it didn't; that's why I found it so odd. Nothing very interesting in the logs either.

I think the implementation makes sense; I'm just trying to understand it fully, as well as its limitations :)

You might also want to try this feature out. file-list-cache

Ah yes, I read about it and immediately started using it when I changed the cluster configuration to the one you proposed. It works flawlessly, it's really awesome!

So it might seem odd looking the implementation now

It makes sense from an evolutionary standpoint, as you gradually add functionality. But I think a bit more precision in the documentation would be interesting, like explaining the interactions between "old and new" features, for example the fact that when enabling realtime-index you can actually raise the scan-frequency interval, because its only purpose then is to purge deleted files from the index.

Would you mind if I made a PR to add this information to the documentation?

bom-d-van commented 3 years ago

Would you mind if I made a PR to add this information to the documentation?

Yep, it's a good idea. Thanks in advance! :D