Open mhjacks opened 2 years ago
I'm currently experimenting with significantly dialing down the save interval (save 3600 1 only), but I'm not sure of how well that will work with the semantics of how pmproxy/pmlogger work.
Changing the save interval or Redis persistence settings doesn't affect pmproxy. The Redis save interval is a tradeoff between I/O load and durability - the higher the interval, the more metric values will be lost in case of a crash of the Redis daemon or host. If you don't need historical values at all in case Redis restarts or crashes, you can also disable Redis persistence completely, removing any I/O load from the Redis database (save "" in the Redis configuration).
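To make that tradeoff concrete, here is a minimal Python sketch of how `save <seconds> <changes>` rules decide when a snapshot is written. It is a simplified model rather than Redis's actual code: the names `snapshot_due` and `seconds_until_snapshot` are made up for illustration, a steady write rate is assumed, and the default rule set shown is the one shipped with Redis 5 and 6.

```python
from typing import List, Optional, Tuple

# Each rule mirrors one "save <seconds> <changes>" line in redis.conf.
SaveRule = Tuple[int, int]

# Rule set shipped by default with Redis 5 and 6.
DEFAULT_RULES: List[SaveRule] = [(900, 1), (300, 10), (60, 10000)]
# The single rule being tried in this issue.
HOURLY_RULES: List[SaveRule] = [(3600, 1)]


def snapshot_due(rules: List[SaveRule], seconds_since_save: int,
                 changes_since_save: int) -> bool:
    """True if any rule has both its time and its change threshold met."""
    return any(
        changes_since_save >= changes and seconds_since_save > seconds
        for seconds, changes in rules
    )


def seconds_until_snapshot(rules: List[SaveRule], writes_per_second: float,
                           horizon: int = 7200) -> Optional[int]:
    """Seconds until the first snapshot under a steady write rate.

    Returns None if no rule fires within `horizon` seconds. The result is
    also the worst-case window of writes lost if Redis crashes just before
    the snapshot.
    """
    for elapsed in range(1, horizon + 1):
        changes = int(writes_per_second * elapsed)
        if snapshot_due(rules, elapsed, changes):
            return elapsed
    return None


if __name__ == "__main__":
    # A busy collector (hypothetical 1000 writes/s) snapshots about once a
    # minute with the defaults, but only once an hour with "save 3600 1".
    print(seconds_until_snapshot(DEFAULT_RULES, 1000))  # 61
    print(seconds_until_snapshot(HOURLY_RULES, 1000))   # 3601
    # An empty rule list models save "": no snapshot is ever triggered.
    print(seconds_until_snapshot([], 1000))             # None
```

Under this model a longer interval cuts snapshot I/O by roughly the same factor that it widens the window of values lost on a crash.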
In our first revision of the scaling guidelines we focused mainly on stability and CPU & memory usage, because I/O load wasn't a bottleneck in our test environment. We definitely should test different Redis persistence settings (configuring AOF (Append Only File) for Redis persistence sounds promising) in a future update of the scaling guidelines.
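As a rough illustration of the persistence variants mentioned so far, the sketch below renders a redis.conf fragment for each one. The helper `persistence_config` and its mode names are hypothetical; the directives it emits (`save`, `appendonly`, `appendfsync`) are standard Redis settings. Turning off RDB snapshots in AOF mode and using `appendfsync everysec` are assumptions made for the example, not tested recommendations.

```python
from typing import Iterable, Tuple


def persistence_config(mode: str,
                       save_rules: Iterable[Tuple[int, int]] = ((3600, 1),)) -> str:
    """Return a redis.conf fragment for one persistence mode.

    mode: "rdb"  -> periodic snapshots using save_rules
          "aof"  -> append-only file, snapshots disabled (assumed choice)
          "none" -> no persistence at all
    """
    if mode == "rdb":
        lines = [f"save {seconds} {changes}" for seconds, changes in save_rules]
        lines.append("appendonly no")
    elif mode == "aof":
        # fsync once per second: at most about one second of writes is lost.
        lines = ['save ""', "appendonly yes", "appendfsync everysec"]
    elif mode == "none":
        lines = ['save ""', "appendonly no"]
    else:
        raise ValueError(f"unknown persistence mode: {mode!r}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for mode in ("rdb", "aof", "none"):
        print(f"# {mode}")
        print(persistence_config(mode))
```

Note that none of these settings changes how much memory the dataset itself needs; they only affect how and when it is written to disk.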
Understood with respect to pmproxy. I did try using AOF for Redis, so that the system would not be constrained by I/O (some history is good, after reboots and such). I still ran out of memory at 32 GB for my 20-server fleet.
Metrics storage is clearly a balancing act - the more data you need to preserve, the more complex it becomes to store that data (and limiting the amount of data that can be visualized to what fits in the collector's memory is clearly suboptimal - though admittedly there aren't any better options currently within the confines of RHEL, at least). The on-disk storage model for PCP seems remarkably good - what we need is a reasonable time-series data store. There are good ones that PCP already supports (Elasticsearch and InfluxDB), but it sure would be nice to have something that is not, at minimum, open-core. (And also good to have something that Grafana natively supports, to make it straightforward to ship example graphs for visualization.)
I have a group of 16 servers that I'm monitoring using ansible-pcp. I've added a few PMDAs and left the other settings (sampling interval and retention period) at their defaults. My metrics collection system was overwhelmed trying to keep up with the metrics reported, and I am theorizing that the Redis save interval was responsible for the high rates of I/O my system reported.
The metrics collection host is now a 4 vCPU, 16 GB RAM VM with an 80 GB disk. (That seems sufficient based on the sizing guidance in https://pcp.readthedocs.io/en/latest/HowTos/scaling/index.html.) On a previous iteration set up using this collection (on Fedora 35), the rdb file in /var/lib/redis was being written almost continuously and kept growing. Redis would get killed by systemd-oom beyond a certain point, depending on the amount of RAM I configured the VM with.
Below is a sample graph from the second setup I tried, this time running on CentOS 8-stream (and Redis 5). It uses the same default save settings and shows increasing amounts of disk I/O:
This represents the first 4 hours of reporting from a newly set up collector.
I'm currently experimenting with significantly dialing down the save interval (save 3600 1 only), but I'm not sure of how well that will work with the semantics of how pmproxy/pmlogger work.
I'm willing to PR some changes and take input on them while working through this - including if I'm off base. (I am new to the PCP toolset, and to Redis.) Thanks for your time and attention!