Open portante opened 2 years ago
Why are PCP processes logging to disk in separate files? Why isn't it using syslog logging to land in journald?
Why are PCP processes logging to disk in separate files? Why isn't it using syslog logging to land in journald?
We've an open ticket for that: https://github.com/performancecopilot/pcp/issues/1040 but it didn't get much traction so far.
Redis reply error: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
Did Redis run out of disk space while snapshotting, could you check the Redis logs?
We used the redis
role to configure it. I does not appear to have any files on disk open, but has created a /var/lib/redis/dump.rdb
file.
The errors in the redis logs say:
4216:M 30 Jan 2022 13:00:36.034 * 10000 changes in 60 seconds. Saving...
4216:M 30 Jan 2022 13:00:36.633 * Background saving started by pid 1839362
1839362:C 30 Jan 2022 13:26:34.699 * DB saved on disk
1839362:C 30 Jan 2022 13:26:35.440 * RDB: 528 MB of memory used by copy-on-write
4216:M 30 Jan 2022 13:26:36.430 * Background saving terminated with success
4216:M 30 Jan 2022 13:27:37.050 * 10000 changes in 60 seconds. Saving...
4216:M 30 Jan 2022 13:27:37.681 * Background saving started by pid 2271458
2271458:C 30 Jan 2022 13:53:44.483 # Write error saving DB on disk: No space left on device
4216:M 30 Jan 2022 13:54:07.907 # Background saving error
But the only thing using substantial disk space is the pmproxy.log
file.
After expanding the volume, the redis service is back to normal.
Redis by default is snapshotting its entire database periodically on disk. When you restart Redis, it will load the contents of dump.rdb
into memory (see https://redis.io/topics/persistence for more info). There's another related issue open regarding Redis persistence: https://github.com/performancecopilot/ansible-pcp/issues/22 - we'll need to look into tuning the Redis persistence settings.
But the only thing using substantial disk space is the pmproxy.log file.
Looks like Redis stops all write requests if it can't snapshot the database to the disk successfully. Makes sense from a durability point of view, otherwise you'll loose data when Redis gets restarted.
In PCP we log all errors coming from Redis. Not sure what's the best solution in this case - keep track of logged errors and only log each Redis error type once? We have this implemented for Redis reconnect (to not flood the pmproxy log with "trying to connect to redis..." messages), but I'm not 100% convinced that it makes sense to apply this for every single error type. What do you think @natoscott?
This is why I asked about PCP logging to files. If it logged to journald either vis syslog or stdout/stderr from systemd process creation, then the disk would not have filled up.
This is why I asked about PCP logging to files. If it logged to journald either vis syslog or stdout/stderr from systemd process creation, then the disk would not have filled up.
Ah, because journald automatically manages log rotation. Yep, I'm also very much in favor of supporting journald.
afaics we have the following options
a) drop the /usr/libexec/pcp/lib/pmproxy
shell script, move this logic into the pmproxy binary and start the pmproxy binary directly, logging to stdout
b) call /usr/libexec/pcp/lib/pmproxy
in ExecStartPre=
and start the pmproxy binary in ExecStart=
c) or change the internal log routines to use (optionally) journald
I think option a) is the cleanest but the most effort, alternatively option b) is also fine imho. What we need to keep in mind is that PCP is also supported on non-systemd platforms.
@andreasgerstmayr I believe this is one of the areas @goodwinos is planning on tackling for PCP v6.
Another option (e.g. non-systemd and/or pre v6) would be to install logrotate(8) files. I don't think we can reliably attempt to throttle logging of all possible Redis issues in pmproxy, and logrotate is a generic solution we could apply to other PCP daemon logs as well.
See this P99 talk on the problem with logging. Until we properly solved that, for system administrators, journald
helps the most, as logrotate
based solutions can't react fast enough in all cases.
Just to be sure, here is a series of grep
commands I used to filter the entire contents of the 26 GB pmproxy.log
file before removing it due to size issues:
[root@intlabproxy-002 pmproxy]# grep -vF "not able to persist on disk" pmproxy.log | grep -vF "status mismatch at time" | grep -vF "HMSET:" | grep -vF "SADD pcp:" | grep -vF "HSET: string mapping script" | grep -vF "SADD: mapping" | grep -vF "SADD: map series" | grep -vF "SADD map metric" | grep -vF "GEOADD: mapping"
Log for pmproxy on intlabproxy-002 started Sun Jan 30 00:26:01 2022
pmproxy: PID = 2388513, PDU version = 2, user = pcp (991)
pmproxy request port(s):
sts fd port family address
=== ==== ===== ====== =======
ok 10 unix /run/pcp/pmproxy.socket
ok 11 44322 inet INADDR_ANY
ok 12 44322 ipv6 INADDR_ANY
ok 13 44323 inet INADDR_ANY
ok 14 44323 ipv6 INADDR_ANY
[Sun Jan 30 00:26:01] pmproxy(2388513) Info: OpenSSL 1.1.1g FIPS 21 Apr 2020 - no certificates found
[Sun Jan 30 00:26:01] pmproxy(2388513) Info: Redis slots, command keys, schema version setup
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: MISCONF
We have this implemented for Redis reconnect (to not flood the pmproxy log with "trying to connect to redis..." messages), but I'm not 100% convinced that it makes sense to apply this for every single error type. What do you think @natoscott?
It's not a bad idea IMO. I can see no gain from continually re-logging the same error repeatedly for continually failing operations - culling or rate limiting to perhaps once per day makes more sense to me.
To summarize - it looks like all actions to take here are in PCP itself. Is there any default redis.conf change we should make in the role, or shall I close this one out and we can continue discussions in the PCP issue trackers? Thanks.
Afaics there is no change to redis.conf
required here, the log (spam) is produced by PCP.
shall I close this one out and we can continue discussions in the PCP issue trackers
+1, I'll move this ticket over to the pcp repo.
26GB pmproxy.log file on disk.
Filled with lines like: