After using ansible `redis` roles, pmproxy.log filled our log volume

portante commented 2 years ago

26GB pmproxy.log file on disk.

Filled with lines like:

[Sun Jan 30 16:35:33] pmproxy(2388513) Error: SADD pcp:series:label.X.value:Y
Redis reply error: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

portante commented 2 years ago

Why are PCP processes logging to disk in separate files? Why isn't it using syslog logging to land in journald?

andreasgerstmayr commented 2 years ago

Why are PCP processes logging to disk in separate files? Why isn't it using syslog logging to land in journald?

We've an open ticket for that: https://github.com/performancecopilot/pcp/issues/1040 but it didn't get much traction so far.

Redis reply error: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

Did Redis run out of disk space while snapshotting, could you check the Redis logs?

portante commented 2 years ago

We used the redis role to configure it. I does not appear to have any files on disk open, but has created a /var/lib/redis/dump.rdb file.

The errors in the redis logs say:

4216:M 30 Jan 2022 13:00:36.034 * 10000 changes in 60 seconds. Saving...
4216:M 30 Jan 2022 13:00:36.633 * Background saving started by pid 1839362
1839362:C 30 Jan 2022 13:26:34.699 * DB saved on disk
1839362:C 30 Jan 2022 13:26:35.440 * RDB: 528 MB of memory used by copy-on-write
4216:M 30 Jan 2022 13:26:36.430 * Background saving terminated with success
4216:M 30 Jan 2022 13:27:37.050 * 10000 changes in 60 seconds. Saving...
4216:M 30 Jan 2022 13:27:37.681 * Background saving started by pid 2271458
2271458:C 30 Jan 2022 13:53:44.483 # Write error saving DB on disk: No space left on device
4216:M 30 Jan 2022 13:54:07.907 # Background saving error

But the only thing using substantial disk space is the pmproxy.log file.

After expanding the volume, the redis service is back to normal.

andreasgerstmayr commented 2 years ago

Redis by default is snapshotting its entire database periodically on disk. When you restart Redis, it will load the contents of dump.rdb into memory (see https://redis.io/topics/persistence for more info). There's another related issue open regarding Redis persistence: https://github.com/performancecopilot/ansible-pcp/issues/22 - we'll need to look into tuning the Redis persistence settings.

But the only thing using substantial disk space is the pmproxy.log file.

Looks like Redis stops all write requests if it can't snapshot the database to the disk successfully. Makes sense from a durability point of view, otherwise you'll loose data when Redis gets restarted.

In PCP we log all errors coming from Redis. Not sure what's the best solution in this case - keep track of logged errors and only log each Redis error type once? We have this implemented for Redis reconnect (to not flood the pmproxy log with "trying to connect to redis..." messages), but I'm not 100% convinced that it makes sense to apply this for every single error type. What do you think @natoscott?

portante commented 2 years ago

This is why I asked about PCP logging to files. If it logged to journald either vis syslog or stdout/stderr from systemd process creation, then the disk would not have filled up.

andreasgerstmayr commented 2 years ago

This is why I asked about PCP logging to files. If it logged to journald either vis syslog or stdout/stderr from systemd process creation, then the disk would not have filled up.

Ah, because journald automatically manages log rotation. Yep, I'm also very much in favor of supporting journald.

afaics we have the following options a) drop the /usr/libexec/pcp/lib/pmproxy shell script, move this logic into the pmproxy binary and start the pmproxy binary directly, logging to stdout b) call /usr/libexec/pcp/lib/pmproxy in ExecStartPre= and start the pmproxy binary in ExecStart= c) or change the internal log routines to use (optionally) journald

I think option a) is the cleanest but the most effort, alternatively option b) is also fine imho. What we need to keep in mind is that PCP is also supported on non-systemd platforms.

natoscott commented 2 years ago

@andreasgerstmayr I believe this is one of the areas @goodwinos is planning on tackling for PCP v6.

Another option (e.g. non-systemd and/or pre v6) would be to install logrotate(8) files. I don't think we can reliably attempt to throttle logging of all possible Redis issues in pmproxy, and logrotate is a generic solution we could apply to other PCP daemon logs as well.

portante commented 2 years ago

See this P99 talk on the problem with logging. Until we properly solved that, for system administrators, journald helps the most, as logrotate based solutions can't react fast enough in all cases.

portante commented 2 years ago

Just to be sure, here is a series of grep commands I used to filter the entire contents of the 26 GB pmproxy.log file before removing it due to size issues:

[root@intlabproxy-002 pmproxy]# grep -vF "not able to persist on disk" pmproxy.log | grep -vF "status mismatch at time" | grep -vF "HMSET:" | grep -vF "SADD pcp:" | grep -vF "HSET: string mapping script" | grep -vF "SADD: mapping" | grep -vF "SADD: map series" | grep -vF "SADD map metric" | grep -vF "GEOADD: mapping"
Log for pmproxy on intlabproxy-002 started Sun Jan 30 00:26:01 2022

pmproxy: PID = 2388513, PDU version = 2, user = pcp (991)
pmproxy request port(s):
  sts fd   port  family address
  === ==== ===== ====== =======
  ok    10       unix   /run/pcp/pmproxy.socket
  ok    11 44322 inet   INADDR_ANY
  ok    12 44322 ipv6   INADDR_ANY
  ok    13 44323 inet   INADDR_ANY
  ok    14 44323 ipv6   INADDR_ANY
[Sun Jan 30 00:26:01] pmproxy(2388513) Info: OpenSSL 1.1.1g FIPS  21 Apr 2020 - no certificates found
[Sun Jan 30 00:26:01] pmproxy(2388513) Info: Redis slots, command keys, schema version setup
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: ERR wrong number of arguments for XADD
Redis reply error: MISCONF

natoscott commented 2 years ago

We have this implemented for Redis reconnect (to not flood the pmproxy log with "trying to connect to redis..." messages), but I'm not 100% convinced that it makes sense to apply this for every single error type. What do you think @natoscott?

It's not a bad idea IMO. I can see no gain from continually re-logging the same error repeatedly for continually failing operations - culling or rate limiting to perhaps once per day makes more sense to me.

natoscott commented 2 years ago

To summarize - it looks like all actions to take here are in PCP itself. Is there any default redis.conf change we should make in the role, or shall I close this one out and we can continue discussions in the PCP issue trackers? Thanks.

andreasgerstmayr commented 2 years ago

Afaics there is no change to redis.conf required here, the log (spam) is produced by PCP.

shall I close this one out and we can continue discussions in the PCP issue trackers

+1, I'll move this ticket over to the pcp repo.

performancecopilot / pcp

After using ansible `redis` roles, pmproxy.log filled our log volume #1536