redpanda: all sub-systems can improve logging

dotnwat commented 3 years ago

A lot of our logging is quite difficult to interpret. This ranges from messages with too much context to messages with too little, as well as messages that are really only meaningful in the context of the code. Other problems we've seen are exception handlers that provide little to no context in their error messages. Another issue is that we do not always choose good log levels, and this might result in spamming the log or in the other extreme, needing to drop to trace level to get something important.

I think it would be useful to (1) have all devs make a pass on logging in their particular subsystem and (2) establish a pipeline for easily getting examples of bad logging into tickets so they can be fixed.

JIRA Link: CORE-678

emaxerrno commented 3 years ago

@dswang can help w/ creating tix for everyone

dotnwat commented 3 years ago

here is one recent example:

ERROR 2021-07-12 07:54:52,461 [shard 1] rpc - Error dispatching client reads: std::__1::system_error (error system:110, read: Connection timed out)
INFO  2021-07-12 07:40:33,528 [shard 0] rpc - transport.cc:208 - Request timeout, correlation id: 1091

Problem: There is nothing wrong with this log message per se, but it is effectively content free without (1) knowledge of code paths -- it is a very low level place in the stack and (2) surrounding logging context.

Solution: (1) logging for rpc errors should when possible be done at a higher level to provide context. this may mean we need to extend rpc to allow us to add context for better logging (2) these very generic errors can be logged at trace level when we are reliably logging them with more context elsewhere.

@mmaslankaprv thoughts on how to improve this for tech debt work on improving logging?

dotnwat commented 3 years ago

INFO  2021-07-21 03:02:53,292 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1122 - Recovered persistent state from kvstore: voted for: {id: {1}, revision: {0}}, term: 196
INFO  2021-07-21 03:02:53,302 [shard 0] storage - segment.cc:629 - Creating new segment /var/lib/redpanda/data/redpanda/kvstore/0_0/7457-0-v1.log
INFO  2021-07-21 03:02:53,314 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:150 - Stopping
INFO  2021-07-21 03:02:53,315 [shard 3] kvstore - kvstore.cc:105 - Stopping kvstore: dir /var/lib/redpanda/data/redpanda/kvstore/3_0
INFO  2021-07-21 03:02:53,315 [shard 1] kvstore - kvstore.cc:105 - Stopping kvstore: dir /var/lib/redpanda/data/redpanda/kvstore/1_0
INFO  2021-07-21 03:02:53,315 [shard 2] kvstore - kvstore.cc:105 - Stopping kvstore: dir /var/lib/redpanda/data/redpanda/kvstore/2_0
INFO  2021-07-21 03:02:53,315 [shard 0] kvstore - kvstore.cc:105 - Stopping kvstore: dir /var/lib/redpanda/data/redpanda/kvstore/0_0
INFO  2021-07-21 03:02:53,316 [shard 0] redpanda::main - application.cc:158 - Failure during startup: std::invalid_argument (Unable to add configuration at offset 0 as it already exists)

(1) There are several code paths that might lead to the exception that is thrown. Either more logging is required at log level info for such an error case, or a stack trace is needed. (2) there is no context at all for the state of the controller when the error occurs. So even though this is probably happening in startup, there is not enough information to work back through how the state might have been reached. Especially at start up, we have no reason at all to not log everything in great detail since it only happens once.

dotnwat commented 6 months ago

This ticket isn't well defined enough to be actionable. When it was written the system was small enough that it was feasible.

redpanda-data / redpanda

redpanda: all sub-systems can improve logging #1795