Supportability: improve own logs

tigrannajaryan commented 3 years ago

The Collector's own logs are an important source of information for troubleshooting. In some cases own logs, available locally are the only available information for troubleshooting. Other source, such as own metrics require the Collector to be correctly configured to scrape itself, send the metrics to the backend and for the backend to be available. Even zPages, which are exposed locally by the Collector may not be available if for example the Collector crashes. In such cases logs are the only useful source of troubleshooting information.

In order to increase the value of Collector's log I suggest to make a few improvements:

[x] Make logs more human-readable locally: switch from "json" to "console" encoding even when "release" logging configuration is selected (but keep Production zap Logger configuration for "release" mode). This is still fairly well-defined format that can be unambiguously parse on the backend if needed https://github.com/open-telemetry/opentelemetry-collector/issues/2106
[ ] Periodically output values of Collector's own metrics in the log: https://github.com/open-telemetry/opentelemetry-collector/issues/2098
[ ] Introduce a rate-limited logger that all components can use to log failures during operation. Exporter helper already does this for most exporters that use queued_retry. We will need similar capabilities for receivers, processors and extensions.
[ ] Ensure all core exporters correctly report errors (and that such errors are visible in the logs) when destination is unavailable and when destination is available but responds with an error that is likely a result of misconfiguration (e.g. HTTP 404 due to an incorrect destination endpoint in the config). Additional ideas that may be worth doing:
[ ] When data is not received at the backend it is not clear where it is lost. Ensure all pipeline components clearly expose the counters about the data they see, dump the counters to logs periodically (e.g. every minute).
[ ] When receivers see zero data for a while since startup (e.g. for 1 minute) log a warning (unless it is a kind of receiver where it is normal to not see any data for a long time).

Note: we need to be careful to not flood the logs.

jpkrohling commented 3 years ago

Can we please discuss this before continuing? I'm not sure the first item in the list is a good move, especially without a heads-up to current users of the collector.

I think moving from structured JSON logs to "mix of tab-separated and JSON" logs is a step back. People who want to have human-readable logs should be able to use the dev profile, whereas the vast majority of people could still use the JSON features.

Before this change, I could simply pipe the console output to jq and extract fields I'm interested in. With this change, I have to use awk and hope that I get the correct fields from the logs.

At the very least, this new (breaking) change should be behind a flag so that people can control which behavior they want.

Before #2106/#2109, where how logs looked like in the --log-profile=prod:

{"level":"info","ts":1605170062.2534976,"caller":"builder/receivers_builder.go:75","msg":"Receiver started.","component_kind":"receiver","component_type":"otlp","component_name":"otlp"}

And here's how it looks now:

2020-11-12T09:34:31.205+0100    info    builder/receivers_builder.go:75 Receiver started.   {"component_kind": "receiver", "component_type": "otlp", "component_name": "otlp"}

tigrannajaryan commented 3 years ago

@jpkrohling Yes, let's discuss.

I understand what you say, however I disagree.

Collector is a unique piece of infrastructure. It needs to be observable and debuggable even when the observability system is broken. This means we cannot rely only on traditional approaches to observe the Collector. One of the important areas where I think we have to be different from the traditional practices is readability of logs. Typically for every other service using JSON is highly desirable since it is machine readable and can be collected in the logging system precisely where it is searchable and queryable and where most people will be looking at the logs. With the Collector we cannot rely on that. Instead it is much more important for logs to be easily consumable by humans who only have a console available to them. Piping JSON logs to jq does not result in sufficiently readable logs, they are still very difficult to read.

Tab-delimited logs are vastly more readable when all you have is the console.

At the very least, this new (breaking) change should be behind a flag so that people can control which behavior they want.

I agree. I think non-JSON logs should be the default and we can have a command line option to output JSON logs.

We can also add this change to the CHANELOG to bring more visibility to it.

tigrannajaryan commented 3 years ago

@jpkrohling see https://github.com/open-telemetry/opentelemetry-collector/pull/2177

jpkrohling commented 3 years ago

Thanks for addressing my concerns, @tigrannajaryan!

Collector is a unique piece of infrastructure. It needs to be observable and debuggable even when the observability system is broken.

Agree, but I think it ends up depending on how we see/deal with the collector in production. If we have only a few instances (pet), we are likely to look at logs for the individual instances. However, for highly elastic scenarios (cattle), we'd rather have the logs being sent elsewhere and post-processed. In which case, having it easy to be parsed by machines is preferable.

Based on the same train of thought, #2098 might probably not bring many benefits.

pkositsyn commented 3 years ago

My issue is closed as a duplicate, so I'd like to comment it here. I have a question about this point

Introduce a rate-limited logger that all components can use to log failures during operation. Exporter helper already does this for most exporters that use queued_retry. We will need similar capabilities for receivers, processors and extensions.

I think there is a simple solution to add a global system logs rate limit with a corresponding flag.

Why do I think it's actually enough for any purposes:

This is a somewhat hard limit for logs. You know that more logs than this limit won't give any information or more logs means overload of logs delivery pipeline or just a performance degradation due to excessive communication with console.
This limit is something you should never reach. Reaching means some critical error in code (usually logging every request meeting some condition). Again, this is just safety of the whole pipeline.

Even if you don't like the solution that I offer, this is a critical thing we cannot limit the total number of logs. Some implemented components do logging on every request (maybe error request, but anyway) and it's very hard to get rid of all those places

jpkrohling commented 3 years ago

Some implemented components do logging on every request (maybe error request, but anyway)

I agree with your proposal as a whole, and I think our logger (zap) does support rate limiting. In any case, it would be good to have bug reports against those components. Logging on error is desirable, but we might be able to optimize hot paths...

pkositsyn commented 3 years ago

Yes, there is actually a possibility to tune logger. It's a pity I cannot do it from the box. Need to do manipulations with code. Is there a position on not exposing more flags?

jpkrohling commented 3 years ago

Is there a position on not exposing more flags?

Not that I'm aware of. Each new flag comes with the need to document and maintain it, but nothing will prevent us from adding one if they are justifiable.

NickLarsenNZ commented 1 year ago

As for formatting, I think logfmt should be an option alongside JSON (and perhaps use the ZAP log structure). I find it far more readable raw, but still parsable (eg: within Grafana).

open-telemetry / opentelemetry-collector

Supportability: improve own logs #2102