vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Standardization/structuring of error/warning log #16285

Open mans2singh opened 1 year ago

mans2singh commented 1 year ago

Use Cases

Hi:

I am working with Vector running in Kubernetes and need to filter logs for errors. However, I did not see any consistent pattern for extracting the error from the logs.

I think a structured error message or consistent pattern would help applications programmatically extract the relevant reason/error.

Here are a few examples:

  1. Splunk hec sink ssl error: 2023-02-03T14:51:45.181367Z WARN sink{component_kind="sink" component_id=alerts_sink component_type=splunk_hec_logs component_name=alerts_sink}:request{request_id=1}:http: vector::internal_events::http_client: HTTP error. error=error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1921:: self signed certificate in certificate chain error_type="request_failed" stage="processing" internal_log_rate_limit=true

  2. Http misconfigured url warning (two consecutive log lines): 2023-02-05T11:10:01.814103Z WARN sink{component_kind="sink" component_id=alerts_sink component_type=http component_name=alerts_sink}:request{request_id=0}:http: vector::internal_events::http_client: HTTP error. error=error trying to connect: dns error: failed to lookup address information: Name or service not known error_type="request_failed" stage="processing" internal_log_rate_limit=true

     2023-02-05T11:10:01.814154Z WARN sink{component_kind="sink" component_id=alerts_sink component_type=http component_name=alerts_sink}:request{request_id=0}: vector::sinks::util::retries: Retrying after error. error=Failed to make HTTP(S) request: error trying to connect: dns error: failed to lookup address information: Name or service not known internal_log_rate_limit=true

  3. Http redirect error: 2023-02-03T16:04:14.416107Z ERROR sink{component_kind="sink" component_id=alerts_sink component_type=http component_name=alerts_sink}:request{request_id=0}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 301 Moved Permanently" internal_log_rate_limit=true

If there is a pattern or structure to extract the error reason programmatically that I have missed, please let me know.

Thanks

Attempted Solutions

I tried finding a common pattern for errors but did not see any.

Proposal

The error messages could have a consistent pattern or fields that can be used to programmatically extract the error reason/code, etc. Also, errors should be logged at the appropriate level. In the example above, the SSL error for the Splunk HEC sink is logged at WARN level even though the message indicates an error:

2023-02-03T14:51:45.181367Z WARN sink{component_kind="sink" component_id=alerts_sink component_type=splunk_hec_logs component_name=alerts_sink}:request{request_id=1}:http: vector::internal_events::http_client: HTTP error. error=error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1921:: self signed certificate in certificate chain error_type="request_failed" stage="processing" internal_log_rate_limit=true

References

No response

Version

timberio/vector:0.25.1-debian

jszwedko commented 1 year ago

Thanks for filing this @mans2singh! This is something we have discussed internally at times but there wasn't a tracking issue.

There is some standardization as you can see in the logs with error_type, reason, and stage, but we could do more.

spencergilbert commented 1 year ago

Using JSON formatting for the logs should also make them more structured/parseable 👍
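To illustrate the point: with JSON-formatted logs, each line is a single object and the fields shown in the examples above (error_type, stage, ...) come out without any regex. The sample line below is hypothetical and assumes a flat layout; the exact field names Vector emits in JSON mode may differ:

```python
import json

# Hypothetical JSON-formatted log line with a flat layout (illustrative only;
# real field names/nesting in Vector's JSON output may differ).
line = (
    '{"timestamp":"2023-02-03T14:51:45.181367Z","level":"WARN",'
    '"message":"HTTP error.","error_type":"request_failed","stage":"processing"}'
)

record = json.loads(line)

# Filtering errors programmatically becomes a plain field lookup.
if record["level"] in ("WARN", "ERROR") and record.get("error_type"):
    print(record["error_type"], record["stage"])  # request_failed processing
```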

namm2 commented 1 year ago

Hi @jszwedko, could this issue be added to a roadmap? I also think the error message should be a JSON string so it's easier to parse, categorize, and analyze.