kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Change the `default_logging.yml` and `logging.yml` to have more sensible defaults #3687

Open noklam opened 4 months ago

noklam commented 4 months ago

Split out from #3591

Context

I did a demo a while ago showing how frustrating it is to change the logging level. Together with #3446, this ticket will make customising logging easier for our users.

Problem

https://github.com/kedro-org/kedro/blob/da709d4316c141c5a7d6f676a87a5752807b33f4/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/conf/logging.yml

There are many `level: INFO` settings in the template, and one may expect that changing a single one of them is enough to see more verbose logging. In reality you need to change multiple `INFO` entries to `DEBUG` before any `DEBUG`-level messages show up. So we basically provide a knob that doesn't change anything (technically it does, but it's most likely not what our users need; advanced users can figure out how to do advanced filtering themselves).
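For illustration, the shipped template has roughly this shape (a simplified sketch, not the exact file): the handler, the `kedro` logger and the root logger each carry their own level, and a record has to clear both its logger's level and the handler's level before it is printed, so enabling `DEBUG` output means editing more than one place:

```yaml
# Simplified sketch of the shape of the current template (not the exact file).
# A DEBUG record from Kedro is only printed if both the `kedro` logger level
# and the console handler level are lowered to DEBUG.
version: 1

handlers:
  console:
    class: logging.StreamHandler
    level: INFO          # knob 1

loggers:
  kedro:
    level: INFO          # knob 2

root:
  level: INFO            # knob 3 (governs loggers without an explicit level)
  handlers: [console]
```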

Proposal

https://github.com/kedro-org/kedro/blob/da709d4316c141c5a7d6f676a87a5752807b33f4/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/conf/logging.yml#L11-L16

1. Remove line 14, which is unnecessary and makes it harder to use `logging.yml`.

https://github.com/kedro-org/kedro/issues/3446#issuecomment-1979711477

+1 on setting the default level of the Kedro logger to INFO (if just for backwards compatibility), and then having `-q` set it to WARNING and `-qq` to ERROR. I don't think the current logging.yml logic is the only way to achieve that, though. I'm also ambivalent on whether we should change the global logging level to INFO.

-1 on keeping the current logging.yml logic - whoever wants fine-grained control of logs, file logging, rotation etc. should be using journald, supervisor, Datadog, or whatever other solution; this is not Kedro's responsibility.

@astrojuanlu

2. How to customise Kedro's or other packages' logging levels. Use case: as a plugin developer, I want to see my plugin's logging in a Kedro project during `kedro run`.

If we do 1., this will basically mean adding an additional logger in the `loggers` section, but there is also the question of how plugins can do this easily, or whether it should be done at the package level. This can actually be solved by #3591; the advanced setup will remain the same, namely adding new loggers or configuring this with package-level logging.
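For reference, with the current template that plugin use case would look roughly like this (a hedged sketch; `my_kedro_plugin` is a made-up package name, not a real plugin):

```yaml
# Hypothetical addition to the `loggers` section of conf/logging.yml so that a
# plugin's own messages become visible during `kedro run`.
loggers:
  kedro:
    level: INFO
  my_kedro_plugin:       # made-up plugin package / logger name
    level: DEBUG         # records still have to pass the handlers' own levels
```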

I don't have a better solution than the current one yet. Here are things that we know:

noklam commented 4 months ago

-1 on keeping the current logging.yml logic - whoever wants fine-grained control of logs, file logging, rotation etc. should be using journald, supervisor, Datadog, or whatever other solution; this is not Kedro's responsibility.

@astrojuanlu While I agree it's probably not something Kedro should do, it does help the developer experience; otherwise we would need some kind of progress bar, since that is roughly what Kedro's INFO logs provide. Plus, I don't see a big problem with keeping logging.yml - is there any major benefit to moving away from it? Changing logging.yml is easier, and we can do it in a non-breaking way in 0.19.x.
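To make "more sensible default" concrete, one possible shape for a slimmer template would leave a single knob to turn (an illustration only, not an agreed design):

```yaml
# Illustration of a "single knob" logging.yml: verbosity is controlled only by
# root.level; the handler no longer filters by level itself.
version: 1

handlers:
  console:
    class: logging.StreamHandler

root:
  level: INFO            # the one setting a user flips to DEBUG or WARNING
  handlers: [console]
```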

astrojuanlu commented 3 weeks ago

Adding some color to my earlier statements on OpenTelemetry, logging etc:

OpenTelemetry seems to be quite mature for traces (as pioneered by OpenTracing) and metrics (Prometheus, the former OpenCensus), but not so much for logs. In fact, the client APIs for logging in Python are still in development and seemingly unstable:

[screenshot: OpenTelemetry status table showing the logs signal as "Development"]

While signals are in development, breaking changes and performance issues MAY occur. Components SHOULD NOT be expected to be feature-complete. In some cases, the signal in Development MAY be discarded and removed entirely. Long-term dependencies SHOULD NOT be taken against signals in Development.

In fact, there seem to be some inconsistencies still.

It looks like good practice nowadays involves having a log collector (Promtail, Fluentd, Logstash, Grafana Alloy, formerly Grafana Agent) that then sends logs to a backend service (Loki, Elasticsearch).

The dream of having apps just log JSON to stdout is actually spelled out in the structlog docs:

Colorful and pretty printed log messages are nice during development when you locally run your code.

However, in production you should emit structured output (like JSON) which is a lot easier to parse by log aggregators.

A simple but powerful approach is to log to unbuffered standard out and let other tools take care of the rest.

That can be your terminal window while developing; it can be systemd redirecting your log entries to syslogd and rotating them using logrotate; or it can be your cluster manager forwarding them to an obscenely expensive log aggregator service.
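For what it's worth, that advice can also be followed while staying on the stdlib `logging.yml` route; a hedged sketch, assuming the third-party `python-json-logger` package (which Kedro does not ship):

```yaml
# Hedged sketch: emit one JSON object per log record on stdout and let external
# tooling handle collection and rotation. Assumes python-json-logger is installed.
version: 1

formatters:
  json:
    class: pythonjsonlogger.jsonlogger.JsonFormatter   # third-party, not Kedro

handlers:
  console:
    class: logging.StreamHandler
    stream: ext://sys.stdout
    formatter: json

root:
  level: INFO
  handlers: [console]
```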

So I still think that we shouldn't take too heavy-handed an approach to logging, but I now have more context on how this is actually achieved in practice, and what to expect from the current ecosystem.