openzim / _python-bootstrap

Sample openZIM Python project bootstrap

Define a logging convention #30

Open benoit74 opened 7 months ago

benoit74 commented 7 months ago

We have not yet formalized a policy around logging, and I think this is clearly missing. A kind of convention is proposed in https://github.com/openzim/python-scraperlib/blob/main/src/zimscraperlib/logging.py, but I do not find it very versatile, and it forces quite a lot of choices (silencing verbose dependencies, setting the logger level to DEBUG, ...). It also seems to be rarely used in practice (I have rarely encountered it, at least).

I've read https://docs.python.org/3/howto/logging.html and it contains tons of good ideas and recommendations; this should probably be our first decision: adhere to the recommendations in that document.

That documentation still leaves some topics open, for which I propose decisions or suggest discussions below. WDYT? Anything else?

Live logging configuration

In order to be as versatile as possible, the software we produce must not make too many decisions about what needs to be logged and what is useless.

Software must hence allow logging to be configured either:

- via an environment variable named LOGGING_LEVEL whose value is a standard log level name (DEBUG, INFO, ...), or
- via an environment variable named LOGGING_FILE_CONFIG whose value is the path to a configuration file in the logging.config.fileConfig format.

It should also support configuration via an environment variable named LOGGING_YAML_CONFIG whose value is the path to a YAML configuration adhering to the configuration dictionary schema.

While LOGGING_YAML_CONFIG is the preferred solution (the most versatile and least verbose), it is optional because it requires PyYAML to parse the configuration file (and we might not want to include this dependency in all projects).
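For illustration, a minimal YAML configuration following the configuration dictionary schema could look like this (the mysoftware logger name is a placeholder):

```yaml
version: 1
disable_existing_loggers: false
formatters:
  simple:
    format: "%(asctime)s %(levelname)s %(name)s - %(message)s"
handlers:
  console:
    class: logging.StreamHandler
    formatter: simple
loggers:
  mysoftware:  # placeholder for the software's top-level logger
    level: DEBUG
root:
  level: INFO
  handlers: [console]
```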

If multiple LOGGING_* environment variables are set, the precedence is LOGGING_YAML_CONFIG > LOGGING_FILE_CONFIG > LOGGING_LEVEL.

We do not recommend continuing to support older environment variables for backward compatibility; if they are kept, they must have even lower precedence than the LOGGING_* ones.

Software must provide a sensible default configuration, typically with a default dict_config embedded in the code.
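To make this concrete, here is a minimal sketch of what such a bootstrap could look like (the setup_logging helper name and the default configuration are illustrative, not an agreed API):

```python
import logging.config
import os

# illustrative defaults; a real project would adapt formatters/handlers
DEFAULT_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {"format": "%(asctime)s %(levelname)s %(name)s - %(message)s"}
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "simple"}
    },
    "root": {"level": "INFO", "handlers": ["console"]},
}


def setup_logging():
    """Configure logging from LOGGING_* variables, by decreasing precedence."""
    if path := os.getenv("LOGGING_YAML_CONFIG"):
        import yaml  # optional dependency (PyYAML)

        with open(path) as fh:
            logging.config.dictConfig(yaml.safe_load(fh))
    elif path := os.getenv("LOGGING_FILE_CONFIG"):
        logging.config.fileConfig(path, disable_existing_loggers=False)
    else:
        config = DEFAULT_CONFIG
        if level := os.getenv("LOGGING_LEVEL"):
            config["root"]["level"] = level
        logging.config.dictConfig(config)
```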

Loggers

Each software and library must use a unique logger with a unique and easily identifiable name, such as the __name__ of their top-level package or module. They must not log directly to the root logger.

This is meant to help the software / library user configure the logging verbosity or handlers as they wish while maintaining simplicity. Using a unique logger per software and library is deemed sufficient, and it avoids, by default, logging configurations that rely too heavily on module names. Should a specific need arise (e.g. filtering out one kind of very verbose log), it is always possible to achieve this with custom filters based on pathname or log content, for instance. Experience shows we are usually too lazy to do this anyway, and long logger names in fact just clog the logs.
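As a sketch, the convention could look like this (the mysoftware package name is a placeholder):

```python
# src/mysoftware/__init__.py — one logger for the whole package
import logging

logger = logging.getLogger(__name__)  # named "mysoftware", not the root logger

# elsewhere, e.g. in src/mysoftware/somemodule.py, submodules reuse it:
# from mysoftware import logger
# logger.debug("fetching %s", url)
```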

rgaudin commented 7 months ago

but I do not find it very versatile, and it forces quite a lot of choices (silencing verbose dependencies, setting the logger level to DEBUG, ...). It also seems to be rarely used in practice (I have rarely encountered it, at least).

It's used in most scrapers, as can be seen on GitHub, and it doesn't prevent anything. It's a set of scraper-tailored defaults, which I think is the right approach.

Most of our projects have very simple logging needs:

I don't see any realistic, current use case for an external log config file, which would be a very intrusive change. Please convince me otherwise 🤓

benoit74 commented 7 months ago

I really never noticed this was used in so many places, sorry about that. And I'm now sure it is not used outside scrapers (while logging is very important, AFAIK, for zimfarm, cms, metrics, ...).

The scraper-tailored defaults are meaningful and useful anyway.

The limitation I find in this approach is that it forces us to have only one log level for all logs. And it disables the verbose logs of some deps (which are indeed too verbose by default)

Should we need more detailed logs for only some deps, or more details from one verbose dep, it is not possible to achieve this without modifying the source code, and it is not even straightforward. I'm not sure how I would do it, for instance (copy/paste the shared code into my scraper, make the needed modifications, and replace the import?).

Maybe this need is only very occasional and I shouldn't mind, but I'm not comfortable yet.

I would probably be more comfortable if at least these same shared defaults were embedded in a dictionary, so that they could be more easily modified.

benoit74 commented 7 months ago

It is used in CMS, sorry again

rgaudin commented 7 months ago

I really never noticed this was used in so many places, sorry about that. And I'm now sure it is not used outside scrapers (while logging is very important, AFAIK, for zimfarm, cms, metrics, ...).

Well, as its name suggests, it's a scraper lib 😉

The limitation I find in this approach is that it forces us to have only one log level for all logs.

I don't understand this. It surely defines a single handler, because that's our common use case, but you do have access to the logger, so you can add additional handlers with different log levels.

It sounds like a special case anyway: you're not going to ask your user to pass multiple log levels, and following your proposal, different log levels would be for third parties anyway.
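For instance, something like this (a sketch; "myscraper" stands for whatever name the existing logger was created with):

```python
import logging

logger = logging.getLogger("myscraper")  # the existing scraper logger
logger.setLevel(logging.DEBUG)  # let DEBUG records reach the handlers

# extra handler writing everything to a file, while the existing
# console handler keeps its own (e.g. INFO) level
debug_handler = logging.FileHandler("debug.log")
debug_handler.setLevel(logging.DEBUG)
logger.addHandler(debug_handler)
```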

And it disables the verbose logs of some deps (which are indeed too verbose by default)

No, if your project (unlike most) needs debug logs for urllib3, you can still bring them back:

logging.getLogger("urllib3").setLevel(logging.DEBUG)

Should we need more detailed logs for only some deps, or more details from one verbose dep, it is not possible to achieve this without modifying the source code

This can be phrased as “I want the user to manage logging completely”, which is fine but should be done knowingly.

Maybe this need is only very occasional

It does sound like it.

I would probably be more comfortable if at least these same shared defaults were embedded in a dictionary, so that they could be more easily modified.

From experience, I find it less flexible, less dynamic, and harder to maintain: one typo and your logging behaves unexpectedly. You often end up touching it from code anyway, and code has the benefit of linting.

That said, we'd benefit from written conventions, at least for non-scraper projects.

I believe conventions should be driven by usage and experience, so try your approach in metrics; once it has matured, or when we start another non-scraper project and need guidance, we can derive a non-scraper convention from it.

scraperlib can be updated as well, for scraper needs, but there I believe we need less flexibility.

benoit74 commented 7 months ago

You convinced me that the current approach is sufficient AND that we're missing documentation around typical usage ^^

I just did not know how to achieve what I wanted with the current approach. I will experiment (and ask if there is something I cannot achieve) and write something about all this in the wiki.