kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks source link

[DataCatalog]: Disable `catalog.load()` logging #3918

Closed ElenaKhaustova closed 5 days ago

ElenaKhaustova commented 1 month ago

Description

catalog.load() produces extensive logging information when loading multiple datasets in the same notebook cell. These logger messages annoy users and cannot be disabled with the current implementation.

Context

"It’s very convenient to load my dataset when debugging locally. The only annoying thing is the multiple logger.info messages I get when I load multiple datasets in the same notebook cell."

Possible Implementation

  1. Add a boolean argument to catalog.load() to disable logging prints for cleaner outputs.
  2. Consider configuring this at logger level and suggest an approach.
  3. Explore if the amount of logging information during the catalog.load() can be reduced.
astrojuanlu commented 3 weeks ago

Relaying internal discussion:

@astrojuanlu: Rather than a boolean argument, could this be configured at the logger level instead? @ElenaKhaustova: I think the problem is that users want to silent this particular method rather that disable logging at all.

merelcht commented 3 weeks ago

I’m not sure we should have a custom approach for dealing with logging for dataset loading compared to other log messages.

ElenaKhaustova commented 1 week ago

The amount of logging information during the catalog.load() does not look extensive and should not be changed IMO. However, I can imagine cases with tens of datasets where this INFO output will pollute a cell.

Screenshot 2024-06-25 at 13 54 35

I also agree that adding a custom flag just for one method - catalog.load() is not the best solution.

Based on the above we can consider the following:

  1. Suggest using a custom logging config and setting a desired logging level. The drawback is that it changes global logger settings.
  2. Suggest changing a logger level at runtime and then resetting it to ensure it doesn't affect other modules. This might look a little bit complicated from the user's perspective.
    
    import logging

Change logging level

logger_level = logging.getLogger("kedro").level logging.getLogger("kedro").setLevel(logging.WARNING)

for ds in catalog.list(): _ = catalog.load(ds)

Reset the level

logging.getLogger("kedro").setLevel(logger_level)


3. Make the `_logger` property public so that users can modify it and set the desired logging level just for the catalog component rather than globally. In that case they would not need to reset it.
https://github.com/kedro-org/kedro/blob/413dbca46a9cb8e29fc1761564c4e13e2efe4d8e/kedro/io/data_catalog.py#L201
astrojuanlu commented 1 week ago

https://github.com/kedro-org/kedro/blob/413dbca46a9cb8e29fc1761564c4e13e2efe4d8e/kedro/io/data_catalog.py#L201-L203

This is a pattern I've never seen. And yet, can't the user do

logging.getLogger("kedro.io.data_catalog").setLevel(logging.WARNING)

? (Haven't tested this, unsure if it works)

ElenaKhaustova commented 1 week ago

@astrojuanlu

Yes, that works and doesn't affect the top-level loggers! So I believe that should be the solution and no changes are required as all of them lead to the same amount of effort from the user side.

Closing the issue.

datajoely commented 1 week ago

So I still think we should make this easier for the user:

ElenaKhaustova commented 1 week ago

So I still think we should make this easier for the user:

  • Docs at the very least
  • logging.yml setting?

Good point, I will extend the section on how to customise logging then. But I would not change the default logging.yml as having INFO level seems reasonable at the start.

merelcht commented 5 days ago

Completed in #3968