kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Investigate performance of config loading for big projects #3893

Open astrojuanlu opened 1 month ago

astrojuanlu commented 1 month ago

Description

Earlier this week a user reached out to me in private saying that it was taking 3 minutes for Kedro to load their configuration (KedroContext._get_catalog).

Today another user mentioned that "Looking at the logs, it gets stuck at the kedro.config module for more than 50% of the pipeline run duration, but we do have a lot of inputs and outputs"

I still don't have specific reproducers, but I'm noticing enough qualitative evidence to open an issue about it.

datajoely commented 1 month ago

I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.

kedro profile {kedro command} -> .html / .bin
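No such command exists today; as a minimal sketch of what such a wrapper could do, here is a standard-library `cProfile` version (a real integration, or a flamegraph backend like memray, would replace the reporting step; `profile_command` is a hypothetical helper name):

```python
import cProfile
import io
import pstats


def profile_command(func, *args, **kwargs):
    """Run a callable under cProfile and return (result, stats report).

    Hypothetical sketch: a real `kedro profile` would wrap the CLI
    entry point instead of an arbitrary callable.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Report the top 10 functions by cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Stand-in for an actual Kedro command invocation.
result, report = profile_command(sum, range(1000))
```

Dumping the raw profiler data instead of a text report would allow converting it to the suggested `.html` / `.bin` outputs with existing flamegraph tooling.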

yury-fedotov commented 1 month ago

I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.

kedro profile {kedro command} -> .html / .bin

@datajoely flamegraph for the entire pipeline run (how much time each node takes) or just the config resolution / pipeline initialization?

datajoely commented 1 month ago

In my mind, it would run the whole command as normal, but also generate the profiling data.

Perhaps if we were to take this seriously, a full-on memray integration would be incredible.

astrojuanlu commented 1 month ago

Continuing the discussion on creating custom commands here https://github.com/kedro-org/kedro/discussions/3908

astrojuanlu commented 1 day ago

Many users have been complaining about the slowness of Kedro with big projects, and that can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and if the files become too big (a couple of MB), it presents as a significant slowdown.
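To illustrate the expansion (a simplified sketch, not Kedro's actual implementation): each key of a nested parameters file can surface as its own `params:`-prefixed entry, so the number of dataset-like entries grows with every nesting level:

```python
def flatten_params(params, prefix="params"):
    """Expand a nested parameters dict into dotted 'params:' keys,
    mimicking how each key can become its own catalog entry.

    Hypothetical helper for illustration only.
    """
    entries = {}
    for key, value in params.items():
        # Top level uses 'params:key'; deeper levels use dotted paths.
        name = f"{prefix}:{key}" if prefix == "params" else f"{prefix}.{key}"
        if isinstance(value, dict):
            entries.update(flatten_params(value, name))
        entries[name] = value
    return entries


params = {"model": {"alpha": 0.1, "layers": {"n": 3}}, "seed": 42}
flat = flatten_params(params)
# A 2-key file with nesting already yields 5 entries; a parameters file
# of a couple of MB can easily yield hundreds.
```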

Originally posted by @idanov in https://github.com/kedro-org/kedro/pull/3732#issue-2202743592

The solution works, but it couples the DataCatalog with OmegaConf, and the PR is still under review.

From the discussion in the PR:

Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?

There were a few thumbs up to the idea, and it was brought up again in https://github.com/kedro-org/kedro/pull/3973 (@datajoely please do confirm that this is what you had in mind 😄)

@merelcht pointed out that there's a pending research item on how users use parameters and for what #2240

@ElenaKhaustova agreed that this is relevant in the context of the ongoing DataCatalog API redesign #3934.

Ideally, if there's a way we can tackle this issue without blocking it on #2240, the time to look at it would be now. But I have very little visibility on what the implications are, or whether we would actually solve the performance problem at all. So, leaving the decision to the team.

merelcht commented 1 day ago

The solution works, but couples the DataCatalog with OmegaConf

Would you really call this coupling? The way I read it is that it uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it used the actual OmegaConfigLoader class, but this just imports the library.

astrojuanlu commented 1 day ago

Sorry to keep moving the conversation, but I'd rather not discuss the specifics of a particular solution outside the corresponding PR. I've addressed your question in context at https://github.com/kedro-org/kedro/pull/3732#discussion_r1662727430