Open astrojuanlu opened 1 month ago
I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.
kedro profile {kedro command} -> .html / .bin
@datajoely flamegraph for the entire pipeline run (how much time each node takes) or just the config resolution / pipeline initialization?
In my mind, it would run the whole command as normal, but also generate the profiling data.
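To make the "run the command as normal, but also generate the profiling data" idea concrete, here is a minimal sketch using only the standard library's cProfile. It is not a proposed Kedro implementation and it does not produce a flamegraph: run_with_profile and fake_command are hypothetical names, and fake_command is just a stand-in for whatever a real Kedro command would execute.

```python
import cProfile
import io
import pstats


def run_with_profile(func, *args, stats_path=None, top=10, **kwargs):
    """Run func normally while collecting cProfile data.

    Returns the function's own result plus a text report of the
    most expensive calls, sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    # runcall executes func under the profiler and returns its result,
    # so the caller still gets the normal output of the command.
    result = profiler.runcall(func, *args, **kwargs)

    if stats_path is not None:
        # Binary stats file that external viewers can load later.
        profiler.dump_stats(stats_path)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def fake_command(n):
    # Hypothetical stand-in for the work done by a Kedro command.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = run_with_profile(fake_command, 100_000)
    print(value)
    print(report)
```

Passing stats_path writes the raw profile to disk, which is roughly the ".bin" half of the idea; turning that into an HTML flamegraph would need a separate tool.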
Perhaps, if we were to take this seriously, a full-on memray integration would be incredible.
Continuing the discussion on creating custom commands here https://github.com/kedro-org/kedro/discussions/3908
Many users have been complaining about the slowness of Kedro with big projects, which can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and if the files become too big (a couple of MB), it shows up as a significant slowdown.
Originally posted by @idanov in https://github.com/kedro-org/kedro/pull/3732#issue-2202743592
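To illustrate the expansion described in the quote above, here is a rough sketch of how a nested parameters dictionary can turn into one entry per key path. It assumes the params:<dotted.path> naming convention and is not Kedro's actual code; flatten_params is a hypothetical helper written only for this example.

```python
def flatten_params(params, prefix="params:"):
    """Expand a nested dict into one entry per key path.

    Every intermediate node and every leaf gets its own entry,
    which is why large parameter files produce many entries.
    """
    entries = {}

    def _walk(node, path):
        for key, value in node.items():
            full_path = f"{path}.{key}" if path else key
            entries[f"{prefix}{full_path}"] = value
            if isinstance(value, dict):
                # Recurse so nested keys are registered as well.
                _walk(value, full_path)

    _walk(params, "")
    return entries


# Hypothetical parameters, as they might look after loading a YAML file.
example = {
    "model": {"lr": 0.01, "layers": {"hidden": 64, "depth": 3}},
    "seed": 42,
}
entries = flatten_params(example)

if __name__ == "__main__":
    for name in sorted(entries):
        print(name)
```

Even this tiny example yields more entries than it has leaf values, because each intermediate dictionary is registered too. The count therefore grows with both the size and the depth of the parameter files.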
The solution works, but couples the DataCatalog with OmegaConf. It is still under review.
From the discussion in the PR:
Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?
There were a few thumbs up to the idea, and it was brought up again in https://github.com/kedro-org/kedro/pull/3973 (@datajoely please do confirm that this is what you had in mind 😄)
@merelcht pointed out that there's a pending research item on how users use parameters and for what (#2240).
@ElenaKhaustova agreed that this is relevant in the context of the ongoing DataCatalog API redesign (#3934).
Ideally, if there's a way we can tackle this issue without blocking it on #2240, the time to look at it would be now. But I have very little visibility into what the implications are, or whether we would actually solve the performance problem at all. So, I'm leaving the decision to the team.
The solution works, but couples the DataCatalog with OmegaConf
Would you really call this coupling? The way I read it is that it uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it used the actual OmegaConfigLoader class, but this just imports the library.
Sorry to keep moving the conversation, but I'd rather not discuss the specifics of a particular solution outside the corresponding PR. I addressed your question in context at https://github.com/kedro-org/kedro/pull/3732#discussion_r1662727430
Description
Earlier this week a user reached out to me in private saying that it was taking 3 minutes for Kedro to load their configuration (KedroContext._get_catalog).
Today another user mentioned that "Looking at the logs, it gets stuck at the kedro.config.module for more than 50% of the pipeline run duration, but we do have a lot of inputs and outputs"
I still don't have specific reproducers, but I'm noticing enough qualitative evidence to open an issue about it.