kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.82k stars 895 forks

Extract insights from configuration user interviews #1847

Closed merelcht closed 1 year ago

merelcht commented 1 year ago

Introduction

We conducted interviews with eight users about configuration. The aim of the sessions was to understand how users experience configuration in Kedro, specifically AbstractConfigLoader, ConfigLoader, and/or TemplatedConfigLoader, and any customisation they've had to make to those classes. We wanted to understand why users customise parts of configuration and what the experience of creating such custom implementations is like.

Analysis methodology

The interviews were recorded, uploaded to Dovetail and automatically transcribed. I went through all interviews and tagged them to highlight which topics and themes were mentioned. Once I finished tagging, I looked for common themes and turned those into insights. The insights are summarised below, with sample quotes. The full quotes and interview data are saved on Dovetail.

Insights from Configuration user interviews

  1. Users like TemplatedConfigLoader (4/8)

    The users we interviewed were very positive about TemplatedConfigLoader. The majority of them have used it. It also appears that most custom config loaders are built on top of TemplatedConfigLoader rather than on top of ConfigLoader. Users who use TemplatedConfigLoader actively said it’s pretty much complete. The only thing that was mentioned as missing is chaining of globals.

    User quotes: "I really like using the TemplatedConfigLoader because it allows you to basically have a set of defaults config that you want to use and then just change small parts of it via a templating".

    "No one wants to go through every file and do a find and replace. That should be a nicer way to do it. And TemplatedConfigLoader is the default nice way to do it."
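To make the templating idea in insight 1 concrete, here is a minimal sketch of placeholder substitution from a set of global values. It is not Kedro's implementation: the `render` function, the `${name}` syntax as handled here, and all values are assumptions chosen for illustration.

```python
import re
from typing import Any

# Assumed placeholder syntax for this sketch: ${name}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(config: Any, global_values: dict[str, Any]) -> Any:
    """Recursively replace ${name} placeholders with values from global_values."""
    if isinstance(config, dict):
        return {key: render(value, global_values) for key, value in config.items()}
    if isinstance(config, list):
        return [render(item, global_values) for item in config]
    if isinstance(config, str):
        whole = _PLACEHOLDER.fullmatch(config)
        if whole:
            # The whole string is a single placeholder: keep the value's type.
            return global_values[whole.group(1)]
        return _PLACEHOLDER.sub(lambda m: str(global_values[m.group(1)]), config)
    return config


# Illustrative values only; not taken from any interviewed project.
global_values = {"base_path": "s3://example-bucket/data", "country": "uk"}
catalog = {
    "sales": {
        "type": "pandas.CSVDataSet",
        "filepath": "${base_path}/${country}/sales.csv",
    }
}

rendered = render(catalog, global_values)
print(rendered["sales"]["filepath"])  # s3://example-bucket/data/uk/sales.csv
```

Changing `base_path` in one place updates every entry that references it, which is the "no find and replace" benefit the quotes describe.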

  2. Users use TemplatedConfigLoader by default (2/8)

    Of the people interviewed, most were using TemplatedConfigLoader and two even assumed it was the default (or should be).

    User quotes: "In fact, until a few weeks ago, I didn't know I was using the TemplatedConfigLoader. I thought this was just the ConfigLoader. "

    "[The] default [on projects] really should be TemplatedConfigLoader just because a modifying base paths is terrible"

  3. Users who care about reproducibility don't like using extra parameters (4/8)

    About half of the people interviewed use extra_params and half don’t. Several users said they don’t like using it because it’s hard to reproduce the results of a run that had extra parameters. Others said this method is better suited to experimentation than to production.

    User quotes: On why he doesn't use extra parameters: "Because [of] reproducibility, you can’t basically commit the stuff you have in the command line arguments, you commit them if they’re in a config file."

    "So I just find it more, more error prone if you're typing stuff in the CLI yeah. It's better to, you have commands that you've checked and you know, works"

  4. Users inject runtime parameters in other ways than using extra parameters (4/8)

    Users need to provide parameters (and other configuration) at runtime. Some use extra_params for this, but others use different mechanisms, such as a Makefile or globals.

    User quotes: "if I had to do that [provide runtime parameters], I would say put it into a make file"

    "We were also using globals to pass credentials. So things we wanted to pass credentials at run time"
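The tension in insights 3 and 4 (runtime overrides are convenient but hard to reproduce) can be eased by recording the effective parameters of each run. The sketch below assumes a hypothetical `key.path=value` override syntax and hypothetical helper names; it does not describe how Kedro handles extra parameters.

```python
import copy
import json
from typing import Any


def deep_merge(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which overrides take precedence over base, recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def parse_overrides(pairs: list[str]) -> dict[str, Any]:
    """Turn ["model.alpha=0.5"] into {"model": {"alpha": 0.5}}."""
    result: dict[str, Any] = {}
    for pair in pairs:
        dotted_key, raw_value = pair.split("=", 1)
        try:
            value = json.loads(raw_value)  # numbers, booleans, null, quoted strings
        except json.JSONDecodeError:
            value = raw_value  # fall back to the raw string
        node = result
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Parameters as they would come from a committed config file (made-up values).
file_params = {"model": {"alpha": 0.1, "max_iter": 100}, "seed": 42}
runtime = parse_overrides(["model.alpha=0.5", "tag=experiment-a"])
effective = deep_merge(file_params, runtime)

# Saving this string next to the run outputs records what was actually used.
snapshot = json.dumps(effective, indent=2, sort_keys=True)
print(snapshot)
```

The committed file stays the default, and the snapshot captures whatever was typed on the command line, so a run can be repeated later from the snapshot alone.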

  5. Users do not like Jinja, but it's necessary sometimes (3/8)

    People use Jinja2 even though they don’t like it very much. They mainly use it to manage very large amounts of config, and in scenarios with “multi-country multi-brand pipelines”.

    User quotes: "To be honest, the only reason I use Jinja is because some of our, some of our engineers in vertical InsureX use it. If you ask me, I will never take Jinja, it's too extensive. Like you should not have such powerful engine templating engine inside configuration"

    "I don't like Jinja so much about it, but you don't have any other solution for having like this multi-country multi brand kind of pipelines so far. "

    "We know that Jinja2 isn't pretty, but it does solve a problem that currently there is no solution for, but if we can find a nicer way, then it's a way to move on."

  6. It is hard to reason about templated configuration (4/8)

    When using templating in configuration, users find it hard to verify what the values will actually be at run time. The config loaders in Kedro do rendering and loading in one step. Hydra offers a solution that compiles all config so you can check what values are used at runtime.

    User quotes: "if you are using templates, it creates additional layer of complexity. [...] for example, you are a CST and you receive from InsureX some kind of like pipeline and [...] you receive a lot of like Jinja loops, Jinja if else conditions, some imports, some weird stuff. And instead of just having at least basic overview of what the landscape of configuration for this pipeline, you need to go actually check, [...] what actually will be filled in the system [...] to understand like what will be resolved by your system."

    It would be helpful: "being able to print off the actual file path [...] So because we're constructing these dynamically and you have the environment at play as well. Sometimes when you're syncing files to a remote instance, you miss out something and then you run it and it's not the file path you thinking it is. But being able to log the actual file path as you run, like when you load or when you write that will be helpful."
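One way to approach the inspection problem in insight 6 is to flatten an already rendered config into dotted keys and flag any value that still contains a placeholder. This is a sketch only: `flatten`, `unresolved`, and the `${...}` pattern are assumptions for illustration, not features of Kedro or Hydra.

```python
import re
from typing import Any, Iterator

_PLACEHOLDER = re.compile(r"\$\{[^}]+\}")  # assumed placeholder syntax


def flatten(config: Any, prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (dotted_key, value) pairs for every leaf of a nested config."""
    if isinstance(config, dict):
        for key, value in config.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(config, list):
        for index, item in enumerate(config):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        yield prefix, config


def unresolved(config: Any) -> list[str]:
    """Return the dotted keys whose string values still contain a placeholder."""
    return [
        key
        for key, value in flatten(config)
        if isinstance(value, str) and _PLACEHOLDER.search(value)
    ]


# A made-up catalog in which one entry was not rendered completely.
resolved_catalog = {
    "sales": {"filepath": "data/uk/sales.csv", "load_args": {"sep": ","}},
    "stores": {"filepath": "data/${country}/stores.csv"},
}

for key, value in flatten(resolved_catalog):
    print(f"{key} = {value!r}")

print(unresolved(resolved_catalog))  # ['stores.filepath']
```

Printing or logging the flattened view before a run gives the "actual file path" overview the second quote asks for.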

  7. Users struggle to scale configuration (5/8)

    By "scaling" I mean two things here:

    1. Having lots of config
    2. Having the same, but slightly different, config for multiple settings: e.g. multi-country, multi-brand, multi-model

    User quotes: "how do we manage configuration across so many brands and countries, or just generically across so many pipelines?"

    "They had multiple options models for, for one pipeline around the study. And his biggest problem was also to maintain it correctly and to test it"

    "So the practical problem is that the parameters that the user changes are a small fraction of all the stuff that can be parameterized in the code and then the config files explode. And then it's so difficult to know what to change"

  8. Globals are regarded as "source of truth" (2/8)

    User quotes: "There is a project where you could put in like constant as well. So if I have different brands, I could put them into global so that I can reference them so that I don't have typos in my filepaths in my, in my entries."

    "You may have a risk of people are doing additional parameters files here. [...] I basically say, no, [...] Let’s use these globals files to that, we only have one source of truth."

  9. Users have config in other places than the default config directory (4/8)

    In Kedro we assume that configuration is inside a config directory in the project. However, several users mentioned they have configuration in other places as well and have therefore had to customise Kedro or create workarounds to load this configuration.

    User quotes: "We were doing config loading on my project outside of Kedro. [...] we had a set up where we had a Kedro project for parts of the code repo, and then some that was [...] an application. And so all the application logic is inside another module, which is not inside the Kedro project structure."

    " [...] just overriding where the configuration started from using Kedro-Glass like, not just from the conf folder at the root of the project, but inside the source folder also. "

  10. Users struggle to pass configuration when using kedro package

    This insight actually came out of the Databricks user research, but is directly related to configuration.

    User quotes: "the biggest problem with whl-based approach is our packaging-based approaches. We have to move our configuration to a different folder and getting it right was a bit tricky [...]"

    "I think packaging experience could definitely be improved a bit. And this is also one of the questions, like some other data scientists, not, not on the project asked, “Why is conf/ out of the packaging?”"

  11. Users store credentials in environment variables (3/8)

    The majority of the users who make use of environment variables in Kedro projects use them to store credentials. For some users that is actually the only way they pass credentials. Most users said they don’t use credentials.yml.

    User quotes: "putting everything in your environment, variable and making that available, this is super useful for, for credentials and so on just to have it by default and we kind of paired this. So you need a way to access our environment variables"

  12. Some users want to use environment variables for configuration (4/8)

    Some users use environment variables for things other than credentials.

    User quotes: When asked why they customised configuration: "[To allow] Being able to use environment variables"
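As an illustration of insights 11 and 12, the sketch below collects prefixed environment variables into a plain dict that could then be used as credentials or other configuration. The `env_config` helper and the `MYPROJECT_` prefix are hypothetical and not part of Kedro.

```python
import os
from typing import Mapping


def env_config(prefix: str, environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect variables that start with prefix, keyed by the lowercased remainder."""
    return {
        name[len(prefix):].lower(): value
        for name, value in environ.items()
        if name.startswith(prefix)
    }


# A stand-in environment so the example does not depend on the real one.
fake_environ = {
    "MYPROJECT_DB_USER": "analyst",
    "MYPROJECT_DB_PASSWORD": "not-a-real-secret",
    "PATH": "/usr/bin",
}

credentials = env_config("MYPROJECT_", fake_environ)
print(sorted(credentials))  # ['db_password', 'db_user']
```

Using a prefix keeps unrelated variables such as `PATH` out of the configuration and makes it obvious which variables a project depends on.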

  13. Some users don't use environment variables in Kedro at all (4/8)

    User quotes: "I don't prefer reading those kinds of variables in my Kedro pipeline. Firstly, like whatever variables, global variables I want to use, I will put it in my global.yaml file [...] So I haven't come across the use case where I would do like a, something from like a dollar bash, a us thing or something like that."

  14. Users don't use AbstractConfigLoader (8/8)

    AbstractConfigLoader wasn’t mentioned by any of the people we interviewed.

  15. Users struggle to change the default ConfigLoader with 0.18.0 (2/8)

    In Kedro 0.17.x users could change their default ConfigLoader through hooks. Since 0.18.0 they have to use settings.py, and not all customisations that were previously possible are still possible.

    User quotes: "Unfortunately, you need to override a lot of things. When in the previous version, the only thing that they needed to override is hooks"

    Referring to an implementation where runtime credentials were passed to the globals dict in the registration hook. "So this was something that I felt might not be possible with 18 because we have an extra quarter injection or something that's happening before even passing some here."

  16. Some users like flexibility to choose what configuration format to use (4/8)

    Several users we interviewed mentioned using, or seeing a use case for, config formats other than YAML.

    User quotes: When talking about pain points with the Kedro ConfigLoader: "And the second one is the fact that the config file is a YAML file"

    "If they [kedro users] want to bring a .py file and leverage the full power of python, into manage their configuration. Is this something we do want to block or not? [...] You should probably just let the user decide what's best for them."
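To illustrate what "letting the user decide" in insight 16 could look like, here is a sketch of a loader that picks a parser by file suffix. The `LOADERS` registry and `load_config` function are hypothetical and not a Kedro API. Only standard-library formats are used so the example is self-contained; a YAML entry would need a third-party parser.

```python
import configparser
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def _load_json(path: Path) -> dict[str, Any]:
    return json.loads(path.read_text(encoding="utf-8"))


def _load_ini(path: Path) -> dict[str, Any]:
    parser = configparser.ConfigParser()
    parser.read(path, encoding="utf-8")
    # configparser returns every value as a string.
    return {section: dict(parser[section]) for section in parser.sections()}


# Hypothetical registry: file suffix -> parser function.
LOADERS: dict[str, Callable[[Path], dict[str, Any]]] = {
    ".json": _load_json,
    ".ini": _load_ini,
}


def load_config(path: Path) -> dict[str, Any]:
    """Pick a parser based on the file suffix and return the parsed config."""
    try:
        loader = LOADERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"No loader registered for {path.suffix!r}") from None
    return loader(path)


with tempfile.TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "parameters.json"
    json_file.write_text('{"model": {"alpha": 0.1}}', encoding="utf-8")
    ini_file = Path(tmp) / "parameters.ini"
    ini_file.write_text("[model]\nalpha = 0.1\n", encoding="utf-8")

    from_json = load_config(json_file)
    from_ini = load_config(ini_file)

print(from_json)  # {'model': {'alpha': 0.1}}
print(from_ini)   # {'model': {'alpha': '0.1'}}
```

The two results differ in value types, a reminder that supporting several formats also means deciding how to normalise what each parser returns.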

noklam commented 1 year ago

Brain dumps again.

1. Users like TemplatedConfigLoader (4/8)

Why is it not default already?

3. Users who care about reproducibility don't like using extra parameters (4/8)
6. It is hard to reason about templated configuration (4/8)

I think these two are related: if there is a proper way to generate a compiled version of the configs, then I think this could be solved. Runtime/dynamic config is good for experimentation, but it is always good to have the final compiled version for reproducibility.

  3. is usually partly solved with experiment tracking, together with the git hash and data versioning.
  10. Users struggle to pass configuration when using kedro package

#795, #1423 related I think.

  16. Some users like flexibility to choose what configuration format to use (4/8)

I think the preference for Python over YAML is mostly due to dynamic config/pipelines, as Python is much more flexible for adding logic to compose the configuration, which is simply impossible with YAML. (Do we want to enable it even if it's possible?)

  12. Some users want to use environment variables for configuration (4/8)

This one looks straightforward. In 0.17.x it's probably very easy to inject it; I'm not so sure about 0.18.x.

  11. Users store credentials in environment variables (3/8)

I think this is related to the Universal Deployment thread: credentials is only useful for a local project and becomes less useful as you move into production, especially when you move to Docker or a cloud service that has a dedicated credential manager.

merelcht commented 1 year ago

Closing this as the research is now completed. Follow up issues will be created.