kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Packaged kedro pipeline does not work well on Databricks #1807

Closed noklam closed 4 weeks ago

noklam commented 2 years ago

Originated from internal users https://quantumblack.slack.com/archives/C7JTFSFUY/p1661236540255789

antonymilne commented 2 years ago

This is one of the things I have fixed in https://github.com/kedro-org/kedro/pull/1423. I really need to write it up so we can fix it properly (I have a better fix than that PR). Let me check the slack thread now anyway.

noklam commented 2 years ago

Add some more comments to document what happened. In summary, there are 2 issues:

  1. The CLI mode generates `sys.exit()` calls, which some platforms (e.g. Databricks) don't welcome, so users must fall back to using `KedroSession` for a packaged Kedro project:

    from kedro.framework.session import KedroSession
    from kedro.framework.project import configure_project

    package_name = <your_package_name>
    configure_project(package_name)

    with KedroSession.create(package_name) as session:
        session.run()


2. Related to #1728: `rich` overrides `ipython._showtraceback`, and Databricks then ignores any Python `Exception`, treating a "databricks job" as successful even if the Python program terminated with an exception.
antonymilne commented 2 years ago

A few more notes on this. The sys.exit problem is already known. This is actually a problem completely independent of databricks and happens just in a normal IPython session (which makes it much easier to try out). It's discussed further in #1423. There are ways around it that don't need you to invoke KedroSession explicitly. It basically boils down to invoking click in such a way that it doesn't do the sys.exit:

Adventures with click

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a click command it always exits afterwards with an Exception, even if execution was successful. This means it's impossible to run any code after that main in a Python script and also massively confuses ipython, which blows up.
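This failure mode is easy to reproduce without Kedro or even click: any entry point that calls `sys.exit()` behaves the same way when imported and called programmatically. A minimal stdlib-only sketch (the `main` function here is a stand-in for a click command in standalone mode, not Kedro's actual entry point):

```python
import sys

def main(argv=None):
    # Stand-in for a click command in standalone mode: even after a
    # successful run, it terminates the interpreter via sys.exit().
    print("pipeline ran successfully")
    sys.exit(0)

# Calling it programmatically raises SystemExit, so any code after the
# call never runs, and interactive shells (IPython, Databricks notebooks)
# surface it as an error even though the exit code is 0:
try:
    main()
except SystemExit as exc:
    print(f"caught SystemExit with code {exc.code}")
```

This is why "it's impossible to run any code after that main in a Python script": the exception propagates before the next statement executes.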

For the record, some of the things that don't quite fix this are:

  • run(standalone_mode=False), as suggested on the click repo. This solves problems 2 and 3 but not 1
  • run.callback. This nearly gets you there, but it doesn't fill in the default arguments for you, so you need to pass every argument explicitly in the function call, which is horrible
  • run.main is equivalent to what we were doing before (run.__call__), so it doesn't help
  • run.invoke is really close, but unlike run.forward it doesn't fill in default arguments, and it won't pick up CLI arguments when called with python -m spaceflights --pipeline=ds

Problem 2 is fixed by https://github.com/kedro-org/kedro/pull/1769:

A fix for the issue raised by @pstanisl in https://github.com/kedro-org/kedro/issues/1728. Hopefully this will be fixed on the rich side in https://github.com/Textualize/rich/issues/2455 in due course, but for now I'm fixing it here because it's proving problematic, e.g. for @PedroAbreuQB: jobs on Databricks that should be detected as Failed are currently marked as Succeeded.

The solution is just to disable rich.traceback.install on databricks.

astrojuanlu commented 1 year ago

I opened an issue on the click repo (which didn't go down well... 😅)

That's... disappointing. See a previous report of a similar issue https://github.com/pallets/click/issues/1054 (2018, locked in 2020)

I would like to better understand the context of this issue. First, I tried creating a barebones Click CLI:

$ cat click_adventures.py
"""
Adventures with Click.
"""

__version__ = "0.1.0"

import click

@click.command()
def cli():
    print("Hello, world!")

if __name__ == "__main__":
    cli()
$ cat pyproject.toml
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"

[project]
name = "click_adventure"
authors = [{name = "Juan Luis", email = "juan_luis_cano@mckinsey.com"}]
dependencies = [
  "click",
]
dynamic = ["version", "description"]

[project.scripts]
click_adventure = "click_adventure:cli"

And it looks like running it using %sh in Databricks works just fine:

image

Next, I packaged the default spaceflights starter with Kedro 0.18.9 (which, by the way, includes things like kedro-viz and jupyterlab as dependencies... hopefully fixed when we address #2519) and installed it on Databricks. Then I uploaded the config and the data files, and everything seemed to work fine too:

image

What didn't work was running the project from Python, as advised in our docs https://docs.kedro.org/en/stable/tutorial/package_a_project.html#run-a-packaged-project

image ... image

Could this issue be addressed by

  1. Changing the official recommendation and
  2. Making the CLI functions very thin wrappers over non-Click functions that do the actual work? Essentially:
def run(...):
    ...

@click.command()
def run_command(args):
    click.echo(run(args))

(from https://github.com/pallets/click/issues/1054#issuecomment-400347771)
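To make that suggestion concrete, here is a hedged, stdlib-only sketch of the separation, using argparse as a stand-in for click (the names `run`/`run_command` and their parameters are illustrative, not Kedro's actual API):

```python
import argparse

def run(pipeline="__default__", env="base"):
    # Plain Python function that does the actual work; it returns a value
    # and never calls sys.exit(), so it is safe to call from a notebook.
    return f"ran pipeline {pipeline!r} in env {env!r}"

def run_command(argv=None):
    # Thin CLI wrapper: only this layer parses arguments and prints.
    parser = argparse.ArgumentParser()
    parser.add_argument("--pipeline", default="__default__")
    parser.add_argument("--env", default="base")
    args = parser.parse_args(argv)
    print(run(pipeline=args.pipeline, env=args.env))
```

Programmatic callers would use `run()` directly; only the `python -m` path goes through `run_command`, so `sys.exit` never leaks into notebook code.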

I just started using Kedro on Databricks, so please excuse my innocence and possibly candid calls for a solution - I'm sure everyone has tried everything in the book, and I'm also sure I'm about to discover more weirdnesses.

astrojuanlu commented 1 year ago

This is an ongoing issue https://www.linen.dev/s/kedro/t/13153013/exit-traceback-most-recent-call-last-file-databricks-python-#14d50244-c91a-4738-b0de-11b822eb181b

Looking at the traceback, this is the code the user was trying to run:

params_string = ','.join(list_params)
main(
    [
        "--env", conf,
        "--pipeline", pipeline,
        f"--params={params_string}"
    ]
)

And filling the gaps, I think the main function does something like this:

run = _find_run_command(package_name)
run(*args, **kwargs)

which resembles https://github.com/kedro-org/kedro/blob/0.18.11/kedro/templates/project/%7B%7B%20cookiecutter.repo_name%20%7D%7D/src/%7B%7B%20cookiecutter.python_package%20%7D%7D/__main__.py

So, most likely the user is doing

from kedro_project.__main__ import main

main()

This touches on https://github.com/kedro-org/kedro/issues/2384 and https://github.com/kedro-org/kedro/issues/2682. I think we should address this soon.

astrojuanlu commented 1 year ago

In line with what I said in https://github.com/kedro-org/kedro/issues/2682#issuecomment-1593075949,

astrojuanlu commented 10 months ago

This bug report is more than 1 year old. Can we get an updated view of what's happening here?

We have 3 official workflows for Databricks:

On the other hand, in this page https://docs.kedro.org/en/stable/tutorial/package_a_project.html we're telling users to do this:

from <package_name>.__main__ import main

main(
    ["--pipeline", "__default__"]
)  # or simply main() if you don't want to provide any arguments

which in my opinion is conceptually broken, because we are abusing a CLI function to run it programmatically.

@noklam called this above a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

? Should we update our docs instead?

@idanov's #3191 seems to make that CLI function use click standalone mode conditionally, but in my opinion (1) it's a hack and (2) doesn't really help us pinpoint why our users are running code that is broken instead of following the instructions from our docs.

What am I missing?

[^1]: Note that dbx is now deprecated in favour of Databricks Asset Bundles, this change hasn't made it to our docs yet https://github.com/kedro-org/kedro/pull/3212/

noklam commented 10 months ago

@astrojuanlu The important distinction here is between a "packaged project" and a "Kedro project".

In a simplified world, python -m <package> (or the main function) is basically bootstrap_project + KedroSession.run. There are some minor differences (i.e. cli.py). Basically, session is used for non-packaged projects and main is used for packaged projects.

The benefit of having an entry point is that most deployment systems support it, because it is just python -m. To use session.run you need a custom Python script for execution.

@noklam called this https://github.com/kedro-org/kedro/issues/1807#issuecomment-1228342162 a "downgrade", but what's the problem with running

with KedroSession.create(package_name) as session:
    session.run()

See https://github.com/kedro-org/kedro/pull/1423#issuecomment-1228944699; originally I even proposed kedro run --package_name=xxx

Using Databricks jobs https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html and here we are telling users to write a custom entry point, supposedly because of this issue. But isn't that workaround enough?

For this, I think it is a bad workaround, as it shouldn't need a special entry point; https://github.com/kedro-org/kedro/pull/3191 should replace it. Assuming you didn't start with databricks-iris, you will need to do that copy & paste. This step can be removed, because any Kedro project can share ONE entry point.

Honestly, I don't hate KedroSession, because users are already familiar with it, and they can stick with it for both a normal Kedro project and a packaged one. That said, I see some extra benefits in the entry point and in keeping things consistent.

I think we need to be pragmatic here; whether or not a CLI is the best way to do this is arguable, and that will require more discussion and design. Unless we plan to change this in the short term, I think we should at least move forward with #3191 and continue the CLI vs click wrapper discussion.

astrojuanlu commented 10 months ago

The important thing here is "Packaged project" and "Kedro Project"

In a simplified world, python -m <package> (or the main function) is basically bootstrap_project + KedroSession.run. There are some minor differences (i.e. cli.py). Basically, session is used for non-packaged projects and main is used for packaged projects.

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

For this, I think this is a bad workaround as it shouldn't need a special entrypoint

I agree 💯 I understand then that the problem is that currently users are forgetting to do that step, and they expect the default entrypoint to just work.

noklam commented 10 months ago

Thanks, that's helpful. So this issue concerns the Databricks workflows that use packaged projects only, hence "Databricks jobs" at the moment. Please correct me if I'm wrong.

It is more generic: from my_package import main wouldn't work nicely in any interactive environment (IPython, Jupyter) because of the sys.exit. Of course, you can still use the workaround of manually doing bootstrap_project.

https://github.com/kedro-org/kedro/issues/3237 has more context, but #3191 goes beyond just Databricks; it also affects integrating Kedro with other downstream tasks. (The CLI way doesn't work currently, so only KedroSession.run works.)
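For completeness, one blunt notebook-side workaround (not the bootstrap_project route discussed above, and not a real fix, since the exit status is silently discarded) is to swallow the SystemExit. Sketched here with a stand-in `main` rather than a real packaged project:

```python
import sys
from contextlib import suppress

def main(argv=None):
    # Stand-in for a packaged project's click entry point, which calls
    # sys.exit() even after a successful run.
    print("pipeline finished")
    sys.exit(0)

# suppress(SystemExit) lets the rest of the notebook cell keep running,
# but note that a non-zero (failure) exit code would be swallowed too:
with suppress(SystemExit):
    main(["--pipeline", "__default__"])
print("cell continues")
```

This is exactly the kind of boilerplate a better default entry point would make unnecessary.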

astrojuanlu commented 10 months ago

Of course you can still use the workaround by manually doing bootstrap_project

This has been my point all along: that this shouldn't be a "workaround", it should be our recommended way of doing it.

I'm not blocking #3191 on this discussion, but I'm hoping we can have it some day. Reason being: running a Kedro project programmatically requires either abusing these CLI functions, or adding some magic (bootstrap_project? configure_project? what do these functions do?) that obscures what's happening.

idanov commented 10 months ago

In the past we had a few nice .puml diagrams, showing the whole list of possible execution flows in different modes of execution, but it seems that they have been deleted from the repo. But in general, the idea is that we have two distinct regimes:

bootstrap_project is only used in the first regime, because it adds src/ (or the custom directory you've decided to use) to the Python import path.

configure_project is the one allowing you to import things from your own project through kedro.framework.project, and it is essential for having a uniform way for Kedro and Kedro plugins to access project-specific things like settings and registered pipelines without knowing the project package name (which is obviously always different for different Kedro projects). configure_project is what gives Kedro the django-like experience for settings.

It's probably worth documenting that for advanced users who would like to know how Kedro works internally and maybe retrieve the old .puml diagrams we've had.

astrojuanlu commented 10 months ago

.puml diagrams for reference https://github.com/kedro-org/kedro/tree/0.17.7/docs/draft

astrojuanlu commented 10 months ago

I generated a reproducer that doesn't depend on Kedro.

Source code: https://github.com/astrojuanlu/simple-click-entrypoint

Job definition:

image

Result:

image

image

astrojuanlu commented 10 months ago

Reported this upstream https://github.com/MicrosoftDocs/azure-docs/issues/116964

astrojuanlu commented 9 months ago

Did a bit more debugging in https://github.com/kedro-org/kedro/pull/3191#pullrequestreview-1764364409: the sys.ps1 trick proposed there is not working. There's not enough clarity on how Databricks launches the entry point.

It's clear that passing standalone_mode=False should work - the problem is how to detect when to pass it.

astrojuanlu commented 6 months ago

The sorry state of this issue is as follows:

We need to keep brainstorming. I honestly think the best way forward would be to simplify the default __main__ and entry point https://github.com/kedro-org/kedro/issues/3051, ideally making them independent from our Click wrappers and avoiding having 5 different context-dependent ways of running a Kedro project #3237, but maybe there are other possible solutions.

noklam commented 6 months ago

Left a comment here: https://github.com/kedro-org/kedro/issues/3237#issuecomment-1969278134. This goes beyond Databricks, so I want to keep the conversation on the broader topic.

astrojuanlu commented 3 months ago

Useful bit of information I found in uv: https://github.com/astral-sh/uv/pull/3890 the JPY_SESSION_NAME environment variable is set for notebooks.

See:

(From https://stackoverflow.com/a/77904549 [^1])

For whatever reason this doesn't work in VS Code notebooks though; they set a variable called __vsc_ipynb_file__ instead https://github.com/microsoft/vscode-jupyter/pull/8531

Dropping this here in case it helps.
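Taken together, a hedged detection sketch based on those two signals might look like this (`running_in_notebook` is a hypothetical helper, and as noted, neither signal fires on Databricks):

```python
import os
import sys

def running_in_notebook() -> bool:
    # Jupyter sets JPY_SESSION_NAME for notebook kernels (see the uv PR
    # and the Stack Overflow answer linked above).
    if os.environ.get("JPY_SESSION_NAME"):
        return True
    # VS Code notebooks instead inject __vsc_ipynb_file__ into the
    # __main__ namespace.
    main_mod = sys.modules.get("__main__")
    return main_mod is not None and hasattr(main_mod, "__vsc_ipynb_file__")
```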

[^1]: Unlike ChatGPT, I can cite my sources 😉

astrojuanlu commented 3 months ago

Nah, none of this works on Databricks 👎🏼