kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Should we change the output of `session.run`? #1802

Closed noklam closed 6 months ago

noklam commented 2 years ago

Background

What's the output of `session.run()`? Currently, it is not as clear as you might think, and it isn't documented anywhere. The logic is defined in `runner.py`; it can be counter-intuitive in some cases. Is there a good reason why we want to do this?

https://github.com/kedro-org/kedro/blob/f4914201d7f6f38318c2c1f074fcdf802b3e1e0d/kedro/runner/runner.py#L78-L91
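For reference, the logic linked above can be paraphrased roughly as follows. This is an illustrative sketch with simplified names, not the actual Kedro code: the real implementation works on `Pipeline` and `DataCatalog` objects, but the set arithmetic is the essence.

```python
# Simplified sketch of how the runner decides what session.run() returns
# (illustrative only; the real logic lives in kedro/runner/runner.py).

def free_outputs(pipeline_outputs: set, catalog_datasets: set) -> set:
    """Outputs of the pipeline that have no entry in the catalog.

    Anything not registered in the catalog becomes a MemoryDataSet at
    runtime, and only those datasets are loaded and returned by run().
    """
    return pipeline_outputs - catalog_datasets

# Example: 'model' is persisted via the catalog, 'metrics' is not.
outputs = free_outputs({"model", "metrics"}, {"raw_df", "model"})
print(outputs)  # {'metrics'}
```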

Kedro has improved a lot in terms of how to run a pipeline, with packaging and KedroSession as a standalone application; #1423 documents different ways to do it. Personally, I think it is still not easy enough to integrate with Kedro for someone who is inexperienced with it. #1423 mentions how a pipeline can be called programmatically. Even though the pipeline itself is a function call, it doesn't behave like a function: you can't easily define an input as an argument (it has to be a catalog entry), and the output of the pipeline is also very restricted.

Motivation

Kedro works really well within the Kedro world, but that also means Kedro works very differently from the rest of the Python world.

This issue mainly focuses on the output side; improving it will make it easier to integrate a Kedro pipeline as an upstream component. In an over-simplified world, this should be straightforward to do. Currently we make a strong assumption that people work within a "Kedro project", but if we are moving towards a Kedro package, i.e. using `from kedro_package import main`, it should behave just like a Python function. I think this is a reasonable expectation.

```python
df = get_some_data()
model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
app = PredictionWebService(model)
```

Questions

Related Issue:

antonymilne commented 2 years ago

This is a very interesting question. I think it's right to focus just on the output side here so I'll save my comments on input for another time 🙂

I think we'd need @idanov or maybe even @tsanikgr to explain exactly why session.run returns what it does. AFAIK it's always been this way. Intuitively it kind of feels like the right thing to me, since those are the "unprocessed" datasets which you might want to work with further. All intermediate datasets have already been consumed by the pipeline and so shouldn't be required further downstream. If you really want to make them available, you could add a mock identity node that copies them to a free output. Returning all intermediate datasets feels like too much to me.

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

Also, technically it looks to me like the code that finds free_outputs is not quite right. If I define something explicitly as a MemoryDataSet in my catalog (unusual, but not unheard of, e.g. to change the copy_mode) then it won't count as a free_output when it probably should. It's an edge case, but worth mentioning since we're discussing it here. What free_output means in the code is just "output that's not defined in the catalog", which is a subset of "output that's a MemoryDataSet".
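The edge case can be shown with a toy sketch (not Kedro code; the dataset names and catalog structure here are invented for illustration). Suppose `cleaned_df` is declared in the catalog explicitly as a MemoryDataSet, while `model` is persisted:

```python
# Toy illustration of the two possible definitions of "free output".
catalog = {
    "cleaned_df": {"type": "MemoryDataSet"},  # explicit, e.g. to set copy_mode
    "model": {"type": "PickleDataSet"},
}
pipeline_outputs = {"cleaned_df", "model"}

# What the code currently returns: outputs with no catalog entry at all.
not_in_catalog = {o for o in pipeline_outputs if o not in catalog}

# What one might expect: every output backed by a MemoryDataSet.
memory_outputs = {
    o for o in pipeline_outputs
    if o not in catalog or catalog[o]["type"] == "MemoryDataSet"
}

print(not_in_catalog)  # set() - cleaned_df is skipped despite living in memory
print(memory_outputs)  # {'cleaned_df'}
```

Here `not_in_catalog` is a strict subset of `memory_outputs`, which is the gap Antony describes.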

noklam commented 2 years ago

Adding this related SO question - How to run a kedro pipeline interactively like a function. This issue only focuses on the output of a pipeline; what about input? I think that will be the next question.

noklam commented 2 years ago

Notes for Tech Design

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

  1. Less controversial - change the default: the definition of free_output is a bit buggy, so we should change it.

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

  2. Open up an optional argument for `session.run` to return any targeted datasets - even if it's an intermediate or persisted dataset. This is more useful for interactive workflows (i.e. notebooks) or debugging purposes. Currently it's tricky to make it work. This one is highly related to #1832.
     2.1. If it's an intermediate memory dataset - you can't really get it.
     2.2. If it's an intermediate persisted dataset - you need to first `session.run` and then do `catalog.load`.

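The idea behind a debug-style run (as in the DebugRunner prototype linked later in this thread) boils down to not releasing datasets after their last consumer runs. A toy sketch with no Kedro dependency, where the node structure is invented purely for illustration:

```python
# Toy sketch of a "debug" run: execute nodes in order and keep *every*
# result, instead of freeing intermediates to save memory.

def run_keeping_everything(nodes, inputs):
    """nodes: list of (name, func, input_names, output_name) tuples."""
    data = dict(inputs)
    for name, func, input_names, output_name in nodes:
        # Every output stays in `data`; nothing is released after use.
        data[output_name] = func(*(data[i] for i in input_names))
    return data  # intermediates AND final outputs are all available

nodes = [
    ("clean", lambda df: [x * 2 for x in df], ["raw"], "cleaned"),
    ("train", lambda df: sum(df), ["cleaned"], "model"),
]
result = run_keeping_everything(nodes, {"raw": [1, 2, 3]})
print(result["cleaned"])  # [2, 4, 6] - the intermediate is still accessible
print(result["model"])    # 12
```

The trade-off is exactly the memory footprint concern raised below: a real run deletes data on the fly, which is why this only makes sense as an opt-in debug mode.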
merelcht commented 2 years ago

Notes from Technical Design session:

There was agreement that the "free outputs" returned from the session aren't very clear. It was suggested to simply return all node outputs that are not consumed, even if they're defined in the catalog. However, this could lead to very large amounts of data being returned. Instead, we'll change it to return all free outputs plus any MemoryDataSets that are defined in the catalog.

The second point about adding an optional argument for session.run() to return any targeted datasets was discussed briefly, but it was decided to talk about it more thoroughly in a separate workstream about node debugging.

noklam commented 2 years ago

Supplementing the above comments to address @AntonyMilneQB's question:

i.e. we could have free_output = pipeline.outputs()

The answer is that there is a `catalog.load` call at the end; it's an expensive call and potentially memory-hungry. So persisted datasets are deleted from memory as soon as they are no longer needed. A MemoryDataSet is already loaded in memory, so there is no harm in returning it.

noklam commented 2 years ago

I just gave it a go to see what it would take to make the initial idea work, partly because I wanted to test how the nbdev system works. See DebugRunner:

https://noklam.github.io/kedro-debug-runner/core.html

noklam commented 1 year ago

Adding this as inspiration on whether we should have some kind of argument or debug mode that can specifically return output easily without editing configuration.

At the moment, the proper way to inspect is

The complication is mainly that a kedro run needs to be efficient, so some data is deleted on the fly to reduce the memory footprint.

The question is: how can we improve the user experience? It's hard to reason about what is a "free output" and what is not. I would also argue that a significant number of users work with moderately sized data, where keeping everything in memory isn't a problem and makes the development experience smoother. https://github.com/kedro-org/kedro-plugins/issues/44 Is there a way to let users do what they want without touching any configuration?