Closed noklam closed 6 months ago
This is a very interesting question. I think it's right to focus just on the output side here, so I'll save my comments on input for another time 🙂
I think we'd need @idanov or maybe even @tsanikgr to explain exactly why `session.run` returns what it does. AFAIK it's always been this way. Intuitively it kind of feels like the right thing to me, since those are the "unprocessed" datasets which you might want to work with further. All intermediate datasets have already been consumed by the pipeline and so shouldn't be required further downstream. If you really want to make them available then you could add a mock identity node that copies them to a free output. Returning all intermediate datasets feels like too much to me.
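The identity-node workaround mentioned above can be sketched like this (a hedged illustration; the dataset names and the `node()` wiring in the comment are hypothetical):

```python
# Sketch of the "mock identity node" workaround: copy an intermediate
# dataset to an output nothing else consumes, so it becomes a free
# output and is returned by session.run(). All names are made up.

def identity(data):
    """Pass the data through unchanged."""
    return data

# In a Kedro pipeline this would be wired up roughly as:
#   node(identity, inputs="model_input", outputs="model_input_copy")
# "model_input_copy" is consumed by no downstream node, so it counts
# as a free output and shows up in the dict that session.run() returns.

print(identity({"rows": 3}))  # {'rows': 3}
```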
The reason I only say *kind of* above is that it seems more questionable to me that we only return those outputs that are `MemoryDataSet`. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk, i.e. we could have `free_output = pipeline.outputs()`.
Also, technically it looks to me like the code that finds `free_outputs` is not quite right. If I define something explicitly as a `MemoryDataSet` in my catalog (unusual, but not unheard of, e.g. to change the `copy_mode`), then it won't count as a `free_output` when probably it should do. It's an edge case, but worth mentioning since we're discussing it here. What `free_output` means in the code is just "output that's not defined in the catalog", which is a subset of "output that's not a `MemoryDataSet`".
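The two readings of "free output" discussed above can be sketched with plain sets (an illustration only, not Kedro's actual code; all dataset names are made up):

```python
# Illustrative sketch of the edge case: a dataset explicitly declared
# as a MemoryDataSet in the catalog is excluded by the current
# "not defined in the catalog" rule, even though it arguably should
# be returned. All names below are hypothetical.

pipeline_outputs = {"model", "metrics", "features"}  # outputs of the pipeline
catalog_entries = {"model", "features"}              # datasets defined in catalog.yml
memory_datasets_in_catalog = {"features"}            # declared MemoryDataSet, e.g. to set copy_mode

# What the code currently computes: outputs absent from the catalog.
free_outputs_current = pipeline_outputs - catalog_entries

# What it arguably should compute: also include catalog-defined MemoryDataSets.
free_outputs_proposed = free_outputs_current | (pipeline_outputs & memory_datasets_in_catalog)

print(free_outputs_current)   # {'metrics'}
print(sorted(free_outputs_proposed))  # ['features', 'metrics']
```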
Adding this related SO question: How to run a kedro pipeline interactively like a function. This issue only focuses on the output of a pipeline; what about input? I think this will be the next question.
Notes for Tech Design
> The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are `MemoryDataSet`. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk, i.e. we could have `free_output = pipeline.outputs()`

1. `free_output` is a bit buggy; we should change it.
2. Add an optional argument for `session.run` to return any targeted datasets, even intermediate or persisted ones. This is more useful for interactive workflows (i.e. notebooks) or debugging purposes. Currently it's tricky to make this work. This one is highly related to #1832.
2.1. If it's an intermediate memory dataset, you can't really get it.
2.2. If it's an intermediate persisted dataset, you need to first `session.run` and then do `catalog.load`.
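The two-step workaround in 2.2 can be illustrated with a pure-Python stand-in (a file-backed toy "catalog" plays the role of Kedro's `DataCatalog`; all names below are made up):

```python
# Stand-in for the current two-step pattern: run the pipeline, then
# load the intermediate persisted dataset back from disk yourself.
# In real Kedro this would be session.run() followed by catalog.load().

import json
import os
import tempfile

class FakeCatalog:
    """Toy persisted catalog: datasets saved as JSON files on disk."""
    def __init__(self, root):
        self.root = root

    def save(self, name, data):
        with open(os.path.join(self.root, name + ".json"), "w") as f:
            json.dump(data, f)

    def load(self, name):
        with open(os.path.join(self.root, name + ".json")) as f:
            return json.load(f)

def run_pipeline(catalog):
    # The intermediate persisted dataset is written during the run and
    # then freed from memory; it is NOT part of the run's return value.
    catalog.save("preprocessed", [1, 2, 3])
    return {}  # only free outputs would be returned

catalog = FakeCatalog(tempfile.mkdtemp())
run_pipeline(catalog)                    # step 1: run
data = catalog.load("preprocessed")      # step 2: load it back from disk
print(data)  # [1, 2, 3]
```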
Notes from Technical Design session:
There was agreement that the "free outputs" output from the session isn't very clear. It was suggested to simply return all output from nodes that is not consumed, even if it's defined in the catalog. However, this could lead to very large amounts of data being returned. Instead we'll change it to return all free outputs and additionally any `MemoryDataSet`s that are defined in the catalog.
The second point, about adding an optional argument for `session.run()` to return any targeted datasets, was discussed briefly, but it was decided to talk about it more thoroughly in a separate workstream about node debugging.
Supplement on the above comments to address @AntonyMilneQB's question:

> i.e. we could have `free_output = pipeline.outputs()`

The answer to that is that there is a `catalog.load` call at the end; it's an expensive call and potentially memory hungry. So persisted datasets are deleted from memory as soon as they are no longer needed. A `MemoryDataSet` is already loaded in memory, so there is no harm in returning it.
I just gave it a go to see what it would take to make the initial idea work, partly because I wanted to test how the `nbdev` system works. See `DebugRunner`.
Adding this as inspiration on whether we should have some kind of argument or debug mode that can specifically return output easily without editing configuration.
At the moment, the proper way to inspect a dataset is to do a `session.run` which makes the "targeted dataset" a "free output". The complication is mainly that `kedro run` needs to be efficient, so some data is deleted on the fly to reduce the memory footprint.
The question is: how can we improve the user experience? It's hard to reason about what is a "free output" and what is not. I would also argue that a significant number of users work with moderately sized data, where keeping everything in memory isn't a problem and makes the development experience smoother. https://github.com/kedro-org/kedro-plugins/issues/44 Is there a way to let users do what they want without touching any configuration?
Background
What's the output of `session.run()`? Currently, this is not as clear as you might think, and it isn't documented anywhere. The logic is defined in `runner.py`; this can be counter-intuitive in some cases. Is there a good reason why we want to do this? https://github.com/kedro-org/kedro/blob/f4914201d7f6f38318c2c1f074fcdf802b3e1e0d/kedro/runner/runner.py#L78-L91
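Paraphrasing the linked `runner.py` logic (a simplified sketch, not the actual Kedro code): the run result only contains outputs that are absent from the catalog, loaded back at the end.

```python
# Simplified paraphrase of the free-outputs logic linked above:
# session.run returns {name: data} only for pipeline outputs that are
# not registered in the catalog. The example names are hypothetical.

def run_result(pipeline_outputs, catalog_datasets, load):
    """Return {name: data} for outputs not defined in the catalog."""
    free_outputs = set(pipeline_outputs) - set(catalog_datasets)
    return {name: load(name) for name in free_outputs}

store = {"model": "trained-model", "metrics": {"acc": 0.9}}
result = run_result(
    pipeline_outputs={"model", "metrics"},
    catalog_datasets={"model"},   # "model" is defined in catalog.yml
    load=store.__getitem__,
)
print(result)  # {'metrics': {'acc': 0.9}} -- "model" is dropped
```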
`kedro` has improved a lot in terms of how to run the pipeline with packaging & `KedroSession` as a standalone application; #1423 documents different ways to do it. Personally, I think it is still not easy enough to integrate with `kedro` for someone who is inexperienced with kedro. In #1423, it mentioned how a pipeline can be called programmatically. Even though the pipeline itself is a function call, it doesn't behave like a function, i.e. you can't really define an input as an argument easily (it has to be a Catalog entry), and the output of the pipeline is also very restricted.

Motivation
Kedro works really well within the kedro world, but it also means that kedro works very differently from the rest of the Python world.
This issue mainly focuses on the output side; this will improve the experience of integrating a `kedro` pipeline as an upstream. In an over-simplified world, this should be straightforward to do. Currently I think we have a strong assumption that people work with a "Kedro Project", but if we are moving towards a kedro package, i.e. using `from kedro_package import main`, it should behave just like a Python function. I think this is a reasonable expectation.

Questions
What should be the output of `session.run`?

Things to consider
Related Issues:
- #1423 is trying to improve/simplify how we run a kedro pipeline as a standalone package
- #795 discussion of how to use kedro as upstream/downstream