ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0
78 stars 20 forks source link

Dealing with non-pure functions #17

Open edublancas opened 2 years ago

edublancas commented 2 years ago

soorgeon detects outputs by looking at the left side of an assignment expression, for example, let's say the section of a notebook looks like this:

x = 1
y, z = compute_stuff()

In the snippet above, x, y, and z will be detected as outputs, since they are all on the left side of the assignment (=) operator.

However, non-pure functions mutate their inputs: Say we have the following:

import pandas as pd

def fn(df):
    df['new_column'] = 1

df = pd.DataFrame({'column': [0]})

# problematic line!
df2 = fn(df)

Before the problematic line df, only has one column (named column), but after the problematic line, df has two columns (column, and new_column); however we currently have no way of knowing that since df does not appear to the left of the assignment operator. This might cause the output of soorgeon refactor to be incorrect.

Unfortunately, there is an infinite way to mutate variables, in fact, any method might mutate them. example:

# this may or may not mutate the input
df.do_something()

I think for now we should detect instances of:

df[something] = stuff

and display a warning

e1ha commented 2 years ago

Just to clarify but is the issue that since the change for df occurs within a function when the code calls the function, soorgeon cannot see that df has changed which means that df will not be marked as an output when it should? I'm assuming this only applies to function parameters, since for local variables, they wouldn't matter and global variables aren't supported. Also when it says "mutating its inputs" does that mean it applies to any input parameters that are mutable objects or just data frames? Thanks.

edublancas commented 2 years ago

(I edited my original comment to make it clearer)

Just to clarify but is the issue that since the change for df occurs within a function when the code calls the function, soorgeon cannot see that df has changed which means that df will not be marked as an output when it should?

correct

Also when it says "mutating its inputs" does that mean it applies to any input parameters that are mutable objects or just data frames? Thanks.

it applies to any input parameters

let me sync with Ido, this is low priority so there might be other stuff that will be better use of your time :)

idomic commented 2 years ago

@e1ha let's focus on the Algolia story for now and more to be coming soon!