ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0

automated notebook cleaning #49

Open edublancas opened 2 years ago

edublancas commented 2 years ago

Notebooks get messy. Suppose a data scientist is working on a notebook that downloads some data, cleans it, generates features, and trains a model. Let's now say we want to deploy this notebook so we run a scheduled job every month to re-train the model with new data. In production, we need to keep some of the notebook logic (but not all).

During development, data scientists usually add extra cells for exploration and debugging. For example, I might plot a histogram of the input features to check whether the data exhibits certain properties. While useful during development, some of the notebook's cells (e.g. plotting a histogram) aren't needed in production, so cleaning the notebook helps make the transition to production smoother.

Note that this is related to soorgeon clean (https://github.com/ploomber/soorgeon/issues/50) but not the same. soorgeon clean only reformats the existing code. In this case, we're talking about making deeper changes to the user's code.

low-hanging fruit: using autoflake

The low-hanging fruit here is to remove unused imports and variables via autoflake. For example, say a notebook looks like this:

# cell 1
import math
import pandas as pd
df = pd.read_csv('data.csv')

# cell 2
df2 = pd.read_csv('another.csv')

# cell 3
df.plot()

if we run autoflake on that notebook, it will delete import math, since the module is never used, and df2 = pd.read_csv('another.csv'), since df2 is never read, and we'll end up with:

# cell 1
import pandas as pd
df = pd.read_csv('data.csv')

# cell 2
df.plot()
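To make the idea concrete, here's a simplified, stdlib-only sketch of the kind of analysis autoflake performs (this is not autoflake's actual implementation, which also handles unused variables, star imports, and many edge cases): parse the source with ast, collect every name that appears in the body, and drop top-level imports whose names are never referenced.

```python
import ast


def remove_unused_imports(source: str) -> str:
    """Drop top-level import lines whose names are never used.

    A simplified stand-in for autoflake's unused-import removal;
    assumes single-line, top-level import statements.
    """
    tree = ast.parse(source)
    # every identifier referenced anywhere in the module
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    kept = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stmt = next((n for n in tree.body if n.lineno == lineno), None)
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            # the name an import binds: the alias if present,
            # otherwise the top-level package name
            names = [a.asname or a.name.split(".")[0] for a in stmt.names]
            if not any(n in used for n in names):
                continue  # unused import: drop this line
        kept.append(line)
    return "\n".join(kept)
```

Running it on the three-cell example above removes import math and keeps the pandas import, since pd is referenced by pd.read_csv. Note that nothing is executed here; the source is only parsed, so undefined names are fine.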

more advanced approach: automated variable pruning

A more advanced approach would be to delete everything that is not required to produce some final result. For example, say we have a notebook whose final result is to produce df_final:

import pandas as pd
df = pd.read_csv('data.csv')
df2 = pd.read_csv('another.csv')
df3 = pd.read_csv('final.csv')

df4 = do_something(df2, df3)

df_final = do_stuff(df)

We can work backward and eliminate everything that does not affect df_final:

import pandas as pd
df = pd.read_csv('data.csv')

df_final = do_stuff(df)
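The backward elimination above can be sketched with ast as well. This is a rough illustration under strong assumptions (a flat script of assignments and imports, no control flow, no mutation through method calls): walk the top-level statements in reverse, keep a statement only if it defines a name some later kept statement reads, and add the names it reads to the needed set. The prune_for_target name and the flat-script restriction are my own simplifications, not anything soorgeon ships.

```python
import ast


def prune_for_target(source: str, target: str) -> str:
    """Keep only the top-level statements needed to compute `target`.

    Simplified sketch: works backward through a flat script; requires
    Python 3.9+ for ast.unparse.
    """
    tree = ast.parse(source)
    needed = {target}
    kept = []
    for stmt in reversed(tree.body):
        # names this statement defines (assignment targets and imports)
        defined = {n.id for n in ast.walk(stmt)
                   if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            defined |= {a.asname or a.name.split(".")[0] for a in stmt.names}
        if defined & needed:
            kept.append(stmt)
            # names this statement reads become needed in turn
            needed |= {n.id for n in ast.walk(stmt)
                       if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return "\n".join(ast.unparse(s) for s in reversed(kept))
```

On the notebook above, prune_for_target(source, "df_final") keeps the pandas import, the df assignment, and the df_final line, and drops df2, df3, and df4, matching the hand-pruned result. Again, the code is only parsed, never executed, so placeholder functions like do_stuff don't need to exist.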

considerations

Pruning is useful for removing code that's not needed for deployment, but in some cases the pruned code might be needed again. For example, if I add some data exploration code (e.g. code to generate a plot), I may want to delete it as part of this automated cleaning process, but once the model is deployed, I might need that code again if the model fails and I have to debug things. I'm unsure how to deal with this scenario.