Notebooks get messy. Suppose a data scientist is working on a notebook that downloads some data, cleans it, generates features, and trains a model. Now say we want to deploy this notebook as a scheduled job that runs every month to re-train the model with new data. In production, we need to keep some of the notebook logic (but not all).
During development, data scientists usually add extra cells for exploration and debugging purposes. For example, I might add a histogram of the input features to check whether the data exhibits certain properties. While useful during development, some of the notebook's cells (e.g. plotting a histogram) aren't needed for production, so cleaning the notebook makes for a smoother transition to production.
Note that this is related to soorgeon clean (https://github.com/ploomber/soorgeon/issues/50) but not the same. soorgeon clean only reformats the existing code. In this case, we're talking about making deeper changes to the user's code.
low-hanging fruit: using autoflake
The low-hanging fruit here is to remove unused imports and variables via autoflake. For example, say a notebook contains import math but never uses the module, and df2 = pd.read_csv('another.csv') but never reads df2. If we run autoflake on that notebook, it'll be able to delete import math since the module is never used, and also df2 = pd.read_csv('another.csv') since df2 isn't used, leaving only the code that is actually referenced.
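To make the idea concrete, here is a minimal sketch of the same kind of cleanup using only the standard library's ast module. It is not autoflake and does not reproduce its behavior exactly; the notebook contents in SOURCE and the function name remove_unused are hypothetical, reconstructed from the description above. It requires Python 3.9+ for ast.unparse.

```python
import ast

# Hypothetical notebook contents (all cells concatenated), reconstructed
# from the description above; not taken from a real notebook.
SOURCE = """\
import math
import pandas as pd

df = pd.read_csv('data.csv')
df2 = pd.read_csv('another.csv')
df.head()
"""


def remove_unused(source: str) -> str:
    """Drop top-level imports and assignments whose names are never read.

    Single pass only: names that become unused after a removal are not
    re-checked. Dropping an assignment also drops its right-hand side, so
    any side effects of that expression are lost. Star imports are treated
    as unused.
    """
    tree = ast.parse(source)
    # Every name that is read (Load context) anywhere in the module.
    loaded = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
    kept = []
    for stmt in tree.body:
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            # Keep only the aliases that are referenced somewhere.
            stmt.names = [
                alias
                for alias in stmt.names
                if (alias.asname or alias.name).split(".")[0] in loaded
            ]
            if not stmt.names:
                continue
        elif isinstance(stmt, ast.Assign):
            # Names bound by this assignment (Store context only, so
            # `df['x'] = 1` binds nothing and is always kept).
            bound = {
                node.id
                for target in stmt.targets
                for node in ast.walk(target)
                if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
            }
            if bound and not (bound & loaded):
                continue
        kept.append(stmt)
    tree.body = kept
    return ast.unparse(tree)


if __name__ == "__main__":
    print(remove_unused(SOURCE))
```

Note that ast.unparse discards comments and original formatting, so a real implementation would more likely remove source lines by position than regenerate the code.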
more advanced approach: automated variable pruning
A more advanced approach would be to delete everything that is not required to produce some final result. For example, say we have a notebook whose final result is df_final, alongside unused imports, extra variables, and exploratory cells.
We can work backward from df_final and eliminate everything that does not affect it, leaving only:
import pandas as pd
df = pd.read_csv('data.csv')
df_final = do_stuff(df)
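The "work backward" step could be sketched as a simple backward pass over top-level statements, again using only the standard library's ast module. This is an assumption about how the pruning might work, not an existing soorgeon feature; SOURCE, prune, and bound_names are hypothetical names, and do_stuff is the placeholder from the example above. It requires Python 3.9+ for ast.unparse.

```python
import ast

# Hypothetical notebook contents before pruning (illustrative only).
SOURCE = """\
import math
import pandas as pd

df = pd.read_csv('data.csv')
df2 = pd.read_csv('another.csv')
df.plot.hist()
df_final = do_stuff(df)
"""


def bound_names(stmt: ast.stmt) -> set:
    """Names that a top-level statement binds."""
    if isinstance(stmt, (ast.Import, ast.ImportFrom)):
        return {(a.asname or a.name).split(".")[0] for a in stmt.names}
    if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return {stmt.name}
    return {
        n.id
        for n in ast.walk(stmt)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }


def prune(source: str, target: str) -> str:
    """Keep only top-level statements that `target` depends on.

    Walks the statements last to first. A statement is kept when it binds
    a name that is currently needed; every name it reads then becomes
    needed as well. This is conservative in one direction (earlier
    re-assignments of a needed name are all kept) and unsafe in another:
    statements that mutate an object in place without binding a name,
    such as `df.dropna(inplace=True)`, are dropped.
    """
    tree = ast.parse(source)
    needed = {target}
    kept = []
    for stmt in reversed(tree.body):
        if bound_names(stmt) & needed:
            kept.append(stmt)
            needed |= {
                n.id
                for n in ast.walk(stmt)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
            }
    tree.body = kept[::-1]
    return ast.unparse(tree)


if __name__ == "__main__":
    print(prune(SOURCE, "df_final"))
```

The in-place mutation caveat matters for notebooks, where patterns like `df['col'] = ...` are common; a usable version would need to treat such statements as modifying the name they operate on.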
considerations
Pruning is useful for cleaning up code that's not needed for deployment, but in some cases the code might be needed again. For example, if I add some data exploration code (e.g. code to generate a plot), I may want to delete it as part of this automated cleaning process, but once the model is deployed, I might need that code again if the model fails and I need to debug things. I'm unsure how to deal with this scenario.