ploomber / ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️
https://docs.ploomber.io
Apache License 2.0

Feature Request: Inversion of Control features should be supported in notebooks #1088

Open marr75 opened 1 year ago

marr75 commented 1 year ago

We're working on adopting Ploomber as our pipeline management technology. In early experimentation, I've found that many of Ploomber's best inversion-of-control features don't seem to be supported for notebooks. I find this odd given the amount of attention and ink spent on integrating Jupyter notebooks.

Examples (in order of importance):

- The extensions you've added to make a jupyter[lab] server work well with plain ol' .py files are very useful, but I was disappointed by the small subset of features available to notebook tasks. This breaks the most powerful features of Ploomber when using notebooks.
- Pure Python tasks can use clients and serializers to improve testability and make large changes possible with tiny, reliable edits to the pipeline spec. With pure Python, you can develop using human-readable formats and the local filesystem, then switch to binary formats and cloud storage in production with a couple of lines of YAML (see the sketch below); this is not possible with notebooks.
- Ploomber teaches certain concepts and expectations around upstreams and products for Python tasks that are not valid when using notebooks.
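For reference, this is roughly what that swap looks like for function tasks today (a minimal pipeline.yaml sketch; the my_project.* module paths are illustrative):

serializer: my_project.io.serialize
unserializer: my_project.io.unserialize
clients:
  File: my_project.clients.get_s3  # swap local storage for S3 without touching task code
tasks:
  - source: my_project.tasks.clean
    product: output/clean.parquet

Deleting the serializer, unserializer, and clients lines takes you back to plain local files; that's the couple-of-lines swap notebook tasks can't take advantage of.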

Suggestion: abstract upstream and product into Python objects you import, instead of injecting dictionaries of strings into notebooks.

# %% tags=["parameters"]
# add default values for parameters here

# %% tags=["injected-parameters"]
# Parameters
upstream = {
    "input": "/some/wild/absolute-path/input.csv"
}
product = {
    "data": "/some/wild/absolute-path/data.csv",
    "nb": "/some/wild/absolute-path/nb.ipynb",
}

# %%
import pandas

df = pandas.read_csv(upstream["input"])
result = do_some_stuff(df)  # stand-in for the task's real logic
result.to_csv(product["data"])

could become:

# %% tags=["parameters"]
# add default values for parameters here
upstream, product = {}, {}

# %% tags=["injected-parameters"]
# Parameters
upstream, product = ploomber.nb.get_context() # knows the current state of the pipeline and uses it to populate upstream and product

# %%
df = upstream['input'] # deserializer and client populate the object instead of the path
result = do_some_stuff(df)
product['data'] = result # serializer and client encode and store the result instead of the notebook doing it using a path
edublancas commented 1 year ago

Hi @marr75,

Thanks for your feedback! Some thoughts.

> Injection substitutes absolute paths. This results in multiple editors accidentally fighting over the upstream and product cells in source control, even if no real edits are made to the notebook.

The quick way to fix this is to remove the injected cell before committing to git. The reason we inject absolute paths is to reduce ambiguity: when you execute ploomber build, the current working directory for all notebooks is the same as the one in your terminal when executing the command. However, when you open a notebook, JupyterLab sets the current working directory to the notebook's parent directory. This led to user confusion in early versions of Ploomber.

However, I agree this creates the problem where paths resolve to different absolute values in different environments. I think an alternative would be to inject them as paths relative to the pipeline.yaml file. Thoughts?
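Concretely, the injected cell would then look something like this (illustrative paths, resolved relative to pipeline.yaml's location):

# %% tags=["injected-parameters"]
# Parameters (paths relative to pipeline.yaml)
upstream = {"input": "output/input.csv"}
product = {
    "data": "output/data.csv",
    "nb": "output/nb.ipynb",
}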

Re serializers and clients: I like your ploomber.nb.get_context(). Our team is pretty small, so it'll take some time for us to get this going, but we're open to community contributions. Do you have time to contribute to this? I'm happy to guide you through the process.

marr75 commented 1 year ago

@edublancas I'm blown away by how responsive you are 😁

Honestly, digging deeper into the docs, you provide an example of exploring outputs of the rest of the pipeline using DAGSpec.find().to_dag(). I think this should probably be the recommended way to obtain a reference to the upstream and product objects (at least in a notebook). Maybe it needs some work or could be wrapped in a more notebook-specific abstraction. I'm willing to take a run at a contribution. Let me hack around a bit and then open a draft PR.
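For concreteness, this is the pattern I mean, as a minimal sketch (the "input" task name is hypothetical):

from ploomber.spec import DAGSpec

# locate the nearest pipeline.yaml and parse it into a DAG
dag = DAGSpec.find().to_dag()
dag.render()  # resolve products and upstream dependencies
# look up another task's product by task name
input_path = str(dag["input"].product)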

Related: many of the docs show examples where the task is responsible for serialization and "client" concerns, obtaining only paths from the upstream and product. I think this ignores the most powerful features of Ploomber. I may be able to help here, too.
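For instance, as I read the serialization docs, a function task with a serializer and unserializer configured never touches paths at all; a sketch with illustrative names ("clean", "raw"):

def clean(upstream):
    # the unserializer already loaded the upstream product into an object
    # (here, assumed to be a pandas DataFrame)
    df = upstream["raw"]
    # with a serializer configured, returning the value persists it;
    # the task never handles product paths
    return df.dropna()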

edublancas commented 1 year ago

> Honestly, digging deeper into the docs, you provide an example of exploring outputs of the rest of the pipeline using DAGSpec.find().to_dag().

Yes! Shortly after posting my initial response, I realized that adding the context functionality would be simple, since we already have functions for finding and parsing the DAG, so I'm glad you found it.
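Roughly, it could be a thin wrapper over those functions; a sketch only (get_context is your proposed name, not an existing API):

from ploomber.spec import DAGSpec

def get_context(task_name):
    """Return (upstream products, this task's product) for a notebook task."""
    dag = DAGSpec.find().to_dag()
    dag.render()  # resolve all products and dependencies
    task = dag[task_name]
    upstream = {name: t.product for name, t in task.upstream.items()}
    return upstream, task.product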

Feel free to open a draft PR to get early feedback. Also, you're welcome to join our community in case you have quick questions while setting up a dev environment, testing, etc.