ploomber / soorgeon

Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://ploomber.io
Apache License 2.0
78 stars 20 forks source link

Dealing with non-pickable files #9

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

Some objects aren't pickable (e.g, jinja templates) so we should capture this errors at runtime and suggest fixes:

  1. use another library like cloudpickle (this can be an option), like sorgeon notebook.ipynb --serializer cloudpickle
  2. link to a guide to fix the problem (e.g., create a factory to instantiate the object)
idomic commented 2 years ago

We'd probably need to get the most flexible one here: cloudpickle/dill etc.

edublancas commented 2 years ago

yeah, so part of fixing this involves finding out which is the most flexible

neelasha23 commented 2 years ago

Yes I did read in an article that dill has more options than cloudpickle. This mentions the formats supported vs pickle vs dill : https://github.com/uqfoundation/dill/blob/master/dill/_objects.py Was also thinking if we should have both as options.? Also, it seems that the PicklingError is being thrown here : ploomber/executors/serial.py:161: DAGBuildError, so I need to catch this error here and raise the above mentioned suggestions to the user?

idomic commented 2 years ago

I think if it's not too much overhead we should support both as options, some users might not be familiar with the other. On the Error, definitely, we need to capture it and have a user friendly message that instructs on next steps/suggest how to fix.

edublancas commented 2 years ago

a few comments on how to auto-detect this.

we currently don't have a way of detecting it. soogeon analyzes the code to generate the pipeline but it never runs it, so we don't know if the user will run into this issue until they execute ploomber build for the first time. So my recommendation is as to add it as a CLI arg:

soorgeon refactor notebook.ipynb --serializer dill/cloudpickle/etc

Note: these should all be optional dependencies. so if a user selects dill but it doesn't have it installed it should throw an error telling them to install it first

Auto-detecting this problem is a much more difficult problem so I'd say let's only add it as a CLI arg for now

neelasha23 commented 2 years ago

Yes this is the approach I have followed (have added a serialize option in CLI ) The only way it can be detected is when executing ploomber build, so was thinking of providing suggestions to the user once it throws a PickleError during the ploomber build.

edublancas commented 2 years ago

I think let's skip this verification part for now. because a PickleError could be many things, perhaps the user did not generate the pipeline with soorgeon, so it'll be confusing. thoughts @idomic ?

neelasha23 commented 2 years ago

Ok. So we dont need to provide the link to guide or suggest user to use the serializer, snce there's no way of detecting that in soorgeon ? We just need to mention the serializer option in the docs?

edublancas commented 2 years ago

yeah, just mention it in the docs so they know they can use the --serializer option

idomic commented 2 years ago

Yeah I kind of agree, if someone came across this feature it'd be through the docs and it'll be clear from it that the dependency is needed. If this error happens just with core ploomber, it can even be a malformed code so it can create user confusion.

idomic commented 2 years ago

closing as we merged via #53