ploomber / ploomber-engine

A toolbox šŸ§° for Jupyter notebooks šŸ“™: testing, experiment tracking, debugging, profiling, and more!
https://engine.ploomber.io
BSD 3-Clause "New" or "Revised" License
66 stars 14 forks source link

Possible to avoid translating `tuple` parameter values as strings, or else allow a user-supplied `Translator`? #86

Open blakeNaccarato opened 10 months ago

blakeNaccarato commented 10 months ago

Parametrizing a notebook on parameters={"list_param": [0,1], "tuple_param": (0,1)} will result in list_param value being injected as a list, but the tuple_param value being injected as the stringified representation of a tuple, e.g. "(0, 1)" (literally with the quotes). I understand that ploomber-engine strives to match the papermill API, so maybe this is intentional to maintain that compatibility, but is there a chance for the parameter translator to faithfully inject tuples?

Maybe a more sensible request would be to allow the user to pass a custom Translator to PloomberClient, which could pass the custom Translator subclass to parametrize_notebook (below), allowing the user to customize behavior, in the case that it is desired to mimic the papermill API surface by default?

Contents of translate

Here we see that tuple falls through to the "last resort" translator which stringifies it.

https://github.com/ploomber/ploomber-engine/blob/e00bdc2d7676955a9fc00cd514fa9c98a92bf904/src/ploomber_engine/_translator.py#L76-L95

Abbreviated contents of parametrize_notebook

https://github.com/ploomber/ploomber-engine/blob/e00bdc2d7676955a9fc00cd514fa9c98a92bf904/src/ploomber_engine/_util.py#L72-L73 ... https://github.com/ploomber/ploomber-engine/blob/e00bdc2d7676955a9fc00cd514fa9c98a92bf904/src/ploomber_engine/_util.py#L92-L94

edublancas commented 10 months ago

We want to keep compatibility with papermill (although I agree that converting tuples into strings is an odd API choice).

I'm open to extending the API so users can customize the translator, what's the use case for this? wouldn't it be easier to convert the tuple into a list before executing the notebook?

blakeNaccarato commented 10 months ago

We want to keep compatibility with papermill

Yeah, the papermill compatibility guarantee is understandable.

wouldn't it be easier to convert the tuple into a list before executing the notebook?

Yes, that's what I ended up doing in my case.

I'm open to extending the API so users can customize the translator, what's the use case for this?

I think I'm hitting these surprises because I'm interacting with namespaces of parametrized notebooks, a less-common use-case. I chose ploomber-engine instead of Ploomber proper to keep it lightweight and more adaptable to the patterns I'm using (particularly in testing).

I am injecting parameters and handling namespaces in calling code. One rationale is to keep file-IO logic out of the notebook logic. In tests, I pass different parameters through a notebook and assert certain namespace results.

Allowing a substitute translator would probably be better than special-casing the existing translator. I figured it could be a slippery-slope if you do tuple today, then someone wants slices, and so-on. An alternative extension point would allow users to pass an override mapping of types to handler functions, which wouldn't require subclassing Translator.

But a reasonable choice is also "the translator works the way that it does, we'll make sure the surprises are well-documented". It's really not that big of a deal to be limited to str, bool, int, float, dict, and list. If you like, I'm open to authoring the documentation change if that's the path chosen. I could also manage the feature implementation if that's the choice, once an approach is decided.

As an aside, get_namespace() doesn't directly accept parametrization like execute_notebook()does. So this requires importing some ploomber_engine internals, ploomber_engine._util.parametrize_notebook. Facilitating that on get_namespace() would be really cool!

I've provided more implementation details of this use-case below.

Details #### Function [`get_nb_ns()`](https://github.com/blakeNaccarato/boilercore/blob/4deddd0f92677731b6319f0f4cb4491d781976b4/src/boilercore/notebooks/namespaces.py#L51-L63), getting parametrized namespace using `ploomber-engine` internals Also allows returning a subset of the notebook namespace, e.g. usually just a few attributes for manipulation in module code. ```Python def get_nb_ns( nb: str, params: Params = NO_PARAMS, attributes: Attributes = NO_ATTRS ) -> SimpleNamespace: """Get notebook namespace, optionally parametrizing it.""" nb_client = get_nb_client(nb) # We can't just `nb_client.execute(params=...)` since`nb_client.get_namespace()` # would execute all over again if params: parametrize_notebook(nb_client._nb, parameters=params) # noqa: SLF001 namespace = nb_client.get_namespace() return SimpleNamespace( **({attr: namespace[attr] for attr in attributes} if attributes else namespace) ) ``` #### Implementation of [`export_contours()`](https://github.com/blakeNaccarato/boilercv/blob/1458ae4409e42012a753aced4694b53b11c52038/src/boilercv/stages/experiments/e230920_subcool/__init__.py#L55-L61) The below `export_contours()` is run concurrently for different inputs via `ProcessPoolExecutor` to run the notebook for different conditions and write a resulting namespace object to HDF. This facilitates "sandwiching" the [notebook logic](https://github.com/blakeNaccarato/boilercv/blob/main/src/boilercv/stages/experiments/e230920_subcool/find_contours.ipynb) with parameters and file I/O. ```Python def export_contours(params: Params): """Export contours.""" dest = EXP_DATA / "contours" ns = get_nb_ns(nb=read_exp_nb("find_contours"), params=params) dest.mkdir(exist_ok=True) path = (dest / f"contours_{ns.PATH_TIME}").with_suffix(".h5") ns.contours.to_hdf(path, key="contours", complib="zlib", complevel=9) ``` #### Some [tests](https://github.com/blakeNaccarato/boilercv/blob/1458ae4409e42012a753aced4694b53b11c52038/tests/boilercv_tests/test_e230920_subcool.py#L54-L72) exercising notebook namespaces (parametrization boilerplate omitted) ```Python ... def test_custom_features(ns, ax): ns.objects.plot(ax=ax) assert_index_equal(ns.objects.columns, ns.my_objects.columns, check_order=False) # type: ignore ... def test_find_collapse(ns): objects = ns.nondimensionalized_departing_long_lived_objects assert objects["Dimensionless bubble diameter"].min() < 0.2 ... def test_get_thermal_data(ns): assert ns.data.subcool.mean() == pytest.approx(3.65, abs=0.01, rel=0.01) ```
edublancas commented 10 months ago

Thanks for the detailed answer!

I think we can break down the action items:

  1. Document how the parameter translation works so users know which data types are supported (and that any unsupported one is converted into a string)
  2. We can adopt a subclass approach to customize PloomberClient (see code below)
  3. If you also want to work on adding parametrization support for get_namespace, feel free to! Just open a new issue so we can track progress separately

Subclassing PloomberClient:

class MyCustomClient(PloomberClient):

    def translate_parameters(self, parameters):
        # custom translation logic

client = MyCustomClient(...)
client.execute(...)

This would imply modifying PloomberClient so it uses a default translator but subclasses could override this by re-writing a translate_parameters method. Thoughts?

blakeNaccarato commented 10 months ago

Thanks for the detailed answer!

Thanks! It was so detailed, that I got self-conscious and folded half of it under <details> šŸ˜. I think I'm about to outdo myself with this essay of a response, though. I've numbered relevant bullet points so you can refer to them in shorthand, e.g. (3), or help me decide the necessary ones per acceptance critera for the current work.

Also, it's nearly Thanksgiving, so know our communication will likely pause abruptly. I plan to work on it a bit over the weekend, but have enough to do in dev environment setup/documentation that I won't get into the implementation details this week, I'm sure.

If you also want to work on adding parametrization support for get_namespace, feel free to! Just open a new issue so we can track progress separately

(3) I'll open an issue on this sometime soon, but I'll prioritize working on the current matter first. If someone else comes along and wants to work the other PR in the meantime, that's fine, too.

Plan

(4) Unless otherwise noted, I'll assume the following work happens in one PR. Let me know if you recommend a different grouping/sequence of work, e.g. if certain tasks should wait for a follow-up.

Documentation

Document how the parameter translation works so users know which data types are supported (and that any unsupported one is converted into a string)

Modification of PloomberClient

We can adopt a subclass approach to customize PloomberClient. [...] This would imply modifying PloomberClient so it uses a default translator but subclasses could override this by re-writing a translate_parameters method. Thoughts?

Tests

Conclusion

This work will wire up the papermill-consistent translation to a translate_parameters method on PloomberClient, and expose this method for overriding through subclassing. Unexpected interactions are likely wherever existing subclasses of PloomberClient are in use, and it's not exactly clear to me how to handle those right now.

Maybe this work can be broken into a few PRs. Most important is the proposed feature, directly-related documentation, tests, and exposing this feature through various helper functions like execute_notebook(). I think that enabling custom translation is worth the effort, though.

edublancas commented 10 months ago

@neelasha23: looping you so you're aware of this

(1) If I create this as a draft PR, will that avoid noise for maintainers and assignments? I don't want to create noise until the PR is ready or I explicitly reach out with further questions.

That's fine, opening a draft PR won't create any noise

(2) If I need to mention one of the maintainers with an issue, should I ping you or someone else? Here or in the PR?

you can tag me and @neelasha23, she'll be reviewing your code as well

(11) Separate PR? Should any Changed in version ... placeholders be added in this work/PR, or is that handled in a different process?

Please add an "added in version" in the docstring for the new translate_parameters method

(12) Write a default translate_parameters method.

we can use ploomber_engine._translator.translate_parameters()

(13) Should existing calls to parameter translation throughout the codebase be changed to use the PloomberClient.translate_parameters() method instead? Worried about leaving dangling logic pointing to prior implementation.

Unsure about this, where in the codebase are we calling the translate parameters? I'm tempted to keep those as is since calling PloomberClient.translate_parameters() should only happen inside the PloomberClient class. But let me know where you find other call instances

(14) How can current exposed functionality like execute_notebook() operate on a custom user PloomberClient? It seems it's instantiated now via INIT_FUNCTION, so a client with custom translation won't be able to be used here? Should a client argument be added to such functions, allowing a custom client in? This will likely be incompatible with PloomberMemoryProfilerClient unless users actually subclass that one. I guess documentation in (7) is tied to this change.

Fair point. I think let's leave this point out of the initial work. We can think of ways for users to call execute_notebook with a custom client. But let's not add too many things to a single PR.

(15) Separate PR? Verify existing subclasses of PloomberClient still work (already covered by tests?), e.g. PloomberLogger (especially since its own execute method override has a FIXME indicating duplicating some logic from PloomberClient) and PloomberMemoryProfilerClient. Maybe these are already sufficiently isolated and won't be affected as long as they don't override the translate_parameters method.

Good catch. If you feel like there are missing unit tests, please add them. This can be the first thing to tackle, as it'll ensure you that your changes aren't breaking existing functionality.

(16) If users are encouraged to subclass PloomberClient for translate_parameters handling, are there any breaking uses I should be aware of while implementing this method? I'm particularly interested in how a given client instance gets its hooks, like self.on_notebook_start and self.oncell* (used here). Can a user subclass inadvertently mess up those hooks? Or am I mixing these up, it seems maybe PloomberNotebookClient is actually not involved here?

Right PloomberNotebookClient is not involved. PloomberNotebookClient is a legacy client that derives from nbclient, this was our first implementation, before we implement a notebook executor from scratch. We consider this legacy code and might remove it in the future.


I hope I didn't miss anything important! Feel free to tag me in your draft PR if you have questions

blakeNaccarato commented 10 months ago

Thanks for the thorough response. I created a draft PR, so I'll continue the discussion over there and ping you both as needed for clarification. First I'll establish a rough scope, set up my environment, and write tests covering existing subclass usage as needed, before getting into the actual implementation.

edublancas commented 10 months ago

sounds good!

blakeNaccarato commented 9 months ago

It looks like time has gotten away from me here, so I figured I'd chime in to say I'm still alive. I haven't gotten around to the implementation, though I do still intend to, and would like to put time into it this month. If any passers-by see this and are eager to work on this themselves, just @ me and I'll try to respond promptly.