holoviz / holoviews

With Holoviews, your data visualizes itself.
https://holoviews.org
BSD 3-Clause "New" or "Revised" License

Support serialization of HoloViews objects in Dask #2768

Open shenker opened 6 years ago

shenker commented 6 years ago

I'm using dask.distributed to execute a data analysis pipeline that returns dicts of holoviews plots. Holoviews works fine if you can split your computation code (which returns pure numpy arrays or similar) and your plotting code (which takes those numpy arrays as input and returns holoviews plots; you'd run this entirely in the jupyter notebook kernel, not through dask).

For my use case, I can't do this. I want to construct holoviews ViewableElements on my remote execution nodes, serialize them and bring them back to the jupyter kernel, then display them (or combine them into Overlays/Layouts/HoloMaps, etc. before displaying them). For example, I may want to compute 1000 hv.Images on my remote nodes, bring them back to the jupyter kernel, turn them into a HoloMap, and display them as an animation.

To make this use case possible (I can't believe I'm the only one who wants to do this), I'll need to solve the following three problems:

1) Style options are not pickled together with the holoviews objects. This is documented in the FAQ, but it still caught me off guard. What is the rationale here? If I'm generating plots and moving them between Python processes (using pickle), or saving them to a file to open and display later (using pickle), I'd definitely want to keep any styling. I would strongly argue for changing the default behavior; even if there's a good reason not to, it should at least be easier to change how this behaves. There are many situations where you're using a library that is hard-coded to call pickle.dumps.

In my case, I monkey-patched hv.core.dimension.LabelledData.__setstate__ to do what hv.Store.loads does (setting hv.Store.load_counter_offset = hv.StoreOptions.id_offset() before calling the original hv.core.dimension.LabelledData.__setstate__). I'd appreciate it if anyone has ideas about better ways to do this; in particular, I'm a bit worried about thread safety. Is there any reason not to use GUIDs instead of integer ids? That way you'd never have to worry about clobbering/overwriting settings upon unpickling if you get the offset wrong.

A conceptual question: why do options need to live in a global hv.Options._custom_options dict instead of being attached to the holoviews objects themselves?

2) HoloViews' use of global configuration variables. For example, I need to call hv.extension('bokeh') on the jupyter kernel as well as on all my dask workers. Going forward, I was thinking of monkey-patching hv.extension to automatically run itself on every worker in the dask cluster, so holoviews is configured consistently everywhere. Are there other global settings I need to worry about synchronizing between the client node and the remote nodes?

Why does holoviews need to know which backend is selected while constructing ViewableElements? (If it's only to check that the options are valid, I'd be happy to turn option validation off during a .options call and delay it until the IPython displayhook.) I had imagined that you could build a holoviews plot, and later decide which backend you wanted to use to display it.

3) My understanding is that dask has its own high-performance pickle-based serialization functions for numpy/pandas objects. This isn't a blocker, but for serializing many holoviews plots, each of which carries a lot of data (as mine do), I would eventually want to leverage dask's numpy serializer. This can be handled on the dask side through a custom serializer family, though. (I plan on posting example code once I get this working, perhaps as a patch to dask or holoviews if anyone wants to include it. Suggestions appreciated.)

Thoughts?

CC @mrocklin

jlstevens commented 6 years ago

For example, I may want to compute 1000 hv.Images on my remote nodes, bring them back to the jupyter kernel, turn them into a HoloMap, and display them as an animation.

This is conceptually quite similar to something Philipp and I did many years ago for our PhDs. We distributed hundreds of batch simulations on a cluster, output holoviews pickles and collected all of them back into the notebook. We had a fair bit of infrastructure to do this but unfortunately it was generally too confusing to make public along with the rest of holoviews. The key point is that this is a workflow we are familiar with! That said, we were using an HPC batch system of independent processes and not using dask.

Style options are not pickled together with the holoviews objects .... What is the rationale here?

Styling information is held separately from the objects themselves - you can read a bit about the design in our SciPy 2015 paper. This means some bookkeeping is needed to associate the pickled elements with their styles.

The way to store the styles as pickles is to use Store.load, Store.loads, Store.dump and Store.dumps, which have the same interface as the pickle module. In the long term, we would like to have a more robust way to serialize holoviews objects, but this is currently the best way.

A conceptual question: why do options need to live in a global hv.Options._custom_options dict instead of being attached to the holoviews objects themselves?

We made the decision a long time ago to separate data from the details of representation. This approach keeps the elements as simple wrappers around your data, makes it easier to set global default styles, and lets elements exist independently of any particular plotting library.

Holoview's use of global configuration variables. For example, I need to call hv.extension('bokeh') on the jupyter kernel as well as on all my dask workers.

The notebook extension acts not so much as a global configuration variable but as a way of loading a large chunk of JavaScript into the notebook once instead of with every plot. It does set some necessary global state, such as the active renderer, e.g. to keep track of whether the user is currently using matplotlib or bokeh.

You can pickle holoviews objects without any plotting library installed, unless you want to apply and preserve styles, in which case you do need to activate the appropriate renderer. You can do this with hv.extension, which is also designed to work outside the notebook environment. If you use Store to pickle objects, you shouldn't have to worry about synchronizing any other state.

I had imagined that you could build a holoviews plot, and later decide which backend you wanted to use to display it.

You can do this, though applying plotting-extension-specific styles does currently assume that the corresponding plotting extension is available.

This isn't a blocker, but for serializing many holoviews plots each of which has a lot of data (as I am), I would eventually want to be able to leverage dask's numpy serializer.

This is where the difference between independent batch processes and using dask becomes relevant, and where I might suggest an alternative approach. Why build the holoviews objects on the workers? Why not work with the natively supported dask data structures (e.g. dask arrays/dataframes) and then construct the holoviews objects locally from them? Dask will then do the job of processing the data, and you only need holoviews in the one place where you pull everything back together.

This would be the approach I advise when using dask whereas pickling individual holoviews objects to disk is what I would use when handling a large batch of independently run processes.

Hope that helps!

jlstevens commented 6 years ago

One thing we should do is revisit how we handle pickles. It is possible we now have a better idea of how to get everything (i.e. including the handling of the option trees) working within the normal pickle machinery. This would mean there wouldn't need to be a distinct API, and it would work more smoothly with dask.

jmakov commented 1 year ago

To bump this issue a bit: I'm doing the same thing on ray.io instead of Dask (example code is in the forum). Using .plot() works (though plot options are not applied and need to be reapplied after the plot is gathered), but .plot(datashade=True) doesn't work and produces the following traceback:

File ~/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/_private/worker.py:2309, in get(object_refs, timeout)
   2307     worker.core_worker.dump_object_store_memory_usage()
   2308 if isinstance(value, RayTaskError):
-> 2309     raise value.as_instanceof_cause()
   2310 else:
   2311     raise value

RayTaskError(TypeError): ray::get_plot() (pid=36727, ip=192.168.0.107)
  File "/home/toaster/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/toaster/workspace/venv/puma-lab/lib/python3.10/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 627, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'weakref.ReferenceType' object