holoviz / holoviews

With Holoviews, your data visualizes itself.
https://holoviews.org
BSD 3-Clause "New" or "Revised" License

[Question] hv.Dataset default construction from xr.DataArray #3749

Open · cocoaaa opened 5 years ago

cocoaaa commented 5 years ago

Hi, I'm trying to understand how the dims and coords attributes of an xr.DataArray object get propagated to the hv.Dataset object when we don't specify any kdims and vdims, but I wasn't able to find explicit documentation on this. Based on my experiments, here is what I got:

[screenshot of the construction experiments omitted]

Inspecting the new hv.Dataset object...
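
To make the setup concrete, here is a minimal sketch of the kind of construction I'm doing (the array values and dimension names are made up for illustration):

import numpy as np
import xarray as xr
import holoviews as hv

# a small labelled DataArray; names and values are illustrative only
da = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=['time', 'x', 'y'],
    coords={'time': [10, 20, 30], 'x': np.arange(4), 'y': np.arange(5)},
    name='value',
)

ds = hv.Dataset(da)  # no kdims/vdims specified
print(ds.kdims)      # inferred from da.dims: [Dimension('time'), Dimension('x'), Dimension('y')]
print(ds.vdims)      # inferred from da.name: [Dimension('value')]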

Could you help me understand where this difference comes from? In general, I'm quite confused about how the default constructor of hv.Element (i.e. hv.Element(data, kdims=None, vdims=None)) sets the kdims and vdims of the resulting Element instance. I'm trying to understand how it behaves differently when the data is an unlabelled type (e.g. numpy.ndarray) versus when the data is labelled (e.g. xarray.DataArray or pd.DataFrame). For instance, does hv.Dataset's constructor use the xr.DataArray's coords, or just its dims?

If there is documentation on this, I'd appreciate it if you could point me to it. Thank you!

philippjfr commented 5 years ago

Note that the time dimension's coords is not correctly propagated:

Dimensions generally only store metadata about the dimension, while xarray stores both metadata and data (the actual coordinate values) on the coordinates. This is where the difference comes from; duplicating the data that's already inside the xarray datastructure would be very inefficient.
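
A sketch of the distinction (the metadata values are made up):

import holoviews as hv

# an hv.Dimension stores metadata only: name, label, unit, range, etc.
dim = hv.Dimension('time', label='Time', unit='s')
print(dim.name, dim.label, dim.unit)  # -> time Time s

# the coordinate values themselves remain inside the wrapped xarray
# object and are retrieved on demand, e.g. via ds.dimension_values('time')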

In general, I'm quite confused about how the default constructor of hv.Element (i.e. hv.Element(data, kdims=None, vdims=None)) sets the kdims and vdims of the resulting Element instance.

This is somewhat dependent on the datatype. For unlabelled datastructures like numpy arrays, it assumes the data conforms to the shape implied by the default dimensions of the element. For labelled datastructures like xarrays, it will try to infer what you want. In the case of a pandas dataframe, it will also try to use all the columns that are available while also respecting the dimensionality of the element (i.e. a Points element has two key dimensions, so it will by default assume that the first two columns are the kdims and the remainder the vdims).
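
For example, a quick sketch of the column inference for a dataframe (the column names are made up):

import numpy as np
import pandas as pd
import holoviews as hv

df = pd.DataFrame({'a': np.random.rand(10),
                   'b': np.random.rand(10),
                   'c': np.random.rand(10)})

# Points declares two key dimensions, so with no kdims/vdims supplied
# the first two columns become kdims and the remainder become vdims
points = hv.Points(df)
print(points.kdims)  # [Dimension('a'), Dimension('b')]
print(points.vdims)  # [Dimension('c')]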

We should certainly do better about documenting this behavior.

cocoaaa commented 5 years ago

This is somewhat dependent on the datatype. For unlabelled datastructures like numpy arrays, it assumes the data conforms to the shape implied by the default dimensions of the element. For labelled datastructures like xarrays, it will try to infer what you want. In the case of a pandas dataframe, it will also try to use all the columns that are available while also respecting the dimensionality of the element (i.e. a Points element has two key dimensions, so it will by default assume that the first two columns are the kdims and the remainder the vdims).

Thank you for your clarification! It really helps to know that for an unlabelled datastructure, holoviews' Element constructor assumes the data's dimension order conforms to the target Element's default dimensions! It's been a major question for me. I think it would be really helpful to have this explanation included in the documentation, e.g. in the Getting Started -> Gridded Datasets section. In the tutorial, I couldn't understand why the order of the kdims argument needed to be ['Time', 'x', 'y'] (for the hv.Dataset constructor with np.array as the input data), since this order is different from numpy's dimension order (e.g. dim0 is row/height, dim1 is column/width, dim2 is depth/nChannels) and seems rather arbitrary:

[screenshot of the tutorial example omitted]
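
As far as I can tell, the convention is that in the gridded tuple format the coordinate arrays are listed in kdim order, while the value array's shape runs in reverse kdim order, following numpy's row-major layout. A sketch with made-up sizes:

import numpy as np
import holoviews as hv

t, x, y = np.arange(4), np.arange(5), np.arange(6)
# value array shape is (len(y), len(x), len(t)), i.e. reversed kdim order
vals = np.random.rand(len(y), len(x), len(t))
ds = hv.Dataset((t, x, y, vals), kdims=['Time', 'x', 'y'], vdims=['z'])
print(ds)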

Now that I'm more familiar with the netCDF data model and xarray, I can see where this is coming from, but I have to say it is still confusing, especially with the hv.Dataset constructor which is more flexible in defining kdims and vdims (versus hv.Points which has a fixed number of kdims accepted). To elaborate on this, I would expect something like this to work with a 3-dimensional np.array:

import numpy as np
import holoviews as hv

h, w, nc = 20, 20, 3
np.random.seed(0)
data = np.random.randn(h, w, nc)

# to hv.Dataset
# Doesn't work as expected
ds = hv.Dataset(data, kdims=['dim0', 'dim1'], vdims=['r', 'g', 'b'])
# or perhaps this?
ds = hv.Dataset(data, kdims=['dim0', 'dim1', 'dim2'], vdims=['val'])

This is because I'm expecting the kdims arguments to annotate the axes of the already-existing input data, rather than the (not-yet-existing, still being constructed) hv.Dataset instance. However, from your explanation, it seems like the right way to think about the kdims argument to hv.Element's constructor is as a mapping from the element's expected dimensions onto the input data. Please correct me if this is still not entirely correct.
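
For the record, one construction that does seem to behave the way I expected is the gridded dictionary format, where each vdim gets its own 2D array; a sketch using the same array as above:

import numpy as np
import holoviews as hv

h, w, nc = 20, 20, 3
np.random.seed(0)
data = np.random.randn(h, w, nc)

# gridded dictionary format: one 1D coordinate array per kdim plus one
# 2D value array per vdim, shaped (len(dim0), len(dim1)), i.e. in
# reverse kdim order
ds = hv.Dataset(
    {'dim1': np.arange(w), 'dim0': np.arange(h),
     'r': data[:, :, 0], 'g': data[:, :, 1], 'b': data[:, :, 2]},
    kdims=['dim1', 'dim0'], vdims=['r', 'g', 'b'],
)
print(ds)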

Hope my attempt to explain this confusion itself is not too confusing:) Just sharing a thought.

poplarShift commented 5 years ago

Well, if data is a 3D numpy array then you cannot declare more than 3 different dimensions, be they kdims or vdims, so I wouldn't intuitively expect either of your two proposed alternatives to work. I am not sure what you mean by "already existing data input" vs "expected dimensions of the elements". Every element can be constructed in a number of ways. But the whole business of declaring dimensions is about annotating the data itself, and from there on holoviews will tell you what kinds of visualizations it thinks are compatible. E.g. if your data have two kdims, you can't visualize them as a Scatter element [*], because Scatter is only for cases where you have one kdim and one or more vdims.

In other words, on a purely conceptual level, passing the dimensions provides external information about the data; it is not in any way related to what you want to do with them. That is one of the key distinctions between imperative and declarative visualization.

Hope this helped.

[*] well, you can, but in that case holoviews will automatically do the faceting for you and put the remaining kdim into the kdims of e.g. a HoloMap.
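
A sketch of that automatic faceting (data made up):

import numpy as np
import holoviews as hv

xs = np.linspace(0, 1, 10)
rows = [(x, c, np.random.rand()) for x in xs for c in ['a', 'b']]
ds = hv.Dataset(rows, kdims=['x', 'cat'], vdims=['y'])

# Scatter accepts only one kdim; the leftover kdim 'cat' becomes the key
# dimension of the resulting HoloMap
hmap = ds.to(hv.Scatter, 'x', 'y')
print(type(hmap).__name__)  # HoloMap
print(hmap.kdims)           # [Dimension('cat')]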