Closed philippjfr closed 3 weeks ago
Amazing write-up.
Here is an additional example with code, which would hopefully be addressed by the proposed changes.
In the code below, I believe that we have to redim the channel_name to the common value dimension (amplitude_dim) in order for downsample1d to work, but I think doing this redim prevents the wide-dataframe-index-optimization, slowing things down as the number of lines scales up.
I'm happy with your analysis of the issue and I think I'm happy with your proposed solution.
If I'm following along correctly, it seems like users could hit the same sort of problem that keeps multi_y from being the default: people have previously been sloppy about declaring dimensions in Overlays. I.e. people may have only declared a label for one Element out of an overlay, since that's all that's needed to get the axis label to update, and now only that one plot will match dimensions for things like shared_axes and link_selections. Not handling sloppy code like that isn't a fatal issue with the approach, but it would be good to work out exactly when and if it would occur so that we can guide users.
In any case, would we then build on this support to add something at the hv.Dataset level where we can easily do a groupby or by on the wide dataframe and get this behavior, without explicitly having to construct an overlay or layout?
Maybe we could avoid the explicit overlay construction at the hvPlot level.
I personally don't think the added brevity is a top priority in a HoloViews-dominant workflow.
Agree with @droumis, I'm honestly fine with leaving that to hvPlot but also wouldn't be opposed if someone wanted to propose such an API for Dataset.
Doing that at the hvPlot level makes good sense, yes.
I'm not worried about brevity so much as consistency, i.e. to ensure that there is a clean, well-supported, well-documented, tested way to work easily with a wide dataframe. Giving whatever way that is a name is one way to ensure that, but it can be done with documentation and examples instead if the code is clean. Would be good to see an example here of the HoloViews code that would be used to create a plot of one timeseries from such a dataframe at a time with a selector widget to select the stock name, in the absence of new API.
Here's what that looks like:
import pandas as pd
import holoviews as hv
df = pd.read_csv('https://datasets.holoviz.org/stocks/v1/stocks.csv', parse_dates=['Date']).set_index('Date')
hv.NdOverlay({col: hv.Curve(df, 'Date', (col, 'Price')) for col in df.columns}, 'Ticker')
@philippjfr, how would I now adapt this code mentioned above? I'm seeing errors:
Or, not using the redim, but using the mapping as in your stocks example (curve = hv.Curve(df, kdims=[time_dim], vdims=[(channel_name, 'amplitude')])) produces the same error as with the redim.
Simplifying to match the stocks example:
One thing that really confused things was this:
hv.Overlay(curves, kdims="channel")
Overlays do not support key dimensions so this should be disallowed.
Across these three PRs this should now be fixed:
From the beginning HoloViews was designed primarily around tidy data. This has the major benefit that data can clearly be delineated into key dimensions (or independent values / coordinates) and value dimensions, which represent a dependent variable, i.e. some kind of measurement. Additionally it makes it possible to easily perform the groupby operations that allow HoloViews to facet data in a grid (GridSpace), in a layout (NdLayout), using widgets (HoloMap/DynamicMap) and as a set of traces in a plot (NdOverlay). However, in many common scenarios data will not be tidy. The most common case is a set of timeseries indexed by the date(time), with multiple columns that all record the same kind of value, e.g. stock prices, where the index is the date and each column records the stock price for a different ticker.
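To make the distinction concrete, here is a small pandas sketch of the same data in wide and tidy form. The tickers and prices are made-up illustrative values, not taken from the stocks dataset.

```python
import pandas as pd

# Wide form: one row per date, one column per ticker (values are made up).
wide = pd.DataFrame(
    {"AAPL": [185.0, 186.5, 184.2], "MSFT": [370.1, 372.4, 371.0]},
    index=pd.date_range("2024-01-01", periods=3, freq="D", name="Date"),
)

# Tidy form: one row per (Date, Ticker) observation with a single Price column.
tidy = wide.reset_index().melt(id_vars="Date", var_name="Ticker", value_name="Price")

# In tidy form, faceting by ticker is an ordinary groupby on the key column.
per_ticker = {ticker: group for ticker, group in tidy.groupby("Ticker")}
```

In the tidy frame, Date and Ticker play the role of key dimensions and Price is the single value dimension, which is the layout HoloViews' groupby-based containers expect.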
The problem with reshaping this data is that it's tremendously inefficient. Where before you could have one DataFrame you now have to create N DataFrames, one for each stock ticker. So here I will lay out my proposal for formally supporting wide data in HoloViews.

The Problem
While today you can already organize data in such a way that you create an NdOverlay where each Element provides a view into one column in the wide DataFrame, it breaks HoloViews' internal model of the world. E.g. let's look at what the structure of the ticker data looks like if you do this:
Here the ticker names now become the values of the NdOverlay key dimension AND they are the value dimension names of each Curve element. This is clearly inelegant and also conceptually not correct, i.e. AAPL is not a dimension, it does not represent some actual measurable quantity with some associated unit. The actual measurable quantity is "Stock Price". This is necessary because the element equates the value dimension with the name of the variable in the underlying data, i.e. the string 'AAPL' will be used to look up the column in the underlying DataFrame. Downstream this causes issues for the sharing of dimension ranges in plots and other features that rely on the identity of Dimensions.

The proposal
There are a few proposals that might give us a way out of this but they are potentially quite disruptive since HoloViews deeply embeds the assumption that the Dimension.name is the name of the variable in the underlying dataset. Introducing a new distinct variable on the Dimension to distinguish the name of the Dimension and the variable to look up does therefore not seem feasible. The only thing that I believe can be feasibly implemented is relying entirely on the Dimension.label for the identity of the Dimension. In most scenarios the name and label are mirrors of each other anyway but when a user defines label that should be sufficient to uniquely identify the Dimension.

Based on some initial testing this would already almost achieve what we want without breaking anything. Based on a quick survey the changes required to make this work are relatively minor:
- Dimension.__eq__ should compare just the label, not the name and label, ensuring that Dimension('AAPL', label='Price') and Dimension('MSFT', label='Price') are treated as the same dimension.
- Dimension and Dimensioned reprs should be updated to reflect the label as the source of truth of the identity of the dimension.
- Dimension.label
This would be sufficient to fully support wide data without major disruptive changes to HoloViews, ensuring that linking of dimension ranges continues to work and that the reprs correctly represent the conceptual model HoloViews has of the data.
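The label-based identity described above can be sketched with a toy class. ToyDimension is a hypothetical stand-in written for this illustration; it is not HoloViews' Dimension and does not reflect how the change was actually implemented.

```python
class ToyDimension:
    """Hypothetical stand-in: `name` is the column to look up, `label` is the identity."""

    def __init__(self, name, label=None):
        self.name = name
        # As in the proposal, the label mirrors the name unless set explicitly.
        self.label = name if label is None else label

    def __eq__(self, other):
        if not isinstance(other, ToyDimension):
            return NotImplemented
        # Identity is decided by the label alone; the name is ignored.
        return self.label == other.label

    def __hash__(self):
        # Keep hashing consistent with equality.
        return hash(self.label)

    def __repr__(self):
        # Lead with the label, since it is the source of truth for identity.
        return f"ToyDimension(label={self.label!r}, name={self.name!r})"


aapl = ToyDimension("AAPL", label="Price")
msft = ToyDimension("MSFT", label="Price")
unlabeled = ToyDimension("AAPL")
```

Here aapl and msft compare equal because they share the 'Price' label even though they point at different columns, while unlabeled falls back to its name as the label and therefore does not match them. That second case is the "sloppy declaration" scenario raised earlier in the thread.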