Data API - Githubissues

philippjfr commented 9 years ago

Now that we've made HoloViews plotting backend independent it's time to talk about generalizing the data backends. Reviewing the current situation I have to say we haven't done quite as good a job as we could have done but all the foundations are there to unify things very nicely and iron out kinks like the lack of support for categorical or date types in various charts. As part of the dataframe refactor we've already begun looking at this but it's been more of an ad-hoc way of patching in support for dataframes.

Today it occurred to me that we already have three different storage formats for table-like data and a lot of code built up to convert and operate on the different data types. They include:

NdElement implements a table as an NdMapping, i.e. a fancy dictionary. For large tables this is very slow.
Chart objects while never displayed as such are effectively multiple columns stored in an NxD numpy array, where N is the number of samples and D the number of dimensions/columns.
DFrame as it's name suggests simply uses a pandas dataframe as the underlying datasource.

It would be nice if all subclasses of these Element types could be unified to use any of these datasources. Currently different Element types use different data formats and it's all a bit of a mess and not very flexible. If we are successful with this the following Elements will interchangeably support all three data formats: Curve, Points, VectorField, Bars, Table, ErrorBars, Spread, Scatter, Scatter3D, Distribution, TimeSeries and with some minor adjustments Histogram.

I'm not entirely certain about the mechanism yet but I envision that a we introduce a new wrapper class which dispatches any method calls to one of the three baseclasses depending on the type of the data. We would provide a common API for each of the backends so custom Elements, plotting code and the user can access the data in very well defined ways independent of the data backend. Luckily we've generally made sure that such a common API exists even if we haven't used it consistently.

The current API for Element types is this:

dimension_values: Returns an array, list or pandas series of one column
range: Returns the minimum and maximum values along a dimension excluding NaNs.
select: Same as __getitem__ but works for one or multiple dimensions.
dframe: Converts data to a dataframe
collapse_data: Applies a function across a list of Element.data attributes
reduce: Applies a reduce function across one or multiple axes
sample: Returns a Table with of the samples specified as lists
groupby: Groups by the values along one or multiple dimensions returning an NdMapping type indexed by the grouped dimensions with Elements of the same type as values. (not implemented for Charts)
reindex: Reorder specified dimensions dropping unspecified dimensions. (not implemented for Charts)

First of all I would like a better name for dimension_values, it's one of the most important methods as it returns data along one column/dimension the obvious suggestion would be values but that clashes with the dict interface of NdMapping types. Would it be terrible to let values accept an optional argument to specify the dimension, otherwise returning all value dimensions like it does currently?

Secondly I would suggest a few additions to the API, mainly data conversion, which simply clone the Element but convert between the three data formats, something like:

.as_array
.as_dframe
.as_mapping

Conversion between Element types would now be trivial since they support the same data formats and I would also deprecate the table and dframe methods.

Finally I'd like to suggest that we also have a common NumPy like indexing interface, for the array format it would just index into the data directly, the dataframe version can just use the df.ix[idx] interface and the implementation for the NdMapping based tables is also trivial. For some operations this is going to be significantly faster than going via our value based indexing and it would be nice to do this in a backend agnostic way.

The great thing about this is that most of the implementation for each of these data backends is already there in the Chart, NdElement and DFrame classes. I think unifying them in this way will really clean up our data API and allow us to do some more powerful things with different interfaces in future. Once we've figured this out we'll have to consider how to extend it to other Element types like Paths and Raster types.

In terms of timeline, I think this could be done in one or two days of concentrated work. I think it might even make sense to merge our intermediate attempt at integrating dataframe support as soon as we're satisfied it works and then carry this refactor out for a v1.5 release.

Edit:

The API is actually a little bit larger than what I've described above, as it will necessarily include the NdMapping API:

keys: Returns the values of the key dimensions as a list of tuples
values: Returns the values of value dimensions as a list of tuples
items: Key and values combined
get: Dictionary semantics
pop: Dictionary semantics
drop_dimension: Drops specified dimension(s)/column(s)
add_dimension: Adds named column at specified position with supplied value(s)

While keys won't usually be helpful, items will basically be the API to convert to an NdMapping like representation, and both drop_dimension and add_dimension will be useful. Having get and pop on Chart types is somewhat annoying though.

jlstevens commented 9 years ago

First, I should say that the general gist of this proposal seems like a very good idea to me. This is definitely something we want as it will make HoloViews more consistent and more powerful!

Would it be terrible to let values accept an optional argument to specify the dimension, otherwise returning all value dimensions like it does currently?

That sounds perfectly reasonable to me although I would want to check the return types stay consistent. We would add a deprecation warning to dimension_values for a while.

.as_array .as_dframe .as_mapping

Why not drop the as_ prefix and leave the .dframe and .table alone? Then we can just add .array.

If we use either of these suggestions, the return values better be as stated and not elements by default. The .array method better return an actual array etc. unless explicitly specified otherwise. Maybe you can get an element back with a switch?

foo.as_array()  # Return a NumPy array
foo.as_array(element=True)  # Now same type as foo

These methods should redirect to some utility class where the functionality is implemented. This reflects how I think we should tackle this problem in general:

Select the allowable formats we want for the .data attribute.
Use a utility class to access this .data attribute (no matter what format it is) in a consistent way within the code. The idea is to abstract away the details of the underlying format.

I firmly believe in utilities for this sort of refactor as they can be 1. understood in isolation 2. tested in isolation 3. used in isolation 4. can be extended in future (e.g xray). Personally I would create a DataConversion class containing only classmethods. I suspect many of these classmethods would take a data format as a first argument (e.g self.data in many cases) to return the data in a known format regardless of what the input format was.

Some things I would expect from the conversion utility:

Code paths using isinstance to keep things fast. E.g if a dframe is required and the input is already in this format, use just the pandas API necessary to do the conversion efficiently and return the result.
Support the NumPy indexing API you suggest above as I agree this would help make life easier.
Support for taking in HoloViews elements as input and returning one of the acceptable .data formats. This would be used in the constructors of Elements to support casting between elements.
We might want a classmethod that takes in the input to an Element constructor, the kdims and vdims and returns the appropriate Dimension objects. Again this would help with casting between elements.

If done properly, we could eventually get rid of the table conversion utilities (deprecate them gradually) because we would be able to immediately cast any element into any other type! In addition we would get the benefits of supporting new data formats (heterogeneous types, speed improvements etc).

philippjfr commented 9 years ago

So I thought about the issue of inheriting the entire NdMapping API again and I've come to the conclusion that the baseclass should not itself inherit from NdMapping and hold an OrderedDict. Instead the data may optionally be an NdElement (the current Table baseclass) type. This means the data of this baseclass is either an array, a dataframe, or an NdElement. This would break old pickles containing Tables but a small legacy wrapper in __setstate__ could fix this.

jbednar commented 9 years ago

This sounds like a great proposal to me. Can the data can also be a Blaze object?

On Fri, Sep 18, 2015 at 1:56 PM, Philipp Rudiger notifications@github.com wrote:

So I thought about the issue of inheriting the entire NdMapping API again and I've come to the conclusion that the baseclass should not itself inherit from NdMapping and hold an OrderedDict. Instead the data may optionally be an NdElement (the current Table baseclass) type. This means the data of this baseclass is either an array, a dataframe, or an NdElement. This would break old pickles containing Tables but a small legacy wrapper in setstate could fix this.

— Reply to this email directly or view it on GitHub https://github.com/ioam/holoviews/issues/269#issuecomment-141535516.

philippjfr commented 9 years ago

Yes, we've already tested that a little bit on the dataframe branch. This proposal just makes sure that we don't just patch in the support all over the place and have a general API to extend in future, e.g. with x-ray.

jlstevens commented 9 years ago

I've already committed some support for dask arrays to Raster on the dataframe branch. That work is related to this proposal but not part of it - as Philipp says, we want to generalize this sort of functionality correctly.

jlstevens commented 9 years ago

I've talked again to Philipp and I feel it is worth summarizing our thoughts here:

We agree on the idea of using a utility class to provide a consistent API regardless of the underlying data structure.
We can update the current base classes (i,e Chart, NdElement, DFrame) to make use of these utilities.
Hopefully we can then eliminate some of these base classes by unifying them together which will help clean up the code.

One sticking point has been decided how on what format to use as a pure python tabular format (i.e if pandas is not available). The question has been whether Table or NdElement should be accepted as .databut we've now decided on the following scheme:

The .data attribute can be 1. a Numpy array 2. a pandas DataFrame 3. an NdElement 4. the .data attribute of whatever other element is passed into the constructor (i.e. back to one of the first three options).
Table will no longer inherit from NdElement (but it will inherit from Element at some level). This is a generalized version of Table (that also replaces DFrame) and will support the three formats mentioned above.
NdElement will be given a rich representation just like Table (i.e. it will inherit from Tabular) but unlike Table it will not support DataFrame or numpy arrays in the constructor. It will be the version of Table that does not require third-party libraries (i.e pure python) and will have no subclasses (Bars will be updated accordingly).
NdElement if unified with the other base classes as proposed above should be renamed. We've proposed the name Columnar (or Columns?) as that is the main feature of data, no matter the details of the data structure used. I think I vote for Columns as it will be a proper element and should be a noun.
We may want to consider an analogue class (ArrayLike?) for Raster types that support numpy and dask arrays.

This approach avoids unnecessary nesting of the data held by the .data attribute and resolves the issue of how Table can work without pandas being available (which would be an issue if Table were allowed to be .data).

There is a lot to think about and we will need to make sure our API abstracts over the different data types while remaining rich enough to be useful in practice (e.g for implementing operations).

jbednar commented 9 years ago

Sounds good. I too prefer Columns to Columnar.

Will it be clear how to add other .data formats in the future, apart from the three above? And is option 1 really required to be a numpy array, or can it be anything supporting a numpy array's interface? E.g. blaze tries to support that.

jlstevens commented 9 years ago

Will it be clear how to add other .data formats in the future, apart from the three above?

Yes, there will be an Interface base class that determines what needs to be implemented e.g for xray or maybe unit arrays (from astropy).

can it be anything supporting a numpy array's interface? blaze tries to support that.

From my understanding, you need blaze+dask for out-of-core arrays. I have a prototype of this but unfortunately, it isn't quite a numpy interface - if you have a dask array you need to call the .compute method to load the array-like object from disk. See here for the prototype utility I was using to handle this.

In other words, we can support dask arrays as .data with the proposed system, but there would need to be some logic (hopefully very little!) to call compute in the appropriate interface class.

Finally, I am going to assign this to the 1.4.0 milestone as I think this refactor will greatly expand the generality and power of HoloViews. It should be possible to make all this work without breaking pickles (very important!) and I don't think we should delay getting this implemented.

Edit: Using dask arrays by calling compute early on in the interface class would be an easy thing to do, but this doesn't use dask very well. A proper Interface class would be a better idea that uses the dask API to make use of all the efficient out-of-core computation it offers.

jlstevens commented 9 years ago

It is also worth mentioning that Philipp and I agree that these Interface classes would be a very natural place to add unit support at some point. We should keep this in mind as an additional goal for this system that we will want to think about later on.

philippjfr commented 8 years ago

This has just been merged, closing the issue.

holoviz / holoviews

Data API #269