Closed: philippjfr closed this issue 8 years ago.
First, I should say that the general gist of this proposal seems like a very good idea to me. This is definitely something we want as it will make HoloViews more consistent and more powerful!
Would it be terrible to let values accept an optional argument to specify the dimension, otherwise returning all value dimensions like it does currently?
That sounds perfectly reasonable to me, although I would want to check that the return types stay consistent. We would add a deprecation warning to `dimension_values` for a while.
`.as_array` `.as_dframe` `.as_mapping`

Why not drop the `as_` prefix and leave `.dframe` and `.table` alone? Then we can just add `.array`.

If we use either of these suggestions, the return values had better be as stated and not elements by default. The `.array` method should return an actual array etc. unless explicitly specified otherwise. Maybe you can get an element back with a switch?
```python
foo.as_array()              # Return a NumPy array
foo.as_array(element=True)  # Now same type as foo
```
These methods should redirect to some utility class where the functionality is implemented. This reflects how I think we should tackle this problem in general: access the `.data` attribute (no matter what format it is) in a consistent way within the code. The idea is to abstract away the details of the underlying format.

I firmly believe in utilities for this sort of refactor as they can be 1. understood in isolation, 2. tested in isolation, 3. used in isolation, and 4. extended in future (e.g. xray). Personally I would create a `DataConversion` class containing only classmethods. I suspect many of these classmethods would take a data format as a first argument (e.g. `self.data` in many cases) to return the data in a known format regardless of what the input format was.
Some things I would expect from the conversion utility:

- Use `isinstance` checks to keep things fast. E.g. if a dframe is required and the input is already in this format, use just the pandas API necessary to do the conversion efficiently and return the result.
- Conversion between the supported `.data` formats. This would be used in the constructors of Elements to support casting between elements.
- Something that takes `kdims` and `vdims` and returns the appropriate `Dimension` objects. Again this would help with casting between elements.

If done properly, we could eventually get rid of the table conversion utilities (deprecating them gradually) because we would be able to immediately cast any element into any other type! In addition we would get the benefits of supporting new data formats (heterogeneous types, speed improvements etc.).
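To make the idea concrete, here is a minimal sketch of what such a `DataConversion` utility might look like. The method names, the fallback behaviour, and the classmethod signatures are assumptions for illustration, not the actual HoloViews implementation:

```python
import numpy as np

try:
    import pandas as pd
except ImportError:
    pd = None


class DataConversion(object):
    """Classmethod-only utility converting between supported .data formats."""

    @classmethod
    def as_array(cls, data, columns=None):
        # Fast path: already an array, return it unchanged (isinstance check).
        if isinstance(data, np.ndarray):
            return data
        if pd is not None and isinstance(data, pd.DataFrame):
            # Use only the pandas API needed for an efficient conversion.
            return data.values if columns is None else data[columns].values
        # Fallback: assume an iterable of rows (e.g. NdElement-style items).
        return np.array(list(data))

    @classmethod
    def as_dframe(cls, data, columns=None):
        if pd is None:
            raise ImportError("pandas is required for dataframe conversion")
        if isinstance(data, pd.DataFrame):
            return data
        return pd.DataFrame(np.asarray(data), columns=columns)
```

Because every method is a classmethod taking the raw data as its first argument, the utility can be understood, tested, and used in isolation, exactly as described above.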
So I thought about the issue of inheriting the entire `NdMapping` API again and I've come to the conclusion that the baseclass should not itself inherit from NdMapping and hold an OrderedDict. Instead the data may optionally be an NdElement (the current Table baseclass) type. This means the data of this baseclass is either an array, a dataframe, or an NdElement. This would break old pickles containing Tables but a small legacy wrapper in `__setstate__` could fix this.
This sounds like a great proposal to me. Can the data also be a Blaze object?
(Sent by email in reply to Philipp Rudiger's comment above, Fri, Sep 18, 2015.)
Yes, we've already tested that a little bit on the dataframe branch. This proposal just makes sure that we don't patch in the support all over the place, and that we have a general API to extend in future, e.g. with x-ray.
I've already committed some support for dask arrays to Raster on the `dataframe` branch. That work is related to this proposal but not part of it; as Philipp says, we want to generalize this sort of functionality correctly.
I've talked again to Philipp and I feel it is worth summarizing our thoughts here:

The plan is for the existing element classes (`Chart`, `NdElement`, `DFrame`) to make use of these utilities. One sticking point has been deciding what format to use as a pure-python tabular format (i.e. if pandas is not available). The question has been whether `Table` or `NdElement` should be accepted as `.data`, but we've now decided on the following scheme:
- The `.data` attribute can be 1. a NumPy array, 2. a pandas `DataFrame`, 3. an `NdElement`, or 4. the `.data` attribute of whatever other element is passed into the constructor (i.e. back to one of the first three options).
- `Table` will no longer inherit from `NdElement` (but it will inherit from `Element` at some level). This is a generalized version of `Table` (that also replaces `DFrame`) and will support the three formats mentioned above.
- `NdElement` will be given a rich representation just like `Table` (i.e. it will inherit from `Tabular`) but unlike `Table` it will not support `DataFrame` or numpy arrays in the constructor. It will be the version of `Table` that does not require third-party libraries (i.e. pure python) and will have no subclasses (`Bars` will be updated accordingly).
- `NdElement`, if unified with the other base classes as proposed above, should be renamed. We've proposed the name `Columnar` (or `Columns`?) as that is the main feature of the data, no matter the details of the data structure used. I think I vote for `Columns` as it will be a proper element and should be a noun.
- Something similar (`ArrayLike`?) is needed for `Raster` types that support numpy and dask arrays.

This approach avoids unnecessary nesting of the data held by the `.data` attribute and resolves the issue of how `Table` can work without pandas being available (which would be an issue if a `Table` were allowed as `.data`).
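A minimal sketch of how a constructor could normalize its input to one of the accepted `.data` formats under this scheme. The function name and the pure-python fallback (a plain `OrderedDict` standing in for `NdElement`) are assumptions for illustration:

```python
from collections import OrderedDict

import numpy as np


def normalize_data(data):
    """Return data as one of the accepted .data formats."""
    try:
        import pandas as pd
        if isinstance(data, pd.DataFrame):
            return data                      # option 2: pandas DataFrame
    except ImportError:
        pass
    if isinstance(data, np.ndarray):
        return data                          # option 1: NumPy array
    if hasattr(data, 'data'):
        # Option 4: another element was passed in; unwrap its .data,
        # which takes us back to one of the first three options.
        return normalize_data(data.data)
    # Option 3 stand-in: rows of tuples become key -> value entries,
    # mimicking an NdElement in pure python.
    return OrderedDict((tuple(row[:-1]), tuple(row[-1:])) for row in data)
```

The recursive unwrapping for option 4 is what avoids the unnecessary nesting mentioned above: the wrapped element's data, not the element itself, is stored.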
There is a lot to think about, and we will need to make sure our API abstracts over the different data types while remaining rich enough to be useful in practice (e.g. for implementing operations).
Sounds good. I too prefer Columns to Columnar.
Will it be clear how to add other .data formats in the future, apart from the three above? And is option 1 really required to be a numpy array, or can it be anything supporting a numpy array's interface? E.g. blaze tries to support that.
Will it be clear how to add other .data formats in the future, apart from the three above?
Yes, there will be an `Interface` base class that determines what needs to be implemented, e.g. for xray or maybe unit arrays (from astropy).
can it be anything supporting a numpy array's interface? blaze tries to support that.
From my understanding, you need blaze+dask for out-of-core arrays. I have a prototype of this but unfortunately, it isn't quite a numpy interface: if you have a dask array you need to call the `.compute` method to load the array-like object from disk. See here for the prototype utility I was using to handle this.
In other words, we can support dask arrays as `.data` with the proposed system, but there would need to be some logic (hopefully very little!) to call `compute` in the appropriate interface class.
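The "hopefully very little" logic could look something like the following sketch. The helper name is hypothetical; it relies only on the fact that dask arrays expose a `.compute()` method, so the sketch runs without dask installed:

```python
import numpy as np


def ensure_computed(data):
    """Realize a dask-like object into a concrete array; pass others through."""
    if hasattr(data, 'compute'):
        # dask arrays (and other lazy containers) load their data on .compute()
        return data.compute()
    return np.asarray(data)
```

As noted in the edit below this comment, calling `compute` eagerly like this is the simple option; a proper interface class would instead use the dask API directly to keep computation out-of-core.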
Finally, I am going to assign this to the 1.4.0 milestone as I think this refactor will greatly expand the generality and power of HoloViews. It should be possible to make all this work without breaking pickles (very important!) and I don't think we should delay getting this implemented.
Edit: Using dask arrays by calling `compute` early on in the interface class would be an easy thing to do, but this doesn't use dask very well. A proper `Interface` class that uses the dask API to exploit all the efficient out-of-core computation it offers would be a better idea.
It is also worth mentioning that Philipp and I agree that these `Interface` classes would be a very natural place to add unit support at some point. We should keep this in mind as an additional goal for this system that we will want to think about later on.
This has just been merged, closing the issue.
Now that we've made HoloViews plotting backend independent it's time to talk about generalizing the data backends. Reviewing the current situation I have to say we haven't done quite as good a job as we could have done but all the foundations are there to unify things very nicely and iron out kinks like the lack of support for categorical or date types in various charts. As part of the dataframe refactor we've already begun looking at this but it's been more of an ad-hoc way of patching in support for dataframes.
Today it occurred to me that we already have three different storage formats for table-like data and a lot of code built up to convert and operate on the different data types. They include:
- `NdElement` implements a table as an NdMapping, i.e. a fancy dictionary. For large tables this is very slow.
- `Chart` objects, while never displayed as such, are effectively multiple columns stored in an NxD numpy array, where N is the number of samples and D the number of dimensions/columns.
- `DFrame`, as its name suggests, simply uses a pandas dataframe as the underlying datasource.

It would be nice if all subclasses of these Element types could be unified to use any of these datasources. Currently different Element types use different data formats and it's all a bit of a mess and not very flexible. If we are successful, the following Elements will interchangeably support all three data formats: Curve, Points, VectorField, Bars, Table, ErrorBars, Spread, Scatter, Scatter3D, Distribution, TimeSeries and, with some minor adjustments, Histogram.
I'm not entirely certain about the mechanism yet, but I envision that we introduce a new wrapper class which dispatches any method calls to one of the three baseclasses depending on the type of the data. We would provide a common API for each of the backends so custom Elements, plotting code and the user can access the data in very well defined ways, independent of the data backend. Luckily we've generally made sure that such a common API exists even if we haven't used it consistently.
The current API for Element types is this:

- `dimension_values`: Returns an array, list or pandas series of one column.
- `range`: Returns the minimum and maximum values along a dimension, excluding NaNs.
- `select`: Same as `__getitem__` but works for one or multiple dimensions.
- `dframe`: Converts data to a dataframe.
- `collapse_data`: Applies a function across a list of Element `.data` attributes.
- `reduce`: Applies a reduce function across one or multiple axes.
- `sample`: Returns a Table of the samples specified as lists.
- `groupby`: Groups by the values along one or multiple dimensions, returning an NdMapping type indexed by the grouped dimensions with Elements of the same type as values. (Not implemented for Charts.)
- `reindex`: Reorders specified dimensions, dropping unspecified dimensions. (Not implemented for Charts.)

First of all, I would like a better name for `dimension_values`; it's one of the most important methods as it returns data along one column/dimension. The obvious suggestion would be `values`, but that clashes with the dict interface of NdMapping types. Would it be terrible to let values accept an optional argument to specify the dimension, otherwise returning all value dimensions like it does currently?

Secondly, I would suggest a few additions to the API, mainly data conversion methods, which simply clone the Element but convert between the three data formats, something like:
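The code block that originally followed here appears to have been lost in extraction; the reply elsewhere in this thread quotes the names `.as_array`, `.as_dframe` and `.as_mapping`, so a plausible reconstruction of the clone-and-convert pattern, using a toy array-backed element, is:

```python
from collections import OrderedDict

import numpy as np


class Element(object):
    """Toy element whose data is an (N, D) array of rows."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def clone(self, data):
        return type(self)(data)

    def as_array(self):
        # Clone with the data as a NumPy array (already the case here).
        return self.clone(self.data)

    def as_mapping(self):
        # Key columns map to the value column, NdMapping-style. The real
        # proposal would clone the element; the raw mapping is returned
        # here to keep the sketch short.
        return OrderedDict((tuple(r[:-1]), r[-1]) for r in self.data)
```

An `as_dframe` method would follow the same pattern, cloning with the data wrapped in a pandas DataFrame.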
Conversion between Element types would now be trivial since they support the same data formats, and I would also deprecate the `table` and `dframe` methods.

Finally, I'd like to suggest that we also have a common NumPy-like indexing interface. For the array format it would just index into the data directly, the dataframe version can just use the `df.ix[idx]` interface, and the implementation for the NdMapping-based tables is also trivial. For some operations this is going to be significantly faster than going via our value-based indexing, and it would be nice to do this in a backend-agnostic way.
The great thing about this is that most of the implementation for each of these data backends is already there in the `Chart`, `NdElement` and `DFrame` classes. I think unifying them in this way will really clean up our data API and allow us to do some more powerful things with different interfaces in future. Once we've figured this out we'll have to consider how to extend it to other Element types like Paths and Raster types.

In terms of timeline, I think this could be done in one or two days of concentrated work. It might even make sense to merge our intermediate attempt at integrating dataframe support as soon as we're satisfied it works, and then carry this refactor out for a v1.5 release.
Edit: The API is actually a little bit larger than what I've described above, as it will necessarily include the NdMapping API. While `keys` won't usually be helpful, `items` will basically be the API to convert to an `NdMapping`-like representation, and both `drop_dimension` and `add_dimension` will be useful. Having `get` and `pop` on Chart types is somewhat annoying, though.