pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Easier subclassing #60

Closed yonatanf closed 10 years ago

yonatanf commented 13 years ago

Hi,

Currently, subclassing pandas objects is not as easy as it could be. This is because many methods explicitly create instances of specific classes. For example, the `_combine_const` method from `matrix` explicitly returns a DataMatrix object:

    def _combine_const(self, other, func):
        if not self:
            return self

        # TODO: deal with objects
        return DataMatrix(func(self.values, other), index=self.index,
                          columns=self.columns)

Therefore, arithmetic operations (and some other operations) performed on a new class MyDataMatrix, inheriting from DataMatrix, return a DataMatrix object rather than a MyDataMatrix object.

I can get around the problem with the ugly hack of overriding the problematic methods and forcing the output to the new class, e.g.:

    def _combine_const(self, other, func):
        temp = super(MyDataMatrix, self)._combine_const(other, func)
        temp.__class__ = self.__class__
        return temp

However, I feel it would be easier to just change the original methods to return the class of the calling object rather than a fixed class, e.g.:

    def _combine_const(self, other, func):
        if not self:
            return self

        # TODO: deal with objects
        return self.__class__(func(self.values, other), index=self.index,
                              columns=self.columns)

Hope this makes sense, and I'm not missing something. Thanks for a very useful package!

wesm commented 13 years ago

Since I've recently refactored DataFrame and DataMatrix into a single class, this should be much easier and more consistent to do now. I'm curious how and why you're creating a subclass? Is it to add functionality that's not there? It might be easier in that case to monkey-patch in methods like:

DataFrame.new_method = new_method

if you give me some use cases it would be helpful!
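The monkey-patching approach suggested above might look like this in a full script; `col_means` is an invented example method, not part of pandas:

```python
import pandas as pd

# Attach a custom method to DataFrame without subclassing.
def col_means(self):
    """Return the mean of each numeric column."""
    return self.mean(numeric_only=True)

pd.DataFrame.col_means = col_means  # monkey-patch onto the class

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.col_means())
```

Every existing and future DataFrame instance then picks up the method, which is what makes this attractive for purely additive functionality.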

yonatanf commented 13 years ago

Thanks for the quick reply. Glad to hear this is easier in the latest version.

I'm subclassing since I have a particular kind of dataset I want to work with. Namely, I'm creating an ecological survey class, inheriting from DataMatrix, where rows correspond to samples and columns to organisms (e.g. how many lions, zebras, and giraffes were observed at 5 different water holes). The goals of creating a dedicated subclass are:

  1. Add custom methods like custom normalization and filtering. Monkey-patching works well for this.
  2. Add row/col metadata attributes (e.g. temperature and pH of sampling sites). These should stay consistent with the DataMatrix, in the sense that the indexing is preserved during manipulations like transposing, sorting, reindexing, etc. With the metadata in place, I could add methods to sort/group/filter by it. I think monkey-patching is less appropriate for this purpose, as it requires overriding some of DataMatrix's methods. I could instead add extra rows/cols containing the metadata. However, these often contain non-numerical data, making that option a little awkward to work with.

Hope this is clearer now.

wesm commented 13 years ago

I pushed some changes today that should make subclassing easier, maybe update to the git HEAD and give it a shot.

I don't have a good understanding of your point 2). Could you provide a bit more detail? Not immediately clear to me why you can't just store that data as columns in the DataFrame (unless you need to do row-oriented computations on the data-- though some of this is possible even if you have heterogeneously-dtyped columns). A lot of people do that. I just want to be sure that your need to subclass to solve point 2) is not caused by some deficiency in the data structure.

Further, I would likely be interested in adding methods like you're describing that are sensitive to metadata-- if it's generic enough could be a nice addition. You can already do group by with a column containing an indicator of sorts, e.g.:

for name, group in df.groupby('column_name'):
    ...

etc.

yonatanf commented 13 years ago

Thanks, I'll update and give it a try. Everything below refers to v. 0.3.0.

My motivation for adding metadata attributes, rather than appending additional cols/rows is the following:

a) With mixed dtypes, non-numeric cols get dropped when an arithmetic operation is performed. For example, in the following snippet, column 'meta' will be missing from y:

    x = DataMatrix([[0, 1, 2]], index=['ro'], columns=['c0', 'c1', 'c2'])
    x['meta'] = 'bar'
    y = x + 1

This seems easy to fix, or at least the behavior could be made an option: ignore non-numeric data rather than dropping it.
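With the modern DataFrame API (DataMatrix no longer exists), one workaround for (a) is to apply arithmetic only to the numeric columns and carry the non-numeric ones through unchanged; a minimal sketch:

```python
import pandas as pd

x = pd.DataFrame([[0, 1, 2]], index=["ro"], columns=["c0", "c1", "c2"])
x["meta"] = "bar"

# Add 1 to the numeric columns, then re-attach the non-numeric ones.
num = x.select_dtypes(include="number")
y = (num + 1).join(x[x.columns.difference(num.columns)])
print(y)
```

Column order may differ from the original, but 'meta' survives the arithmetic.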

b) I often wish to treat metadata differently than the 'main' data, be it numeric or not. For example, consider the following object:

    x = DataMatrix([[100, 500, 20], [200, 50, 12]], index=['New York', 'Boston'],
                   columns=['dogs', 'cats', 'Mean Temperature [c]'])

giving the number of dogs and cats in different cities, and the mean annual temperature in these cities. One may want to normalize the number of dogs/cats by their total across all cities, but keep the unnormalized temperatures. Currently, this requires remembering which columns are metadata columns and treating them differently, which can be a hassle and can lead to mistakes.
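The "remember which columns are metadata" workflow described above might look like this with the modern API (column names adapted to valid identifiers); the bookkeeping list `counts` is exactly the hand-maintained state being complained about:

```python
import pandas as pd

x = pd.DataFrame([[100, 500, 20], [200, 50, 12]],
                 index=["New York", "Boston"],
                 columns=["dogs", "cats", "mean_temp_c"])

# Normalize only the count columns; the temperature column is untouched.
counts = ["dogs", "cats"]  # must be tracked by hand
x[counts] = x[counts] / x[counts].sum()
print(x)
```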

Currently, I'm supporting metadata by adding attributes to the DataFrame/Matrix objects; these attributes are themselves DataFrame/Matrix objects holding the metadata. I've got a very crude version of this working, but I'll update it to the latest pandas version and add methods for sorting/filtering/grouping by metadata.
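In much later pandas versions, this idea of metadata attributes was formalized: a subclass can list attribute names in `_metadata` and pandas will attempt to propagate them through operations via `__finalize__`. A hedged sketch (class and attribute names are illustrative):

```python
import pandas as pd

class SurveyFrame(pd.DataFrame):
    # Attributes listed here are propagated to results of operations.
    _metadata = ["site_info"]

    @property
    def _constructor(self):
        return SurveyFrame

sf = SurveyFrame({"lions": [3, 0], "zebras": [10, 4]},
                 index=["hole_1", "hole_2"])
# Per-site metadata stored alongside, not inside, the data columns.
sf.site_info = pd.Series({"hole_1": 31.5, "hole_2": 28.0}, name="temp_c")

sliced = sf[["lions"]]
print(type(sliced).__name__, sliced.site_info["hole_1"])
```

Note that not every operation propagates `_metadata`; coverage has improved over pandas versions but is not guaranteed everywhere.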

wesm commented 13 years ago

Cool let me know how it goes. I agree that maybe the sensible default behavior when you do arithmetic with mixed dtypes is to just ignore the non-numeric data and let everything else pass through. There are some issues I haven't thought much about like, what would DataFrame + Series yield if the DataFrame contains mixed-dtype data? In R land this isn't exactly a solved problem and is very much DIY, but crafting some kind of flexible solution would be nice. Like you might want something like:

grouped = df.groupby('metacol1')
transformed = grouped.transform({'col1' : do_something,
                                'col2' : do_something_else})

so you can selectively apply transforms and leave the other columns unaltered.
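One way to express the selective per-column transform sketched above with the current groupby API (column names and transform functions are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "metacol1": ["a", "a", "b", "b"],
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [10.0, 20.0, 30.0, 40.0],
})

grouped = df.groupby("metacol1")
out = df.copy()
# Demean col1 within each group; normalize col2 by each group's sum.
out["col1"] = grouped["col1"].transform(lambda s: s - s.mean())
out["col2"] = grouped["col2"].transform(lambda s: s / s.sum())
print(out)
```

The grouping column itself passes through unaltered, which matches the "metadata stays as-is" behavior discussed in the thread.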

lodagro commented 12 years ago

Currently, both Series and DataFrame have a _constructor property. However, it is not used consistently, and consistent use is what subclassing needs.

The way I understand _constructor is to be used when subclassing as follows:

class MySeries(pandas.Series):
    @property
    def _constructor(self):
        return MySeries

Series:

DataFrame:

Why do I want to subclass?
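The `_constructor` pattern above, extended into a runnable sketch: in pandas versions where `_constructor` is honored consistently, operations on the subclass return the subclass rather than a plain Series.

```python
import pandas as pd

class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Tell pandas which class to use when building results.
        return MySeries

s = MySeries([1, 2, 3])
t = s + 1  # arithmetic result is built via _constructor
print(type(t).__name__)
```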

wesm commented 12 years ago

You don't need to subclass to do the first two things on your list--you can just add methods to Series, e.g.:

def f(self, *args, **kwargs):
    # do something

Series.method_name = f

This goes for modifying describe also.

The last item is a bit trickier. It indeed might be nice to add more metadata to DataFrame-- I agree it should be easier to subclass, though. Subclassing DataFrame should be much more straightforward than subclassing Series, it's just a matter of consistency. The only real way is to write a test suite for a subclassed DataFrame and start hammering down all the issues.

BrenBarn commented 12 years ago

Was there a final decision on this? I see that the issue was closed, but I don't see an explanation. I too would like to see DataFrames (and other pandas classes) become more amenable to subclassing. Monkey-patching seems much more hackish, and it also doesn't allow for cases where you want custom initialization. The way pandas is now, with class names hard-coded in individual methods instead of using type(self) or similar, is rather fragile.

It would really be nice if it were possible to subclass pandas classes in such a way that everything transparently "worked" with creating the custom subclasses instead of the basic Pandas classes. This would make it possible to create custom DataFrames for different applications. These could, for instance, store extra metadata or computed statistics automatically.

For my own case, I was hoping to extend DataFrame to allow a more succinct subsetting syntax, where essentially df._ColName(val) is shorthand for df.ix[df['ColName']==val], and df._ColName(func) is shorthand for df.ix[func(df['ColName'])]. I have a little data-frame library that I wrote myself that uses this approach, and it's very handy for interactive exploration and slicing of datasets. However, I wasn't able to accomplish this in a useful way, because indexing into my subclass returns a pandas DataFrame and not an instance of my subclass. It would be great if pandas allowed this.
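The shorthand described above could be prototyped today with `__getattr__` on a DataFrame subclass; this is a hypothetical sketch (class name, guard logic, and semantics are illustrative, and `__getattr__` on pandas subclasses needs care to avoid recursing into internals):

```python
import pandas as pd

class ShorthandFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return ShorthandFrame

    def __getattr__(self, name):
        # Guard pandas internals and dunders so construction stays safe.
        if name.startswith("__") or name == "_mgr":
            raise AttributeError(name)
        if name.startswith("_") and name[1:] in self.columns:
            col = name[1:]
            def select(val):
                # df._Col(func) filters by func; df._Col(val) by equality.
                mask = val(self[col]) if callable(val) else self[col] == val
                return self[mask]
            return select
        return super().__getattr__(name)

df = ShorthandFrame({"city": ["NY", "Boston", "NY"], "n": [1, 2, 3]})
print(df._city("NY"))
print(df._n(lambda s: s > 1))
```

Because `_constructor` returns the subclass, the filtered results are themselves ShorthandFrames, so the shorthand chains.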

wesm commented 12 years ago

My position on this is that I would like Series/DataFrame/etc. to be easier to subclass, but it's not a priority for me and I can't afford to spend any time on it for the foreseeable future. If there were some financial support for it, that would be a different story.

That being said, I will happily accept pull requests or otherwise code contributions that make the changes necessary to make DataFrame more amenable to subclassing.

BrenBarn commented 12 years ago

It seems there are some changes that could be made pretty easily. As lodagro mentioned in an earlier comment, DataFrame has a _constructor property which appears to be set up to parameterize self-instantiation, but it's not used in most cases. A couple of methods call self._constructor(...), but most just have DataFrame(...) hard-coded. Within the DataFrame class, simply replacing all calls to DataFrame() with calls to self._constructor() would make a considerable difference.

dandavison commented 12 years ago

I started to create a subclass of Series to model discrete probability distributions before coming across this problem. It can definitely be done with composition instead, but I do think not being able to subclass easily is a trap which will surprise users.

@wesm I see your post above regarding priority for this; just adding a +1 to show it would be appreciated if someone does it.

yarivm commented 11 years ago

+1 vote to make DataFrame easier to subclass ASAP. This issue should be reopened.

maaku commented 11 years ago

This was really confusing for me. I wasted an hour and a half trying to figure out why the MyClass(pd.Series) constructor was returning pd.Series instances. Should have done a Google search first.

In my case, I'm modifying the behavior of Series for an accounting application I'm writing, in particular how a couple of arithmetic operators work. I ended up tucking the array into an instance variable instead and implementing the arithmetic operators I needed.

It would have been much more Pythonic if pandas used self.__class__() to create new instances.
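The composition approach mentioned above (wrapping the data in an instance variable and implementing only the operators you need) might look like this; class and method names are invented for illustration:

```python
import pandas as pd

class Ledger:
    """Wrap a Series rather than subclass it; delegate selectively."""

    def __init__(self, amounts):
        self._s = pd.Series(amounts)

    def __add__(self, other):
        # Unwrap other Ledgers so Series handles the actual arithmetic.
        other = other._s if isinstance(other, Ledger) else other
        return Ledger(self._s + other)

    def total(self):
        return self._s.sum()

a = Ledger([10.0, -2.5])
b = a + a
print(b.total())
```

Composition sidesteps the `_constructor` problem entirely, at the cost of having to forward every pandas method you want to expose.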

wesm commented 11 years ago

Converting this to an open issue until someone has a chance to work on it. It's still not a development priority for me.

sebpiq commented 11 years ago

Has there been any progress on this? If not, I can try... since I need this as well. Any idea which files would have to be modified? Is it enough to modify all the data structures (Index, Series, DataFrame, ...), or is there some hidden nasty stuff that'll have to be modified as well?

jreback commented 11 years ago

Take a look at these PRs (not completely merged yet); they provide experimental support for a generic subclass of Panel. The panelnd module has factory methods to create these classes.

https://github.com/pydata/pandas/pull/2407 (not merged yet) - docs and such
https://github.com/pydata/pandas/pull/2242 - the main implementation

sebpiq commented 11 years ago

I feel dumb... I can't even manage to build the library:

    > python setup.py build_ext --inplace
    running build_ext
    building 'pandas.index' extension
    gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/home/spiq/pandas/env/local/lib/python2.7/site-packages/numpy/core/include -Ipandas/src/klib -Ipandas/src -I/usr/include/python2.7 -c pandas/index.c -o build/temp.linux-i686-2.7/pandas/index.o
    gcc: error: pandas/index.c: No such file or directory
    gcc: fatal error: no input files
    compilation terminated.
    error: command 'gcc' failed with exit status 4

Any pointers?

jreback commented 11 years ago

Make sure you update to the current master; a lot of things moved around recently (especially the Cython/C code).

sebpiq commented 11 years ago

Arr... that's taking too much time. I was a bit too optimistic, I guess, thinking that I could change everything without even knowing the code. For what it's worth, here is what I did: 98e1fadbf9395e88115044db3e5fc8e8f3a46012. There are ~5 errors which I couldn't solve, and ~15 failures in the tests. Sorry for all the fuss. I'll find a hack for my thing for the moment, but I'll follow the progress of this.

jreback commented 10 years ago

Closing... this is pretty easy now.