pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.42k stars 17.84k forks source link

QST: Subclassing and concat #35415

Open EmilianoJordan opened 4 years ago

EmilianoJordan commented 4 years ago

I was looking into df.astype not respecting subclassed objects and noticed the issue had been resolved by modifying pd.concat to respect subclassed objects using the following code:

cons = self.objs[0]._constructor

Line 464 PR #29627

cons = self.objs[0]._constructor_expanddim

Line 477 PR #33884 (Set to be released on 1.1.0)

This assumes homogeneous object types, or at the very least that the first object is representative. My assumption was that functions on the pandas namespace did not respect subclassing and only methods on the objects did. Leaving developers to implement their own special use cases for subclassing. This seemed like a good delimitation to me, both logically and programmatically. But, now diving deeper into functions I don't regularly use I see I was mistaken as merge and merge_ordered both respect the left arg's subclassing.

With this in mind, I have some questions.

  1. Should there be more documentation around subclassing? This would include how concat, merge, merge_ordered etc handle subclassing.
  2. If functions on the pandas names space are going to implement varying degrees of support for subclassing then, should there be a constructor kwarg added to functions like read_sql and read_csv?
  3. Does to_hdf not working with subclassing need to be fixed?
jbrockmendel commented 4 years ago

cc @TomAugspurger

TomAugspurger commented 4 years ago

Generally speaking, we're happy to accept PRs for subclassing things as long as they don't affect the typical user's experience. Switching things like DataFrame to use ._constructor is a good example.

Should there be more documentation around subclassing?

Sure. We have a page dedicated to subclassing.

should there be a constructor kwarg added to functions like read_sql and read_csv?

IMO, no. Most users wouldn't ever need it so I wouldn't want it in the signature.

Does to_hdf not working with subclassing need to be fixed?

I'm not familiar with the issue.

mitar commented 3 years ago

I was looking into df.astype not respecting subclassed objects

I looks like it still does not work when astype is provided as a dict: https://github.com/pandas-dev/pandas/issues/40810

EmilianoJordan commented 2 years ago

I'd like to clarify the behavior in my original post for this issue with a code example:

This is intended to show that either the 'left' value or the first value (in the case of concat) is respected in the case of subclassing. In my opinion the better design decision would be to make these operations commutative. This could be accomplished by only returning a subclass when the objects are homogeneous in type and a pandas object should be returned in all other cases.

No matter what decision is made here I believe this should be documented in the subclassing documentation.

import pandas as pd

class SubclassedSeries(pd.Series):
    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame

class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries

subclassed_series = SubclassedSeries(range(10,20), name='test')
s = pd.Series(range(10,20), name='test')

# pandas.merge respects lef class type
t = pd.merge(subclassed_series, s)
assert isinstance(t, SubclassedDataFrame)

t = pd.merge(s, subclassed_series)
assert not isinstance(t, SubclassedDataFrame)

# pandas concat behavior
t = pd.concat([subclassed_series, s])
assert isinstance(t, SubclassedSeries)

t = pd.concat([s,subclassed_series])
assert not isinstance(t, SubclassedSeries)

t = pd.concat([subclassed_series, s], axis=1)
assert isinstance(t, SubclassedDataFrame)

t = pd.concat([s,subclassed_series], axis=1)
assert not isinstance(t, SubclassedDataFrame)