pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.95k forks source link

DEPR: __finalize__ and _metadata #51280

Open jbrockmendel opened 1 year ago

jbrockmendel commented 1 year ago

We have some support for propagating metadata, but it is incomplete (38 xfails in tests/generic/test_finalize.py) for plain methods and has no real prospect for getting robust support for concat/merge etc.

At the same time, there is a real performance cost to calling __finalize__ everywhere, in an extreme case like #46505 these calls take up a quarter of the runtime.

AFAICT this is useful for a fraction of users and perf-hurting for everyone else. Do we have any idea how common it is to rely on these?

cc @MarcoGorelli

MarcoGorelli commented 1 year ago

~attrs already says~

~attrs is experimental and may change without warning.~

~TBH I'd just remove it as a breaking change~

EDIT: as clarified below, although __finalize__ does set attrs, this issue is specifically about the former. See https://github.com/pandas-dev/pandas/issues/52166 for discussion around attrs

MarcoGorelli commented 1 year ago

~> 38 xfails in tests/generic/test_finalize.py~

~And that's without counting those which are failing, but which the test isn't catching because it's written incorrectly: https://github.com/pandas-dev/pandas/issues/49916~

~> Do we have any idea how common it is to rely on these?~

~From my scrape of the 100 most upvoted Python notebooks on Kaggle, it appears exacty 0 times~

~If anyone's using it, then their code's already broken as attrs never worked to begin with~

EDIT: as clarified below, although __finalize__ does set attrs, this issue is specifically about the former. See https://github.com/pandas-dev/pandas/issues/52166 for discussion around attrs

mroeschke commented 1 year ago

We do use this in the interchange protocol:

https://github.com/pandas-dev/pandas/blob/180d81ff85732057e407426faf05020044e23ed6/pandas/core/interchange/from_dataframe.py#L135

jorisvandenbossche commented 1 year ago

I think this is mixing different things: __finalize__ and _metadata are long-standing documented features available for developers subclassing DataFrame/Series (https://pandas.pydata.org/docs/dev/development/extending.html#subclassing-pandas-data-structures, __finalize__ actually isn't mentioned, but _metadata is and works through __finalize__ propagating it), while attrs is a newer (officially still experimental?), end-user facing attempt to provide some form of metadata.

Do we have any idea how common it is to rely on these?

From my scrape of the 100 most upvoted Python notebooks on Kaggle, it appears exacty 0 times

For the __finalize__/_metadata, you don't have to look for public notebooks, but for packages depending on pandas that implement subclasses. And to give one example: GeoPandas makes use of both __finalize__ and _metadata.

MarcoGorelli commented 1 year ago

Thanks @jorisvandenbossche !

A question is have is - is GeoPandas affected by the many bugs in __finalize__?

EDIT: I'll take a look at how geopandas uses them, let's not rush anything

jbrockmendel commented 1 year ago

Looks like geopandas uses __finalize__ mostly for merge/concat, which is not a place with an easy "ask them to do X instead". The _metadata they define is _geometry_column_name which looks like it has a reasonable default value that users can override. IIUC this is used to check that when you concat/merge you don't end up with duplicates of that particular label.

Also our own flags is propagated via __finalize__. The only flag is allows_duplicate_labels. I don't see a clear way to deprecate __finalize__ without also deprecating flags.

I'd be OK with deprecating flags and having allow_duplicate_labels checking move to third-party validation libraries.

jorisvandenbossche commented 1 year ago

Looks like geopandas uses __finalize__ mostly for merge/concat,

GeoPandas customizes __finalize__ for merge/concat, but we still make use of the default __finalize__ (propagating _metadata) for many other operations where pandas ensures __finalize__ is called.
So we basically use __finalize__ for any method on a GeoDataFrame that we inherit from pandas.DataFrame and that returns a new GeoDataFrame (to ensure that the geometry name information is preserved).

A question is have is - is GeoPandas affected by the many bugs in __finalize__?

Occasionally we run into an issue where the metadata is not properly propagated from eg a missing __finalize__ call, but then we can fix this, and go on. But generally it's not a big problem at all. GeoPandas is a happy user of this feature. (and it's not because a feature has bugs that that in itself is a reason to throw it away, otherwise we can just stop with developing the full of pandas ;))

Performance is certainly an important aspect that I don't want to just ignore. But I think we should then do some more analysis of the extent of the problem in actual use cases, to know if this is actually a problem (or how big it is, and to see if there are ways to address it without throwing away the feature in its entirety).
https://github.com/pandas-dev/pandas/issues/46505 is one such example, but I want to note that I would consider that as quite a corner case that also slowed down for other reasons than only __finalize__ (the reproducer there uses 500_000 groups each consisting of just 1 row, so creating a lot of tiny DataFrame or Series objects, exaggerating the slow down you will typically see). And one that we can also already improve without removing __finalize__ (will comment on that issue about that).

jbrockmendel commented 1 year ago

Another option (i think this was @phofl's idea) would be to keep __finalize__ but make it a no-op, so libraries could override it if they need the metadata/attrs/flags bits.

jorisvandenbossche commented 1 year ago

I would propose to let this issue focus on the mechanics of how we propagate some information (using __finalize__ and _metadata), and whether we want to change that, how it influences subclasses, .. (what most of the discussion above was about, I think). So we can have a separate discussion about the actual user-facing features of attrs and flags. (I know they are related, but for example if we evaluate attrs and flags on its merits, we might decide they are not worth including in pandas, without that we necessarily deprecate __finalize__ and _metadata for subclasses. And at the same time (although probably unlikely) we could also decide to remove the subclassing functionality while keeping flags and attrs, since we are free to implement those under the hood however we want, they don't need to use what is currently __finalize__)

TomAugspurger commented 1 year ago

Do we have any user feedback that metadata propagation causing issues? Or that it's not worth the slowdown they do incur?

It seems like xarray manages to do this successfully. @dcherian have you heard feedback from users about metadata propagation being a bottleneck in xarray?

jorisvandenbossche commented 1 year ago

Do we have any user feedback that metadata propagation causing issues? Or that it's not worth the slowdown they do incur?

I personally have never observed this to be a bottleneck in real-world benchmarks that I encountered myself. We do have a recent report about this at https://github.com/pandas-dev/pandas/issues/46505. But I think one could argue that this example is showing this is not necessarily a big problem: the specific example is a good "worst case" example (it is doing a groupby using 500,000 groups each consisting of just 1 row, so creating a lot of tiny DataFrame or Series objects, exaggerating the slow down you will typically see from the constructor overhead), it is also an example of something that is relatively slow anyway (applying a python function and not using one of the built-in cythonized options), and so in this worst case scenario the finalize/metadata propagation is then "only" giving a 30% overhead (10 -> 13s, based on https://github.com/pandas-dev/pandas/pull/51109#issuecomment-1483371816).

If performance is the main argument that we would want to deprecate the related features to metadata propagation (attrs/flags, https://github.com/pandas-dev/pandas/issues/52165 and https://github.com/pandas-dev/pandas/issues/52166, where it is currently the only argument), I think we need some more investigation / proof that this is actually a problem.

It could also be useful to have some more detailed profiling of what aspect of it takes time / could be optimized (eg if it's mostly the flags propagation, we maybe could have a private attribute that is set if flags has been manually set and isn't the default, so we can avoid flags propagation in finalize in the default case)

jbrockmendel commented 1 year ago

It isn't hard to come up with examples where __finalize__ takes a significant portion of the runtime. e.g. ser = pd.Series(range(300_000)); %prun -s cumtime for n in range(10**4): ser[:30] clocks __finalize__ at just shy of 14% of the runtime.

But the issue isn't that its a big overhead it is that it is an everywhere overhead. And its one we incur in order to support a feature that is not-- and has no real prospect of becoming-- fully functional.

It could also be useful to have some more detailed profiling of what aspect of [__finalize__] takes time / could be optimized

Related: #52183 saves a little bit by checking _attrs/_flags instead of attrs/flags.

dcherian commented 1 year ago

It seems like xarray manages to do this successfully. @dcherian have you heard feedback from users about metadata propagation being a bottleneck in xarray?

Not a performance bottleneck no.

The issues are around propagating attributes, and solving attribute conflicts when merging/concatenating.

jorisvandenbossche commented 1 year ago

It isn't hard to come up with examples where __finalize__ takes a significant portion of the runtime. e.g. ser = pd.Series(range(300_000)); %prun -s cumtime for n in range(10**4): ser[:30] clocks __finalize__ at just shy of 14% of the runtime.

To be explicit: 14% of something that takes 25µs, so we are talking about around 4 microseconds. So while being a significant portion of the construction time, it is insignificant overhead in most cases. Only in corner cases where one creates a huge amount of tiny objects (like the one mentioned above), this will become actually significant in absolute time. And I don't want to disregard such cases, but we have to be honest that such cases are slower anyhow, also for other reasons.

And to repeat myself from above (https://github.com/pandas-dev/pandas/issues/51280#issuecomment-1484680456): there might be ways that we could further reduce this overhead inside __finalize__ without throwing away the feature.

And it's one we incur in order to support a feature that is not-- and has no real prospect of becoming-- fully functional.

Whatever you think of it, it is a feature that is already useful for many of our users, right now. For subclasses this has already been functional for years, for the public attrs/flags it is also functional in many basic cases and there has been a lot of work to improve this if I go on the activity and linked PRs in https://github.com/pandas-dev/pandas/issues/28283

MarcoGorelli commented 1 year ago

Would it be possible (or necessary?) to use __finalize__ to propagate sortedness flags, which could be used to fast-path some algorithms? E.g. in resample, instead of checking ax.is_monotonic_increasing (which may not already have been cached), just checking the flag might save time and make up for any __finalize__ overhead?

jbrockmendel commented 1 year ago

It's possible, but I'm skeptical that this is the right tool for the job. 5ish years ago I implemented Series/DataFrame subclasses that propagated column-specific metadata and it was a giant hassle. I concluded at the time that doing this at the EA level made more sense

Haeri commented 1 year ago

I tried attr, finalize, _metadata and subclasses yet nothing seems to work. Does anyone have a working solution as of today for attaching an attribute to a series?

import pandas as pd

data = {
    'name': ['A', 'B', 'C'],
    'age': [1, 2, 3],
    'country': ['CH', 'DE', 'GB']
}

class SubclassedSeries(pd.Series):
    _metadata = ["id"]

    @property
    def _constructor(self):
        return lambda *args, **kwargs: SubclassedSeries(*args, **kwargs).__finalize__(self)

    @property
    def _constructor_expanddim(self):
        return lambda *args, **kwargs: SubclassedDataFrame(*args, **kwargs).__finalize__(self)

    def __init__(self, *args, **kwargs):
        self.id = kwargs.pop('id', None)
        super().__init__(*args, **kwargs)

class SubclassedDataFrame(pd.DataFrame):
    _metadata = ["id"]

    @property
    def _constructor(self):
        return lambda *args, **kwargs: SubclassedDataFrame(*args, **kwargs).__finalize__(self)

    @property
    def _constructor_sliced(self):
        return lambda *args, **kwargs: SubclassedSeries(*args, **kwargs).__finalize__(self)

df = SubclassedDataFrame(data)
for idx, col in enumerate(df.columns):
    df[col].id = "xxx" + str(idx)
    df[col].attrs = {"id": "xxx" + str(idx)}

print("------------- Before Rename -------------")
for col in df.columns:
    print(type(df[col]))
    print(df[col].attrs)
    print(df[col].id)

df.rename(columns={'country': 'countries'}, inplace=True)

print("------------- After Rename in place-------------")
for col in df.columns:
    print(type(df[col]))
    print(df[col].attrs)
    print(df[col].id)

this gives:

------------- Before Rename -------------
<class 'transform.SubclassedSeries'>
{'id': 'xxx0'}
xxx0
<class 'transform.SubclassedSeries'>
{'id': 'xxx1'}
xxx1
<class 'transform.SubclassedSeries'>
{'id': 'xxx2'}
xxx2
------------- After Rename inplace-------------
<class 'transform.SubclassedSeries'>
{}
None
<class 'transform.SubclassedSeries'>
{}
None
<class 'transform.SubclassedSeries'>
{}
None

I am trying to establish some sort of column lineage and would be nice to attach a unique id to a series such that even after a rename I could map the new series back to the original column.

Flix6x commented 1 year ago

Does anyone have a working solution as of today for attaching an attribute to a series?

https://github.com/SeitaBV/timely-beliefs/blob/3675469f0f4128e698335e47432d14307a369e40/timely_beliefs/beliefs/classes.py#L661

Haeri commented 1 year ago

@Flix6x Thanks for the link but there is a lot of code and I don't fully understand which part makes attribute retention work. Care to highlight how it works or give a minimal example?