pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.93k stars 18.03k forks source link

DEPR: attrs #52166

Open jbrockmendel opened 1 year ago

jbrockmendel commented 1 year ago

Discussion broken off from https://github.com/pandas-dev/pandas/issues/51280 PR #52152

Propagation of attrs in __finalize__ is a small-but-everywhere performance hit that we should deprecate.

jorisvandenbossche commented 1 year ago

For context, this feature was added in pandas 1.0 (https://github.com/pandas-dev/pandas/pull/29062, cc @TomAugspurger).

I personally have no idea how much attrs specifically is being used since it was introduced, but in general the ability to store metadata is a topic that has come up a lot. From a quick browsing of our issues, some related ones:

(from the last two issues, it seems there is certainly some user interest in the specific attrs feature)

TomAugspurger commented 1 year ago

From Joris in https://github.com/pandas-dev/pandas/issues/51280#issuecomment-1484680456

If performance is the main argument that we would want to deprecate the related features to metadata propagation (attrs/flags, https://github.com/pandas-dev/pandas/issues/52165 and https://github.com/pandas-dev/pandas/issues/52166, where it is currently the only argument), I think we need some more investigation / proof that this is actually a problem.

That's been bugging me too. I haven't looked at the performance, but copying the metadata should just be a dictionary merge / update.

At the end of the day we'll be making a value judgement: is the performance cost worth it. We'll need a clearer idea of performance cost.

jbrockmendel commented 1 year ago

where it is currently the only argument

The other argument is that attrs/_metadata is only half-implemented, with a bunch of the test_finalize tests xfailed and a bunch more just wrong. And there is no real prospect of getting these fully working.

If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways.

jorisvandenbossche commented 1 year ago

If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways.

That is not really true I think. attrs allows users to use this for standard DataFrames, _metadata allows subclasses to use custom metadata that is not directly exposed to users through attrs.

MarcoGorelli commented 1 year ago

And there is no real prospect of getting these fully working.

Personally, this is the argument I find most persuading

I encountered this in the USC contract too, they said they couldn't use attrs as they'd tried to and it was too unreliable (having tried fixing up the finalize tests, I'm not surprised)

One could make the argument that some feature not working completely isn't a reason to deprecate it, but I'm not sure that's valid if the feature isn't being worked on (by contrast, datetime parsing has bugs, but it's actively being worked on, so the prospect of fixing them is realistic). Given that it's marked as "experimental" anyway, I'd suggest just deprecating/removing it, rather than leaving it hanging around unfinished.

As for users wanting to store metadata - does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are.

As for what users should do - I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves

jorisvandenbossche commented 1 year ago

but I'm not sure that's valid if the feature isn't being worked on

To be clear I am not working on this myself, so I don't know the details. But I am not sure that this is true that it is not being worked on: judging by the the activity and linked PRs in https://github.com/pandas-dev/pandas/issues/28283, there is some work going on to improve this? (it might have slowed down the last months, but for example generally speaking for the year 2022, quite some PRs have been merged related to this)

I think the bigger problem is that there is no longer an active champion following up on this within the core team

TomAugspurger commented 1 year ago

I can chip away at these as I have free time.

As for users wanting to store metadata - does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are.

xarray does, and I think is a good analog here.

MarcoGorelli commented 1 year ago

I can chip away at these as I have free time.

Awesome!

I think the bigger problem is that there is no longer an active champion following up on this within the core team

Yeah if someone's willing to step up and champion it (like it looks Tom might be doing?) then I have no objections to salvaging this, apologies for having made some too heavy-handed comments earlier on this

chourmo commented 1 year ago

An example of attrs use is one of my little personal projects : https://github.com/chourmo/netpandas It uses it to keep track of the column name with ('from' 'to') to represent a graph structure in a standard dataframe. This is just an example, do not make a decision just based on my use case :)

theOehrly commented 1 year ago

Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1. I have previously worked on some of the missing __finalize__ calls and tests. If you decide that you want to keep this, I can likely offer some time to work on this as well over the next few weeks if you want to get this to a fully working state then.

jorisvandenbossche commented 1 year ago

@chourmo @theOehrly thanks a lot for chiming in! That's useful feedback, and it's good to see real-world examples so we can better evaluate this.

Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1.

@theOehrly I know you are aware of it, but for the general reader, the issue about subclasses/_metadata is at https://github.com/pandas-dev/pandas/issues/51280

TomAugspurger commented 1 year ago

Yeah if someone's willing to step up and champion it (like it looks Tom might be doing?)

Champion might be a bit strong :) It'll just be an hour or so on random weekend mornings.

ssweber commented 1 year ago

Another “using it!” chime.

Our library just converted to using dataframes for ResultSets.

attrs will store things like asc/desc sort order, if a inserted row is “virtual” (unsaved to db), etc.

topper-123 commented 1 year ago

I like the having a fixed location where users can store their own meta data. But at the same time I think that __finalize__ is not a nice concept, it's brittle.

If we can't make the propagation work, I'd be in favor of keeping attrs but then drop the propagation. There is IMO still value in having a location to put user data related to the dataframe.

ssweber commented 1 year ago

Update: After implementing and using, we only had to reattach attrs once, and it makes sense:

attrs = self.rows.attrs.copy()
row_series = pd.Series(row)
self.rows = pd.concat([self.rows, row_series.to_frame().T], ignore_index=True)
self.rows.attrs = attrs
topper-123 commented 1 year ago

can we say we agree that we deprecate giving attrs to __finalize__ but keep the attribute otherwise, i.e. make this a much simpler implementation?

timhoffm commented 1 year ago

Another user here. We use attrs to carry along additional metadata e.g. for timeseries.


Comments on above suggestions concerning removal:

As for what users should do - I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves

This is quite inconvenient. You loose a lot of API. For example I can currently do (s1 + s2) / 2. For a dataframe or custom class one would have to reimplement all this. Also using the data will become more verbose: s1.dropna() will become s1.data.dropna() (or similar). If you mainly use the data and only rarely the attrs, sprinkling in .data everywhere not very readable or user-friendly.

can we say we agree that we deprecate giving attrs to finalize but keep the attribute otherwise, i.e. make this a much simpler implementation?

If I understand correctly, not handling attrs in __finalize__ would mean no propagation. The great value of attrs is that it's propagated. Without that usefulness is reduced by 90% because almost all operations create new series/dataframes and immediately loose all the user-set attrs, e.g. just doing a simple dropna() and your information is gone.


Comments on maintaining attrs

There are two aspects:

  1. Performance: As long as a __finalize__ concept exists, attrs propagation can be handled in there with almost zero overhead. We're essentially only doing a dict update, which (in the empty case) costs some tens of nanoseconds and is on the same order as a function call. There's likely also some micro-optimization potential here if needed.

  2. Behavior: What should the behavior be when multiple objects are involved (e.g. s1 + s2 or concat())? One should specify the desired behavoir (maybe just document the current one as it seems to work good enough for most people?). There may still be places without __finalize__. These can be fixed as we go. Or one could also document that "many operations will maintain attrs, but not all" - This relieves the project from the need to handle every non-working case as a bug, and at least I could live with such a weak guarantee. Again, in practice this seems to work mostly good enough already.

I'd be happy to go into discussion what's needed to keep propagated attrs around, and possibly could help out with some work here and there.

MarcoGorelli commented 1 year ago

Many thanks @timhoffm for your comment!

I'm gonna reverse my previous stance then, it's really not too big of a deal to keep it. Furthermore, since I made my original comment, there have been PRs merged to improve attrs propagation

scott-arne commented 11 months ago

I'd find it a pretty significant loss of functionality if attrs went away, especially Series-based attrs. Here are just a few ways that it is being used in several of my packages:

  1. To store custom custom rendering options that are set in conjunction with the register_series_accessor decorator. For example, I register a series accessor called highlight that lets users highlight chemical substructures in DataFrames in Jupyter. For lack of any better place to put that, I'm using attrs, since (until a recent change, see below) that data was propagated when the DataFrame was copied but it did not matter to me if it persisted on serialization.

  2. As the probably widest reported use case, for units. In other packages, I register custom accessors that retrieve data from our database (e.g., df.get_toxicity_data(["List", "of", "experiments"])). I bring back those units and attach them as Series metadata. Even though this would actually be nice to persist, I'm ok with letting my users make sure they standardize units as they wish before they serialize their DataFrames for downstream processing.

I'd be supportive of a _metadata-like approach for Series (as seems to now be documented for DataFrames). But getting rid of any way to store Series-level DataFrame would really make things sticky. I'm not a big fan of transparently returning a DataFrame with subclassed Series (I guess?) just to keep the metadata in attrs from disappearing.

Note from above:

The "see below" about the recent change is that Series attrs now disappear just when calling DataFrame.head() or DataFrame.tail(). I'm mentioning this in another open issue - but maybe worth mentioning here just to comment on the current state of attrs.

df = pd.DataFrame({"MySeries": [1, 2, 3]})
df.MySeries.attrs["metadata"] = "this is important"

# This prints an empty dictionary: {}
print(df.head().MySeries.attrs)

# This still prints the attrs: {'metadata': 'this is important'}
print(df.MySeries.attrs)
rhshadrach commented 10 months ago

It seems that Copy-On-Write removes some of the utility of attrs. Is there a way to set attrs on a column of a DataFrame?

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

df["a"].attrs["name"] = "x"
print(df["a"].attrs)
# {}

df.loc[:, "a"].attrs["name"] = "x"
print(df["a"].attrs)
# {}

ser = df["a"]
ser.attrs["name"] = 'x'
df["c"] = ser
print(df["c"].attrs)
# {}

I imagine the 3rd example can be made to work with CoW, but not the 1st and 2nd. cc @phofl

phofl commented 10 months ago

Yeah I think your conclusion is correct

timhoffm commented 10 months ago

IMHO the first two should error out. Setting attrs is a write operation, but we certainly don't want this to make a copy of the dataframe. Furthermore, attrs is a global property of the dataframe and modifying that through a partial view may be confusing. So the only reasonable behavior is to not allow setting attrs on views.

phofl commented 10 months ago

IMHO the first two should error out.

Yeah I think I agree, it is basically another version of chained assignment

tfardet commented 6 months ago

Am I correct in understanding that the consensus is now that attrs is here to stay? I also use it and would like to increase my reliance on it (provided it doesn't go away).

If it's there to stay, would it be OK to remove the experimental warning in the doc and instead specify when it is not propagated?

timhoffm commented 6 months ago

Not a pandas core dev, but my take on this is that it's aspirational to support attrs. However, there are still some rough edges in particular also in connection with copy-on-write, so that it's not a first-class feature yet.

I would characterize it as:

LucaMarconato commented 5 months ago

I want to add one item to the list of projects using .attrs 😊

In the spatialdata library (a framework developed for spatial biology, kind of an extension of microscopy which looks at molecular content in tissues), we use .attrs to store metadata (simple objects that are JSON-serializable) that is crucial for our data representation.

More precisely, the SpatialData class is a container for various objects: pd.DataFrame, xarray.DataArray, datatree.DataTree, geopandas.GeoDataFrame, dask.dataframe.DataFrame, anndata.AnnData and all these objects provide a .attrs (AnnData provides .uns, which works mostly as .attrs).

I discovered this discussion because recently dask.dataframe.DataFrame dropped the support for .attrs https://github.com/dask/dask/issues/11146 and the whole library broke. I am now working on a PR to try to restore that functionality. I am happy to read that from this conversation that .attrs seems to be here to stay! 💯

akors commented 3 weeks ago

Not a pandas core dev, but my take on this is that it's aspirational to support attrs. However, there are still some rough edges in particular also in connection with copy-on-write, so that it's not a first-class feature yet.

I would characterize it as:

* Is here to stay.
* Works reasonably well, but be prepared for some limitations / bugs.
* Should gradually improve in the long run.

Could we have confirmation by a Pandas core dev that attrs are staying? For example by formally closing all Issues/PR's associated with the deprecation?

I would like to start using .attrs for tracking information about data generation and sources, and I'd love to attach them directly to the DataFrame instead of having to maintain extra data / files or using hacky workarounds suggested on StackOverflow such as adding Python attributes to the DataFrame objects.

jobh commented 2 weeks ago

I've also used .attrs in several projects, and find it very useful — for units, provenance, dirty-state etc. I don't mind the support being incomplete. Two small additions I'd suggest to make the incomplete support feel more... complete? are:

  1. If there isn't already, put a terse list in the documentation of which operations support it, and how (by shallow copy, by reference if any).
  2. Add a .set_attrs() method that allows one-liners like return pd.concat(dfs).set_attrs(dfs[0].attrs)
timhoffm commented 2 weeks ago
  1. If there isn't already, put a terse list in the documentation of which operations support it, and how (by shallow copy, by reference if any).

The Notes section in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas.DataFrame.attrs is the best we have. It's a user-facing paraphrasing of the implementation: attrs handling is done in __finalize__, which most operations use (but there may be exceptions, which is why it may be hard to make a definitive list). pd.concat has defensive special-casing because the dataframe is created from multiple inputs. Copies are always deep.