Open jbrockmendel opened 1 year ago
For context, this feature was added in pandas 1.0 (https://github.com/pandas-dev/pandas/pull/29062, cc @TomAugspurger).
I personally have no idea how much attrs specifically is being used since it was introduced, but in general the ability to store metadata is a topic that has come up a lot. From a quick browsing of our issues, some related ones:
(from the last two issues, it seems there is certainly some user interest in the specific attrs feature)
From Joris in https://github.com/pandas-dev/pandas/issues/51280#issuecomment-1484680456
If performance is the main argument that we would want to deprecate the related features to metadata propagation (attrs/flags, https://github.com/pandas-dev/pandas/issues/52165 and https://github.com/pandas-dev/pandas/issues/52166, where it is currently the only argument), I think we need some more investigation / proof that this is actually a problem.
That's been bugging me too. I haven't looked at the performance, but copying the metadata should just be a dictionary merge / update.
At the end of the day we'll be making a value judgement: is the performance cost worth it. We'll need a clearer idea of performance cost.
where it is currently the only argument
The other argument is that attrs/_metadata is only half-implemented, with a bunch of the test_finalize tests xfailed and a bunch more just wrong. And there is no real prospect of getting these fully working.
If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways.
If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways.
That is not really true, I think. attrs allows users to use this for standard DataFrames, while _metadata allows subclasses to use custom metadata that is not directly exposed to users through attrs.
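To make the distinction concrete, here is a minimal sketch of both mechanisms side by side. The subclass name TaggedFrame, its tag attribute, and the metadata values are hypothetical and chosen only for illustration; the _metadata / _constructor pattern follows the one described in the pandas subclassing documentation.

```python
import pandas as pd

# attrs: a plain dict that any user can set on a standard DataFrame.
plain = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
plain.attrs["source"] = "sensor-1"  # hypothetical metadata value


class TaggedFrame(pd.DataFrame):
    """Hypothetical subclass carrying one custom metadata attribute."""

    # Names listed in _metadata are carried over to results by __finalize__.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # Make pandas operations return TaggedFrame instead of DataFrame.
        return TaggedFrame


tagged = TaggedFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
tagged.tag = "experiment-42"  # hypothetical value, stored outside attrs

plain_copy = plain.copy()
tagged_subset = tagged[["A"]]

print(plain_copy.attrs)     # the user-facing dict travels with the copy
print(tagged_subset.tag)    # the subclass attribute travels with the subset
print(tagged_subset.attrs)  # the subclass attribute is not exposed via attrs
```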
And there is no real prospect of getting these fully working.
Personally, this is the argument I find most persuasive.
I encountered this in the USC contract too: they said they couldn't use attrs, as they'd tried to and it was too unreliable (having tried fixing up the finalize tests, I'm not surprised).
One could make the argument that some feature not working completely isn't a reason to deprecate it, but I'm not sure that's valid if the feature isn't being worked on (by contrast, datetime parsing has bugs, but it's actively being worked on, so the prospect of fixing them is realistic). Given that it's marked as "experimental" anyway, I'd suggest just deprecating/removing it, rather than leaving it hanging around unfinished.
As for users wanting to store metadata - does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are.
As for what users should do - I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves
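A minimal sketch of what such a wrapper could look like. The class name AnnotatedFrame, its field names, and the example metadata are hypothetical; this is one possible shape of the idea, not an established pattern from pandas itself.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class AnnotatedFrame:
    """Hypothetical wrapper: one field for the data, one for the metadata."""

    data: pd.DataFrame
    metadata: dict = field(default_factory=dict)

    def dropna(self) -> "AnnotatedFrame":
        # Propagation is explicit: each wrapped operation decides what to carry over.
        return AnnotatedFrame(self.data.dropna(), dict(self.metadata))


wrapped = AnnotatedFrame(
    pd.DataFrame({"value": [1.0, None, 3.0]}),
    {"units": "kg"},  # hypothetical metadata
)
cleaned = wrapped.dropna()
print(len(cleaned.data), cleaned.metadata)
```

In this sketch, every operation that should keep the metadata needs its own wrapper method, and any other pandas call has to go through the data field.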
but I'm not sure that's valid if the feature isn't being worked on
To be clear, I am not working on this myself, so I don't know the details. But I am not sure it is true that it is not being worked on: judging by the activity and linked PRs in https://github.com/pandas-dev/pandas/issues/28283, there is some work going on to improve this? (It might have slowed down in the last months, but generally speaking, for the year 2022, quite some PRs related to this have been merged.)
I think the bigger problem is that there is no longer an active champion following up on this within the core team
I can chip away at these as I have free time.
As for users wanting to store metadata - does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are.
xarray does, and I think is a good analog here.
I can chip away at these as I have free time.
Awesome!
I think the bigger problem is that there is no longer an active champion following up on this within the core team
Yeah if someone's willing to step up and champion it (like it looks Tom might be doing?) then I have no objections to salvaging this, apologies for having made some too heavy-handed comments earlier on this
An example of attrs use is one of my little personal projects: https://github.com/chourmo/netpandas. It uses attrs to keep track of the column names ('from', 'to') that represent a graph structure in a standard dataframe. This is just an example; do not make a decision based only on my use case :)
Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1.
I have previously worked on some of the missing __finalize__ calls and tests. If you decide that you want to keep this, I can likely offer some time over the next few weeks to help get this to a fully working state.
@chourmo @theOehrly thanks a lot for chiming in! That's useful feedback, and it's good to see real-world examples so we can better evaluate this.
Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1.
@theOehrly I know you are aware of it, but for the general reader, the issue about subclasses/_metadata is at https://github.com/pandas-dev/pandas/issues/51280
Yeah if someone's willing to step up and champion it (like it looks Tom might be doing?)
Champion might be a bit strong :) It'll just be an hour or so on random weekend mornings.
Another “using it!” chime.
Our library just converted to using dataframes for ResultSets.
attrs will store things like asc/desc sort order, whether an inserted row is “virtual” (unsaved to db), etc.
I like having a fixed location where users can store their own metadata. But at the same time I think that __finalize__ is not a nice concept; it's brittle.
If we can't make the propagation work, I'd be in favor of keeping attrs but dropping the propagation. There is IMO still value in having a location to put user data related to the dataframe.
Update: After implementing and using, we only had to reattach attrs once, and it makes sense:
attrs = self.rows.attrs.copy()
row_series = pd.Series(row)
self.rows = pd.concat([self.rows, row_series.to_frame().T], ignore_index=True)
self.rows.attrs = attrs
can we say we agree that we deprecate giving attrs to __finalize__ but keep the attribute otherwise, i.e. make this a much simpler implementation?
Another user here. We use attrs to carry along additional metadata, e.g. for timeseries.
As for what users should do - I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves
This is quite inconvenient. You lose a lot of API. For example, I can currently do (s1 + s2) / 2. For a dataframe or custom class one would have to reimplement all this. Also, using the data becomes more verbose: s1.dropna() will become s1.data.dropna() (or similar). If you mainly use the data and only rarely the attrs, sprinkling in .data everywhere is not very readable or user-friendly.
can we say we agree that we deprecate giving attrs to finalize but keep the attribute otherwise, i.e. make this a much simpler implementation?
If I understand correctly, not handling attrs in __finalize__ would mean no propagation. The great value of attrs is that it's propagated. Without that, usefulness is reduced by 90%, because almost all operations create new series/dataframes and immediately lose all the user-set attrs, e.g. just doing a simple dropna() and your information is gone.
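A small sketch of the difference, assuming a pandas version in which Series.dropna() goes through __finalize__. The metadata key and value are hypothetical. The second half imitates, by hand, what a caller would have to do after every operation if propagation were dropped.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0])
s1.attrs["units"] = "m/s"  # hypothetical metadata

# dropna() builds a new Series; __finalize__ is what copies attrs onto it.
cleaned = s1.dropna()
print(cleaned.attrs)

# Without propagation, the caller would have to reattach attrs by hand.
manual = pd.Series(cleaned.to_numpy(), index=cleaned.index)
print(manual.attrs)  # a freshly constructed Series starts with empty attrs
manual.attrs = dict(s1.attrs)
```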
There are two aspects:
- Performance: As long as a __finalize__ concept exists, attrs propagation can be handled in there with almost zero overhead. We're essentially only doing a dict update, which (in the empty case) costs some tens of nanoseconds and is on the same order as a function call. There's likely also some micro-optimization potential here if needed.
- Behavior: What should the behavior be when multiple objects are involved (e.g. s1 + s2 or concat())? One should specify the desired behavior (maybe just document the current one, as it seems to work well enough for most people?). There may still be places without __finalize__. These can be fixed as we go. Or one could also document that "many operations will maintain attrs, but not all". This relieves the project from the need to handle every non-working case as a bug, and at least I could live with such a weak guarantee. Again, in practice this seems to work mostly well enough already.
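The performance claim can be checked roughly with a micro-benchmark like the one below. It is only a sketch: it times an empty dict update against a bare function call in pure Python, not the actual __finalize__ code path in pandas, and the absolute numbers depend on the machine and Python version.

```python
import timeit

n = 200_000

# Time an update from an empty dict, the common case when no attrs are set.
update_time = timeit.timeit(
    "target.update(source)",
    setup="source = {}; target = {}",
    number=n,
)

# Baseline: the cost of calling a function that does nothing.
call_time = timeit.timeit(
    "noop()",
    setup="def noop(): return None",
    number=n,
)

# Report the per-operation cost in nanoseconds.
print(f"empty dict.update: {update_time / n * 1e9:.0f} ns")
print(f"function call:     {call_time / n * 1e9:.0f} ns")
```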
I'd be happy to go into discussion what's needed to keep propagated attrs around, and possibly could help out with some work here and there.
Many thanks @timhoffm for your comment!
I'm gonna reverse my previous stance then, it's really not too big of a deal to keep it. Furthermore, since I made my original comment, there have been PRs merged to improve attrs propagation
I'd find it a pretty significant loss of functionality if attrs went away, especially Series-based attrs. Here are just a few ways that it is being used in several of my packages:
- To store custom rendering options that are set in conjunction with the register_series_accessor decorator. For example, I register a series accessor called highlight that lets users highlight chemical substructures in DataFrames in Jupyter. For lack of any better place to put that, I'm using attrs, since (until a recent change, see below) that data was propagated when the DataFrame was copied, and it did not matter to me whether it persisted on serialization.
- For units, probably the most widely reported use case. In other packages, I register custom accessors that retrieve data from our database (e.g., df.get_toxicity_data(["List", "of", "experiments"])). I bring back those units and attach them as Series metadata. Even though this would actually be nice to persist, I'm ok with letting my users make sure they standardize units as they wish before they serialize their DataFrames for downstream processing.
I'd be supportive of a _metadata-like approach for Series (as seems to now be documented for DataFrames). But getting rid of any way to store Series-level metadata would really make things sticky. I'm not a big fan of transparently returning a DataFrame with subclassed Series (I guess?) just to keep the metadata in attrs from disappearing.
Note from above:
The "see below" about the recent change is that Series attrs now disappear just when calling DataFrame.head() or DataFrame.tail(). I'm mentioning this in another open issue, but it is maybe worth mentioning here just to comment on the current state of attrs.
df = pd.DataFrame({"MySeries": [1, 2, 3]})
df.MySeries.attrs["metadata"] = "this is important"
# This prints an empty dictionary: {}
print(df.head().MySeries.attrs)
# This still prints the attrs: {'metadata': 'this is important'}
print(df.MySeries.attrs)
It seems that Copy-On-Write removes some of the utility of attrs. Is there a way to set attrs on a column of a DataFrame?
pd.set_option("mode.copy_on_write", True)
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
df["a"].attrs["name"] = "x"
print(df["a"].attrs)
# {}
df.loc[:, "a"].attrs["name"] = "x"
print(df["a"].attrs)
# {}
ser = df["a"]
ser.attrs["name"] = 'x'
df["c"] = ser
print(df["c"].attrs)
# {}
I imagine the 3rd example can be made to work with CoW, but not the 1st and 2nd. cc @phofl
Yeah I think your conclusion is correct
IMHO the first two should error out. Setting attrs is a write operation, but we certainly don't want this to make a copy of the dataframe. Furthermore, attrs is a global property of the dataframe and modifying that through a partial view may be confusing. So the only reasonable behavior is to not allow setting attrs on views.
IMHO the first two should error out.
Yeah I think I agree, it is basically another version of chained assignment
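One possible workaround that avoids writing through a temporary column object is to keep per-column metadata in the DataFrame's own attrs, keyed by column label. This is a sketch only: the "columns" key and the column_attrs helper are hypothetical conventions, not something pandas provides or that was proposed in this thread.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# Hypothetical convention: per-column metadata lives in the frame's own attrs,
# keyed by column label, so nothing is set on a temporary column object.
df.attrs["columns"] = {"a": {"name": "x"}}


def column_attrs(frame, label):
    """Look up metadata for one column (hypothetical helper)."""
    return frame.attrs.get("columns", {}).get(label, {})


copied = df.copy()
print(column_attrs(copied, "a"))
print(column_attrs(copied, "b"))
```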
Am I correct in understanding that the consensus is now that attrs is here to stay?
I also use it and would like to increase my reliance on it (provided it doesn't go away).
If it's there to stay, would it be OK to remove the experimental warning in the doc and instead specify when it is not propagated?
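Not a pandas core dev, but my take on this is that it's aspirational to support attrs. However, there are still some rough edges, in particular also in connection with copy-on-write, so that it's not a first-class feature yet.
I would characterize it as:
* Is here to stay.
* Works reasonably well, but be prepared for some limitations / bugs.
* Should gradually improve in the long run.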
I want to add one item to the list of projects using .attrs 😊
In the spatialdata library (a framework developed for spatial biology, kind of an extension of microscopy which looks at molecular content in tissues), we use .attrs to store metadata (simple objects that are JSON-serializable) that is crucial for our data representation.
More precisely, the SpatialData class is a container for various objects: pd.DataFrame, xarray.DataArray, datatree.DataTree, geopandas.GeoDataFrame, dask.dataframe.DataFrame, anndata.AnnData, and all these objects provide a .attrs (AnnData provides .uns, which works mostly as .attrs).
I discovered this discussion because dask.dataframe.DataFrame recently dropped support for .attrs (https://github.com/dask/dask/issues/11146) and the whole library broke. I am now working on a PR to try to restore that functionality. I am happy to read from this conversation that .attrs seems to be here to stay! 💯
Not a pandas core dev, but my take on this is that it's aspirational to support attrs. However, there are still some rough edges, in particular also in connection with copy-on-write, so that it's not a first-class feature yet. I would characterize it as:
* Is here to stay.
* Works reasonably well, but be prepared for some limitations / bugs.
* Should gradually improve in the long run.
Could we have confirmation from a pandas core dev that attrs is staying? For example, by formally closing all issues/PRs associated with the deprecation?
I would like to start using .attrs for tracking information about data generation and sources, and I'd love to attach them directly to the DataFrame instead of having to maintain extra data / files or using hacky workarounds suggested on StackOverflow such as adding Python attributes to the DataFrame objects.
I've also used .attrs in several projects, and find it very useful for units, provenance, dirty-state etc. I don't mind the support being incomplete. Two small additions I'd suggest to make the incomplete support feel more... complete? are:
- A .set_attrs() method that allows one-liners like return pd.concat(dfs).set_attrs(dfs[0].attrs)
- If there isn't one already, a terse list in the documentation of which operations support it, and how (by shallow copy, or by reference, if any).
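Until something like .set_attrs() exists, a similar one-liner can be approximated with DataFrame.pipe and a small helper. The set_attrs function below is hypothetical (it is not part of pandas), and the example frames and metadata are made up for illustration.

```python
import pandas as pd


def set_attrs(obj, attrs):
    """Hypothetical helper: replace obj.attrs and return obj for chaining."""
    obj.attrs = dict(attrs)  # shallow copy so the source dict is not shared
    return obj


first = pd.DataFrame({"x": [1, 2]})
first.attrs["units"] = "kg"  # hypothetical metadata
second = pd.DataFrame({"x": [3, 4]})  # no attrs set
dfs = [first, second]

# DataFrame.pipe passes the concatenated frame as the first argument to set_attrs.
combined = pd.concat(dfs, ignore_index=True).pipe(set_attrs, dfs[0].attrs)
print(combined.attrs)
```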
The Notes section in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas.DataFrame.attrs is the best we have. It's a user-facing paraphrasing of the implementation: attrs handling is done in __finalize__, which most operations use (but there may be exceptions, which is why it may be hard to make a definitive list). pd.concat has defensive special-casing because the dataframe is created from multiple inputs. Copies are always deep.
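The pd.concat special-casing can be observed with a short experiment. This sketch assumes a recent pandas version, where attrs are kept only if all inputs carry the same attrs; older releases may behave differently. The metadata values are hypothetical.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
a.attrs["units"] = "kg"
b.attrs["units"] = "kg"

# All inputs agree on attrs, so the result can carry them.
same = pd.concat([a, b])
print(same.attrs)

# Inputs disagree, so there is no single answer and attrs are dropped.
b.attrs["units"] = "lb"
mixed = pd.concat([a, b])
print(mixed.attrs)
```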
Discussion broken off from https://github.com/pandas-dev/pandas/issues/51280 PR #52152
Propagation of attrs in __finalize__ is a small-but-everywhere performance hit that we should deprecate.