pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.34k stars 17.81k forks

Allow custom metadata to be attached to panel/df/series? #2485

Closed ghost closed 4 years ago

ghost commented 11 years ago

related: https://github.com/pydata/pandas/issues/39 (column descriptions) https://github.com/pydata/pandas/issues/686 (serialization concerns) https://github.com/pydata/pandas/issues/447#issuecomment-11152782 (Feature request, implementation variant)

Ideas and issues:

naught101 commented 10 years ago

See also: https://github.com/Jim-Holmstroem/MetadataDataFrame

JamesRamm commented 10 years ago

I sub-classed DataFrame in order to provide meta-data (such as a name attribute). In order to get around all the methods returning new dataframe objects, I created a decorator to grab the returned df and make it an instance of my sub class. This is rather painful though as it means re-implementing every such method and adding the decorator. e.g:

class NamedDataFrame(DataFrame):

    @named_dataframe
    def from_csv(...):
        return super(NamedDataFrame, self).from_csv(...)

jreback commented 10 years ago

you can see what you can do w.r.t. #6923, #6927

this is a much harder problem than at first glance.

you don't need to sub-class, just override _metadata and __finalize__ and you can provide support for the name attribute.

jason-s commented 9 years ago

@jreback: your comment from #6923:

The entire problem arises from how to combine them.

Imagine we supported this:

s1.filename='a'
s2.filename='b'

what is (s1+s2).filename?

Pandas has already chosen an approach for handling the semantics of metadata in Series: it's how the library handles the name attribute. Personally I don't see why the basic behavior for any metadata attribute should be any different:

>>> t = np.array([0,0.1,0.2])
>>> s1 = pd.Series(t*t,t,name='Tweedledee')
>>> s2 = pd.Series(t*t,t,name='Tweedledum')
>>> s1
0.0    0.00
0.1    0.01
0.2    0.04
Name: Tweedledee, dtype: float64
>>> s1*2
0.0    0.00
0.1    0.02
0.2    0.08
Name: Tweedledee, dtype: float64
>>> s1+2
0.0    2.00
0.1    2.01
0.2    2.04
Name: Tweedledee, dtype: float64
>>> s1+s2
0.0    0.00
0.1    0.02
0.2    0.08
dtype: float64

>>> s3 = pd.Series(t*t,t,name='Tweedledum')
>>> s1+s3
0.0    0.00
0.1    0.02
0.2    0.08
dtype: float64
>>> s2+s3
0.0    0.00
0.1    0.02
0.2    0.08
Name: Tweedledum, dtype: float64
>>> s1.iloc[:2]
0.0    0.00
0.1    0.01
Name: Tweedledee, dtype: float64

This shows that indexing and operations using a constant preserve the name. It also shows that binary operations between Series preserve the name if both operands share the same name, and remove the name if the operands have different names.

This is a baseline behavior that at least does something reasonable, and if extended to metadata in general, would be consistent with Pandas' existing behavior of the name attribute.

Yeah, in an ideal world we could write a units addon class and attach them to Series and have it do the right thing in handling math operations (require the same units for addition/subtraction, compute new units for multiplication/division/powers, require unitless numbers for most other functions). But right now it would be helpful just to have something basic.
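As a rough illustration of the unit-handling rules described above (this is not an existing pandas feature), a toy sketch might look like the following. The `Unit` class and its `combine` method are hypothetical names made up for this example; a real implementation would more likely build on a dedicated units library.

```python
import operator


class Unit:
    """Toy unit: maps base-unit names to integer exponents, e.g. m/s is {'m': 1, 's': -1}."""

    def __init__(self, **exponents):
        # Drop zero exponents so that equivalent units compare equal.
        self.exponents = {k: v for k, v in exponents.items() if v != 0}

    def __eq__(self, other):
        return isinstance(other, Unit) and self.exponents == other.exponents

    def __repr__(self):
        return f"Unit({self.exponents})"

    def combine(self, other, op):
        """Return the unit of op(x, y), where x carries `self` and y carries `other`."""
        if op in (operator.add, operator.sub):
            # Addition and subtraction require identical units.
            if self != other:
                raise ValueError(f"incompatible units: {self} and {other}")
            return self
        if op in (operator.mul, operator.truediv):
            # Multiplication adds exponents, division subtracts them.
            sign = 1 if op is operator.mul else -1
            keys = sorted(set(self.exponents) | set(other.exponents))
            return Unit(**{
                k: self.exponents.get(k, 0) + sign * other.exponents.get(k, 0)
                for k in keys
            })
        raise TypeError(f"unsupported operation: {op!r}")


metre = Unit(m=1)
second = Unit(s=1)
print(metre.combine(metre, operator.add))       # Unit({'m': 1})
print(metre.combine(second, operator.truediv))  # Unit({'m': 1, 's': -1})
```

Attaching such an object to a Series would still need the propagation hooks discussed in this thread; the sketch only covers how two unit values could combine.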

I've checked out the _metadata functionality and it seems like it persists only when using a Series with indexing; addition/multiplication by a constant drop the metadata value. Combination of series into a DataFrame doesn't seem to work properly, but I'm not as familiar with the semantics of DataFrame as I am with the Series objects.

jreback commented 9 years ago

@jason-s

ok, so are you proposing something?

jason-s commented 9 years ago

Yes, but I'm not sure how to translate it from a concept to working Python code.

There is code in pandas.Series that seems to preserve the name attribute in a meaningful way under indexing, binary operation with numeric constants, and binary operation with other Series objects.

Is there any reason why other entries in the _metadata list could not be handled the same way, at least as a baseline behavior?
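To make the proposed baseline concrete, here is a small sketch of the "treat every metadata attribute like name" rule, written against plain Python objects rather than pandas internals. The function name `propagate_metadata` and the attribute names are hypothetical; this is not how pandas implements name handling, only an illustration of the rule.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel: the operand has no such attribute at all


def propagate_metadata(left, right, names):
    """Apply the Series.name rule to every attribute listed in `names`.

    If the right operand does not carry the attribute (e.g. a constant),
    the left value is kept. If both carry it, the value survives only when
    the two are equal; otherwise it is reset to None.
    """
    result = {}
    for name in names:
        a = getattr(left, name, None)
        b = getattr(right, name, _MISSING)
        if b is _MISSING:
            result[name] = a
        else:
            result[name] = a if a == b else None
    return result


s1 = SimpleNamespace(filename="a", units="m")
s2 = SimpleNamespace(filename="b", units="m")

print(propagate_metadata(s1, s2, ["filename", "units"]))  # {'filename': None, 'units': 'm'}
print(propagate_metadata(s1, 2, ["filename", "units"]))   # {'filename': 'a', 'units': 'm'}
```

Under this rule, the answer to "what is (s1+s2).filename?" would simply be None.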

hughesadam87 commented 9 years ago

Jason,

While I don't have any opinions on what should be in pandas and what shouldn't, I can bring to your attention some workarounds.

First, Stephan Hoyer has put a lot of work into the xray library (http://www.slideshare.net/PyData/xray-extended-arrays-for-scientific-datasets-by-stephan-hoyer) which intrinsically supports metadata on labeled arrays. Based on what I've seen from the tutorials, it's the most robust solution to the problem.

Secondly, the geopandas library has a subclassed dataframe which stores metadata. You can probably engineer your own from copying some of their approaches: https://www.google.com/search?q=geopandas&aq=f&oq=geopandas&aqs=chrome.0.57j60l3j0l2.1305j1&sourceid=chrome&ie=UTF-8

Finally, I have a "MetaDataframe" object that's pretty much a hack, but will work in the way you desire. All you have to do is subclass from it, and metadata should persist. E.g.:

class MyClass(MetaDataFrame): ...

You don't need the library it's in, just the class itself: https://github.com/hugadams/pyuvvis/blob/master/pyuvvis/pandas_utils/metadframe.py

While I can't promise it will work correctly for all dataframe functionality, you can implement it in only a few lines. Check out the "SubFoo" class in the metadframe.py file for an example.

Adam Hughes Physics Ph.D Candidate George Washington University

hughesadam87 commented 9 years ago

Sorry, and just to be clear, the GeoPandas object is a subclassed dataframe. The MetaDataframe class is not; it's a composite class that passes calls down to the dataframe. Therefore, while you can subclass it very easily, I can't promise it's going to work perfectly in all use cases. The GeoPandas/XRay solutions are more robust.

jason-s commented 9 years ago

Thanks. I'll take a look, maybe even get my fingers dirty with pandas internals. I do think this should be done right + not rushed into, but I also think that it's important to get at least some useful basic functionality implemented + separate that from a more general solution that may or may not exist.

Something like:

    s1attr = getattr(series1, attrname)
    s2attr = getattr(series2, attrname)
    try:
        # if the attributes know how to combine themselves, let them
        sresultattr = s1attr._combine(s2attr, op)
    except AttributeError:
        # otherwise, if they're equal, propagate to output
        # user must beware of mutable values with equivalence
        if s1attr == s2attr:
            sresultattr = s1attr
        else:
            sresultattr = None

jreback commented 9 years ago

@jason-s can you show an example of what you are wanting to do? pseudo-code is fine

you can simply add to the _metadata class-level attribute and then it will propagate that attribute.
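A minimal sketch of that suggestion, following the subclassing pattern in the pandas "Extending pandas" documentation. The class name `TaggedFrame` and the attribute `source` are hypothetical, and which operations carry the attribute along depends on the pandas version; column selection is the case the pandas docs themselves demonstrate.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    # Hypothetical attribute name; anything listed in _metadata is
    # copied to results by __finalize__ where pandas calls it.
    _metadata = ["source"]

    @property
    def _constructor(self):
        # Make pandas build TaggedFrame results instead of plain DataFrames.
        return TaggedFrame


df = TaggedFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df.source = "run-42.csv"

subset = df[["A", "B"]]
print(type(subset).__name__, subset.source)
```

Operations that combine two frames are the harder case raised in this thread, since it is ambiguous which operand's value should win.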

hughesadam87 commented 9 years ago

Here is a longer discussion of the issue:

https://github.com/pydata/pandas/issues/2485

jason-s commented 9 years ago

OK. I'll try to put some time in tonight. As the issue feature in github is a little clunky, what I'll probably do is create a sample IPython notebook + publish as a gist.

The _metadata attribute works fine with Series but seems to behave oddly in DataFrame objects.

shoyer commented 9 years ago

@jason-s Based on my experience with xray, the biggest complexity is how you handle metadata arguments that you can't (or don't want to) check for equality, e.g., if the metadata could be a numpy array, for which equality checks are elementwise, or worse, with some missing values (note np.nan != np.nan). Of course, there are workarounds for this sort of stuff but it's pretty awkward.
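One possible shape for such a workaround, sketched in plain Python: a comparison helper that treats two NaNs as equal and gives up (returns False) whenever `==` does not yield a single unambiguous truth value. The name `metadata_equal` is hypothetical, and this is not what xray or pandas actually do.

```python
import math


def metadata_equal(a, b):
    """Return True only when a and b are unambiguously equal."""
    if a is b:
        return True
    # NaN never compares equal to itself, so handle float NaNs explicitly.
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    try:
        # bool() fails for objects whose == is elementwise, for example
        # multi-element numpy arrays, which raise ValueError here.
        return bool(a == b)
    except (TypeError, ValueError):
        return False


print(metadata_equal("a.csv", "a.csv"))              # True
print(metadata_equal(float("nan"), float("nan")))    # True
print(float("nan") == float("nan"))                  # False
```

Returning False on ambiguity means the attribute is dropped rather than guessed, which matches the conservative "equal or None" rule discussed earlier in the thread.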

I'll add more in #8572.

@hugadams Thanks for the xray plug. Next time use my GH handle and github will ping me automatically :).

hughesadam87 commented 9 years ago

Got it, sorry

bilderbuchi commented 8 years ago

Any news on this issue? I just found myself wishing for the possibility to attach metadata (probably in a dict) to a dataframe.

jreback commented 8 years ago

It's certainly possible to add a default propagated attribute like .attrs via the _metadata/__finalize__ machinery. IIRC geopandas does this.

But would need quite a bit of auditing and testing. You are welcome to have a go. Can you show your non-trivial use case?

bilderbuchi commented 8 years ago

My use case would be similar to what I imagine @hugadams meant when talking about working with spectroscopy results - data that are constant for the whole dataframe, like

nzjrs commented 8 years ago

I have the same use case as @bilderbuchi (recording scientific experimental metadata):

  • subject information - genotype, gender, age
  • experiment information - version hashes, config hashes

hughesadam87 commented 8 years ago

It's now much easier to subclass a dataframe and add your own attributes and methods. This wasn't the case when I started the issue.

nzjrs commented 8 years ago

yeah, but something that round-trips through a vanilla pickled dataframe would be preferable

dacoex commented 8 years ago

Would the features offered by xarray be something that can be adopted here? See Data Structures: http://xarray.pydata.org/en/stable/data-structures.html

They have data attributes. If pandas could get the same features, this would be great. Unit conversion, unit propagation, etc.

hughesadam87 commented 8 years ago

I think xarray is what you want.

You may also try this metadataframe class I wrote a few years ago. It may no longer work with current pandas versions, but I haven't tried.

https://github.com/hugadams/scikit-spectra/blob/b6171bd3c947728e01413fe91ec0bd66a1a02a34/skspec/pandas_utils/metadframe.py

You should be able to download that file, then just make a class that has attributes like you want. I.e.:

df = MetaDataframe()
df.a = a
df.b = b

I thought that after 0.16, it was possible to simply subclass a dataframe, right?

I.e.:

class MyDF(DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.a = 50
        self.b = 20

Or is this not the case?

hughesadam87 commented 8 years ago

Here's what I was talking about:

http://pandas.pydata.org/pandas-docs/stable/internals.html#override-constructor-properties

dacoex commented 8 years ago

I think xarray is what you want.

So are you saying that anyone who wants to use metadata would be better off using xarray?

shoyer commented 8 years ago

They have data attributes. If pandas could get the same features, this would be great. Unit conversion, unit propagation, etc.

Just to be clear, xarray does support adding arbitrary metadata, but not automatic unit conversion. We could hook up a library like pint to handle this, but it's difficult to get all the edge cases working until numpy has better dtype support.

nzjrs commented 8 years ago

I think 'automatic unit conversion based on metadata attached to series' is a significantly different and more involved feature request to this issue. I hope a simpler upstream supported solution allowing attaching simple text-only metadata can be found before increasing the scope too much.

jreback commented 8 years ago

This is quite simple in current versions of pandas.

I am using a sub-class here for illustration purposes. Really all that would be needed would be adding the __finalize__ to most of the construction methods (this already exists now for Series, but not really for DataFrame).

Unambiguous propagation would be quite easy, and users could add in their own __finalize__ to handle more complicated cases (e.g. what would you do when you have df + df2?).

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:
:class MyDataFrame(DataFrame):
:    _metadata = ['attrs']
:
:    @property
:    def _constructor(self):
:        return MyDataFrame
:
:    def _combine_const(self, other, *args, **kwargs):
:        return super(MyDataFrame, self)._combine_const(other, *args, **kwargs).__finalize__(self)
:--

In [2]: df = MyDataFrame({'A' : [1,2,3]})

In [3]: df.attrs = {'foo' : 'bar'}

In [4]: df.attrs
Out[4]: {'foo': 'bar'}

In [5]: (df+1).attrs
Out[5]: {'foo': 'bar'}

Would take a patch for this. The modifications are pretty straightforward; it's the testing that is the key here.

postelrich commented 7 years ago

@jreback is there a generic way to persist metadata amongst all transforms applied to a dataframe, including groupbys? Or would one have to go through and override a lot of methods to call __finalize__?

jreback commented 7 years ago

@postelrich for most/all things, __finalize__ should already be defined (and so in theory you can make it persist attributes). It's not tested very well, though.

For Series I think this is quite robust. DataFrame pretty good. I doubt this works at all for groupby / merge / most reductions. Those are really dependent on the __finalize__ (it may or may not be called), that is the simple part. The hard part is deciding what to do.

jbrockmendel commented 7 years ago

I've been working on an implementation of this that handles the propagation problem by making the Metadata object itself subclass Series. Then patch Series to relay methods to Metadata. Roughly:

class MSeries(pd.Series):
    def __init__(self, *args, **kwargs):
        pd.Series.__init__(self, *args, **kwargs)
        self.metadata = SMeta(self)

    def __add__(self, other):
        res = pd.Series.__add__(self, other)
        res.metadata = self.metadata.__add__(other)
        return res

class SMeta(pd.Series):
    def __init__(self, parent):
        super(...)
        self.parent = parent

    def __add__(self, other):
        new_meta = SMeta(index=self.index)
        other_meta = [... other or other.metadata or None depending ...]
        for key in self.index:
            new_meta[key] = self[key].__add__(other)
        return new_meta

So it is up to the individual MetaDatum classes to figure out how to propagate.

I've generally got this working. The part that I have not gotten working is the desired MFrame behavior df.metadata['A'] is df['A'].metadata. Any ideas on how to make that happen?

JochemBoersma commented 7 years ago

Propagation of attributes (defined in _metadata) gives me some headaches...

Based on the code of jreback, I've tried the following:

from pandas import DataFrame
class MyDataFrame(DataFrame):
    _metadata = ['attrs']

    @property
    def _constructor(self):
        return MyDataFrame

    def _combine_frame(self, other, *args, **kwargs):
        return super(MyDataFrame, self)._combine_frame(other, *args, **kwargs).__finalize__(self)

dfA = MyDataFrame({'A' : [1,2,3]})
dfA.attrs = {'foo' : 'bar'}

dfB = MyDataFrame({'B' : [6,7,8]})
dfB.attrs = {'fuzzy': 'busy'}

dfC = dfA.append(dfB)
dfC.attrs   #Returns error: 'MyDataFrame' object has no attribute 'attrs'
            #I would like that it would be {'foo': 'bar'}

As jreback mentioned, choices have to be made about what to do with the appended attributes. However, it would really help me if the attributes of dfA alone simply propagated to dfC.

EDIT: more headache is more better, it pushes me to think harder :). Solved it by stealing the __finalize__ solution that GeoPandas provides. __finalize__ works pretty well indeed. However, I'm not experienced enough to perform the testing.

ppwfx commented 6 years ago

Can't we just put metadata in the column name and change how columns are accessed? E.g. ["id"] would internally translate to {"name": "id"}.

Don't know the internals of pandas, so sorry if this might be a little naive. To me it just seems that the column name is really consistent across operations

yarikoptic commented 5 years ago

My use case would be adding a description to "indicator variables" (just 0/1) which otherwise look like var#1, var#2 etc., and I do not want to pollute those names with the potentially long values they actually stand for.

mroeschke commented 4 years ago

I think we have _metadata https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties and .attrs defined for these metadata use cases. If these don't sufficiently cover the necessary use cases, new issues can be created about those 2 methods. Closing.
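A minimal sketch of the `.attrs` route mentioned above, assuming pandas 1.0 or later (where `DataFrame.attrs` was introduced; it is documented as experimental). The keys and values are hypothetical, borrowed from the experimental-metadata use case raised earlier in the thread.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# attrs is a plain dict attached to the object; these entries are made up.
df.attrs["genotype"] = "wild-type"
df.attrs["config_hash"] = "abc123"

# copy() carries attrs along; whether other operations do so can
# depend on the pandas version, so it is worth checking case by case.
copied = df.copy()
print(copied.attrs)
```

Unlike the `_metadata` approach, this needs no subclass, which is closer to the original request for metadata on vanilla pandas objects.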