I subclassed DataFrame in order to provide metadata (such as a name attribute). To get around all the methods returning new DataFrame objects, I created a decorator to grab the returned df and make it an instance of my subclass. This is rather painful, though, as it means re-implementing every such method and adding the decorator, e.g.:
class NamedDataFrame(DataFrame):
    @named_dataframe
    def from_csv(...):
        return super(NamedDataFrame, self).from_csv(...)
You can see what you can do w.r.t. #6923 and #6927.
This is a much harder problem than it appears at first glance.
You don't need to sub-class; just override _metadata and __finalize__, and you can provide support for the name attribute.
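For readers landing on this thread later, the _metadata/__finalize__ approach can be sketched roughly as follows. This is a minimal sketch, not official pandas example code; the class name and the filename attribute are made up, and the exact set of operations that propagate metadata varies across pandas versions:

```python
import pandas as pd

class NamedDataFrame(pd.DataFrame):
    # attributes listed in _metadata are copied onto results by __finalize__
    _metadata = ['filename']

    @property
    def _constructor(self):
        # ensure pandas operations return NamedDataFrame, not plain DataFrame
        return NamedDataFrame

df = NamedDataFrame({'A': [1, 2, 3]})
df.filename = 'experiment-42'

sliced = df.iloc[:2]     # slicing goes through __finalize__
print(sliced.filename)   # the attribute survives the operation
```

No decorator per method is needed; pandas itself calls __finalize__ on many returned objects.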
@jreback: your comment from #6923:
The entire problem arises from how to combine them.
Imagine we supported this:
s1.filename = 'a'
s2.filename = 'b'
what is (s1+s2).filename?
Pandas has already chosen an approach for handling the semantics of metadata in Series: it's how the library handles the name attribute. Personally, I don't see why the basic behavior for any other metadata attribute should be any different:
>>> t = np.array([0,0.1,0.2])
>>> s1 = pd.Series(t*t,t,name='Tweedledee')
>>> s2 = pd.Series(t*t,t,name='Tweedledum')
>>> s1
0.0 0.00
0.1 0.01
0.2 0.04
Name: Tweedledee, dtype: float64
>>> s1*2
0.0 0.00
0.1 0.02
0.2 0.08
Name: Tweedledee, dtype: float64
>>> s1+2
0.0 2.00
0.1 2.01
0.2 2.04
Name: Tweedledee, dtype: float64
>>> s1+s2
0.0 0.00
0.1 0.02
0.2 0.08
dtype: float64
>>> s3 = pd.Series(t*t,t,name='Tweedledum')
>>> s1+s3
0.0 0.00
0.1 0.02
0.2 0.08
dtype: float64
>>> s2+s3
0.0 0.00
0.1 0.02
0.2 0.08
Name: Tweedledum, dtype: float64
>>> s1.iloc[:2]
0.0 0.00
0.1 0.01
Name: Tweedledee, dtype: float64
This shows that indexing and operations with a constant preserve the name. It also shows that binary operations between Series preserve the name if both operands share the same name, and drop the name if the operands have different names.
This is a baseline behavior that at least does something reasonable, and if extended to metadata in general, would be consistent with Pandas' existing behavior of the name attribute.
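The rule described above can be written as a small standalone function (a sketch of the behavior, analogous to what pandas does internally when combining two Series names, not pandas' actual implementation):

```python
def combine_metadata(a, b):
    """Combine two metadata values: keep the value if both compare equal,
    otherwise drop it (return None), mirroring Series name semantics."""
    try:
        if a == b:
            return a
    except Exception:
        # values that cannot be meaningfully compared -> drop the metadata
        pass
    return None

assert combine_metadata('Tweedledee', 'Tweedledee') == 'Tweedledee'
assert combine_metadata('Tweedledee', 'Tweedledum') is None
```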
Yeah, in an ideal world we could write a units addon class and attach them to Series and have it do the right thing in handling math operations (require the same units for addition/subtraction, compute new units for multiplication/division/powers, require unitless numbers for most other functions). But right now it would be helpful just to have something basic.
I've checked out the _metadata functionality and it seems to persist only when indexing a Series; addition/multiplication by a constant drops the metadata value. Combining Series into a DataFrame doesn't seem to work properly either, but I'm not as familiar with the semantics of DataFrame as I am with Series objects.
@jason-s
ok, so are you proposing something?
Yes, but I'm not sure how to translate it from a concept to working Python code.
There is code in pandas.Series that seems to preserve the name attribute in a meaningful way under indexing, binary operations with numeric constants, and binary operations with other Series objects.
Is there any reason why other entries in the _metadata list could not be handled the same way, at least as a baseline behavior?
Jason,
While I don't have any opinions on what should be in pandas and what shouldn't, I can bring to your attention some workarounds.
First, Stephan Hoyer has put a lot of work into the xray library ( http://www.slideshare.net/PyData/xray-extended-arrays-for-scientific-datasets-by-stephan-hoyer), which intrinsically supports metadata on labeled arrays. Based on what I've seen from the tutorials, it's the most robust solution to the problem.
Secondly, the geopandas library has a subclassed dataframe which stores metadata. You can probably engineer your own from copying some of their approaches: https://www.google.com/search?q=geopandas&aq=f&oq=geopandas&aqs=chrome.0.57j60l3j0l2.1305j1&sourceid=chrome&ie=UTF-8
Finally, I have a "MetaDataframe" object that's pretty much a hack, but will work in the way you desire. All you have to do is subclass from it, and metadata should persist, e.g.:
class MyClass(MetaDataFrame): ...
You don't need the library it's in, just the class itself: https://github.com/hugadams/pyuvvis/blob/master/pyuvvis/pandas_utils/metadframe.py
While I can't promise it will work correctly for all dataframe functionality, you can implement it in only a few lines. Check out the "SubFoo" class in the metadataframe.py file for an example.
Adam Hughes Physics Ph.D Candidate George Washington University
Sorry, and just to be clear, the GeoPandas object is a subclassed dataframe. The MetaDataframe class is not; it's a composite class that passes calls down to the dataframe. Therefore, while you can subclass it very easily, I can't promise it's going to work perfectly in all use cases. The GeoPandas/XRay solutions are more robust.
Thanks. I'll take a look, maybe even get my fingers dirty with pandas internals. I do think this should be done right + not rushed into, but I also think that it's important to get at least some useful basic functionality implemented + separate that from a more general solution that may or may not exist.
Something like:
s1attr = getattr(series1, attrname)
s2attr = getattr(series2, attrname)
try:
    # if the attributes know how to combine themselves, let them
    sresultattr = s1attr._combine(s2attr, op)
except AttributeError:
    # otherwise, if they're equal, propagate to the output
    # (user must beware of mutable values with equality comparison)
    if s1attr == s2attr:
        sresultattr = s1attr
    else:
        sresultattr = None
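The proposal above can be made concrete as a runnable sketch. The `_combine` hook is hypothetical (it is not a pandas API), and the `Units` class is just a toy metadata type invented here to show an attribute that knows how to combine itself:

```python
def combine_attr(a, b, op=None):
    """Combine two metadata attribute values per the proposed rule."""
    combine = getattr(a, '_combine', None)  # hypothetical hook
    if combine is not None:
        # if the attribute knows how to combine itself, let it
        return combine(b, op)
    # otherwise propagate only if both values are equal
    return a if a == b else None

class Units:
    """Toy metadata type: a physical unit that refuses to add to other units."""
    def __init__(self, text):
        self.text = text
    def _combine(self, other, op):
        if isinstance(other, Units) and other.text == self.text:
            return Units(self.text)
        raise TypeError('incompatible units')

assert combine_attr('V', 'V') == 'V'       # equal -> propagated
assert combine_attr('V', 'A') is None      # unequal -> dropped
assert combine_attr(Units('m'), Units('m')).text == 'm'  # hook used
```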
@jason-s can you show an example of what you are wanting to do? pseudo-code is fine
You can simply add to the _metadata class-level attribute and then it will propagate that attribute.
Here is a longer discussion of the issue:
https://github.com/pydata/pandas/issues/2485
OK. I'll try to put some time in tonight. As the issue feature in GitHub is a little clunky, what I'll probably do is create a sample IPython notebook and publish it as a gist.
The _metadata attribute works fine with Series but seems to behave oddly in DataFrame objects.
@jason-s Based on my experience with xray, the biggest complexity is how you handle metadata arguments that you can't (or don't want to) check for equality, e.g., if the metadata could be a numpy array, for which equality checks are elementwise, or worse, one with some missing values (note np.nan != np.nan). Of course, there are workarounds for this sort of thing, but it's pretty awkward.
I'll add more in #8572.
@hugadams Thanks for the xray plug. Next time use my GH handle and github will ping me automatically :).
Got it, sorry
Any news on this issue? I just found myself wishing for the possibility to attach metadata (probably in a dict) to a dataframe.
It's certainly possible to add a default propagated attribute like .attrs via the _metadata/__finalize__ machinery. IIRC geopandas does this.
But it would need quite a bit of auditing and testing. You are welcome to have a go. Can you show your non-trivial use case?
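The geopandas-style approach mentioned here is to override __finalize__ itself, so every operation that calls it copies the attribute from the source object. A minimal sketch (the class and attribute names are made up for illustration; this is not geopandas' actual code, and __finalize__ is not called for every pandas operation):

```python
import pandas as pd

class AttrsFrame(pd.DataFrame):
    _metadata = ['my_attrs']

    @property
    def _constructor(self):
        return AttrsFrame

    def __finalize__(self, other, method=None, **kwargs):
        # copy my_attrs from the source object whenever pandas hands us one
        if isinstance(other, AttrsFrame):
            object.__setattr__(self, 'my_attrs',
                               getattr(other, 'my_attrs', None))
        return self

df = AttrsFrame({'A': [1, 2]})
df.my_attrs = {'instrument': 'spectrometer'}  # example metadata
print(df.head(1).my_attrs)                    # survives the operation
```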
My use case would be similar to what I imagine @hugadams meant when talking about working with spectroscopy results: data that are constant for the whole dataframe. I used dataframe.columns.name for this - it doesn't feel clean or idiomatic, but it was sufficient for this one case since I only wanted to attach one string.
I have the same use case as @bilderbuchi (recording scientific experimental metadata).
It's now much easier to subclass a dataframe and add your own attributes and methods. This wasn't the case when I started the issue.
- subject information - genotype, gender, age
- experiment information - version hashes, config hashes
Yeah, but something that round-trips through a vanilla pickled dataframe would be preferable.
Would the features offered by xarray be something that can be adopted here? See Data Structures (http://xarray.pydata.org/en/stable/data-structures.html).
They have data attributes. If pandas could get the same features, this would be great: unit conversion, unit propagation, etc.
I think xarray is what you want.
You may also try this metadataframe class I wrote a few years ago. It may no longer work with recent pandas versions, but I haven't tried.
You should be able to download that file, then just make a class that has the attributes you want, i.e.:
df = MetaDataframe()
df.a = a
df.b = b
I thought that after 0.16 it was possible to simply subclass a dataframe, right? I.e.:
class MyDF(DataFrame):
    a = 50
    b = 20
Or is this not the case?
Here's what I was talking about:
http://pandas.pydata.org/pandas-docs/stable/internals.html#override-constructor-properties
I think xarray is what you want.
So are you saying that everyone aiming to use metadata would be better off using xarray?
They have data attributes. If pandas could get the same features, this would be great. Unit conversion, unit propagation, etc.
Just to be clear, xarray does support adding arbitrary metadata, but not automatic unit conversion. We could hook up a library like pint to handle this, but it's difficult to get all the edge cases working until numpy has better dtype support.
I think 'automatic unit conversion based on metadata attached to series' is a significantly different and more involved feature request than this issue. I hope a simpler upstream-supported solution allowing attaching simple text-only metadata can be found before increasing the scope too much.
This is quite simple in current versions of pandas. I am using a sub-class here for illustration purposes.
Really all that would be needed is adding __finalize__ to most of the construction methods (this already exists for Series, but not really for DataFrame).
Unambiguous propagation would be quite easy, and users could add in their own __finalize__ to handle more complicated cases (e.g. what would you do with df + df2?).
In [1]: from pandas import DataFrame
   ...:
   ...: class MyDataFrame(DataFrame):
   ...:     _metadata = ['attrs']
   ...:
   ...:     @property
   ...:     def _constructor(self):
   ...:         return MyDataFrame
   ...:
   ...:     def _combine_const(self, other, *args, **kwargs):
   ...:         return super(MyDataFrame, self)._combine_const(other, *args, **kwargs).__finalize__(self)

In [2]: df = MyDataFrame({'A': [1, 2, 3]})

In [3]: df.attrs = {'foo': 'bar'}

In [4]: df.attrs
Out[4]: {'foo': 'bar'}

In [5]: (df+1).attrs
Out[5]: {'foo': 'bar'}
Would take a patch for this; the modifications are pretty straightforward. It's the testing that is the key here.
@jreback is there a generic way to persist metadata amongst all transforms applied to a dataframe, including groupbys? Or would one have to go through and override a lot of methods to call __finalize__?
@postelrich for most/all things, __finalize__ should already be defined (and so in theory you can make it persist attributes). It's not tested really well, though.
For Series I think this is quite robust; for DataFrame it's pretty good. I doubt this works at all for groupby / merge / most reductions. Those are really dependent on __finalize__ (it may or may not be called); that is the simple part. The hard part is deciding what to do.
I've been working on an implementation of this that handles the propagation problem by making the Metadata object itself subclass Series, then patching Series to relay methods to Metadata. Roughly:
class MSeries(pd.Series):
    def __init__(self, *args, **kwargs):
        pd.Series.__init__(self, *args, **kwargs)
        self.metadata = SMeta(self)

    def __add__(self, other):
        res = pd.Series.__add__(self, other)
        res.metadata = self.metadata.__add__(other)
        return res

class SMeta(pd.Series):
    def __init__(self, parent):
        super(...)
        self.parent = parent

    def __add__(self, other):
        new_meta = SMeta(index=self.index)
        other_meta = [... other or other.metadata or None depending ...]
        for key in self.index:
            new_meta[key] = self[key].__add__(other)
So it is up to the individual MetaDatum classes to figure out how to propagate.
I've generally got this working. The part that I have not gotten working is the desired MFrame behavior: df.metadata['A'] is df['A'].metadata. Any ideas on how to make that happen?
Propagation of attributes (defined in _metadata) gives me some headaches...
Based on the code from jreback, I've tried the following:
from pandas import DataFrame

class MyDataFrame(DataFrame):
    _metadata = ['attrs']

    @property
    def _constructor(self):
        return MyDataFrame

    def _combine_frame(self, other, *args, **kwargs):
        return super(MyDataFrame, self)._combine_frame(other, *args, **kwargs).__finalize__(self)

dfA = MyDataFrame({'A': [1, 2, 3]})
dfA.attrs = {'foo': 'bar'}
dfB = MyDataFrame({'B': [6, 7, 8]})
dfB.attrs = {'fuzzy': 'busy'}
dfC = dfA.append(dfB)
dfC.attrs  # raises: 'MyDataFrame' object has no attribute 'attrs'
           # I would like it to be {'foo': 'bar'}
As jreback mentioned, choices have to be made about what to do with the appended attributes. However, I would be really helped if the attributes of just dfA simply propagated to dfC.
EDIT: more headache is better, it pushes me to think harder :). Solved it by stealing the __finalize__ solution that GeoPandas provides. __finalize__ works pretty well indeed. However, I'm not experienced enough to perform the testing.
Can't we just put metadata in the column name and change how columns are accessed? E.g. ["id"] would internally translate to {"name": "id"}.
I don't know the internals of pandas, so sorry if this is a little naive. To me it just seems that the column name is really consistent across operations.
My use case would be adding a description to "indicator variables" (just 0/1) which otherwise look like var#1, var#2, etc., and I do not want to pollute those names with the potentially long values they actually stand for.
I think we have _metadata (https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties) and .attrs defined for these metadata use cases. If these don't sufficiently cover the necessary use cases, new issues can be created about those two mechanisms. Closing.
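For reference, the built-in .attrs mentioned in the closing comment needs no subclassing at all (available since pandas 1.0; its propagation through operations is documented as experimental):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df.attrs['source'] = 'experiment-42'   # arbitrary dict of metadata

print(df.attrs)           # {'source': 'experiment-42'}
print(df.head(2).attrs)   # attrs survive many operations via __finalize__
```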
Related:
- https://github.com/pydata/pandas/issues/39 (column descriptions)
- https://github.com/pydata/pandas/issues/686 (serialization concerns)
- https://github.com/pydata/pandas/issues/447#issuecomment-11152782 (feature request, implementation variant)
Ideas and issues: