hgrecco / pint-pandas

Pandas support for pint

DataFrame.set_index puts units in DataFrame index cells #231

Open szaiserb opened 1 month ago

szaiserb commented 1 month ago

Bug description

DataFrame.set_index puts units in the DataFrame index cells. This surprised me, and I currently have to work around it. For the actual data cells of a DataFrame this behavior is clearly not intended (quote from the docs):

If you ever see units in the cells of the DataFrame, something isn’t right.

Minimum example

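import pandas as pd
import pint_pandas

# Unit registry used by pint-pandas (ureg was undefined in the original snippet)
ureg = pint_pandas.PintType.ureg

df = pd.DataFrame({'a': [1.0, 2.0]})
df['a'] = df['a'].astype(pint_pandas.PintType(ureg.second))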
print(df['a'])
print(df.set_index('a').index)

pint_pandas.show_versions()

Output:

0    1.0
1    2.0
Name: a, dtype: pint[second]
Index([1.0 second, 2.0 second], dtype='pint[second]', name='a')

{'numpy': '1.26.4', 'pandas': '2.2.1', 'pint': '0.23', 'pint_pandas': '0.5'}

mflova commented 1 month ago

Why isn't this supposed to be the desired behaviour? This is the way pandas works. When you perform set_index on a column, not only its values are used as the index but also its dtype.

andrewgsavage commented 1 month ago

Seeing the units in the cells means the data is stored as an array of quantities inside the PintArray, as opposed to an array of ints or floats.

It looks like one of the PintArray init paths doesn't behave as expected.

szaiserb commented 1 month ago

When you perform set_index on a column, not only its values are used as the index but also its dtype.

Using the column dtype for the index in .set_index() is perfect. However, my expectation is type(df.index[0]) = float and df.index.dtype = pint[<unit>], so that df.index behaves largely like df[<column_name>]. Having type(df.index[0]) = pint[<unit>] would only be needed for a mixed-type index (which I do not see any use case for).

andrewgsavage commented 1 month ago

Looks like it is a bug in pandas: Index doesn't use the formatting function of the data's dtype. https://github.com/pandas-dev/pandas/blob/3b48b17e52f3f3837b9ba8551c932f44633b5ff8/pandas/core/indexes/base.py#L1411

This is as expected:

df = df.set_index('a',drop=False)
i = df.index
i.values

<PintArray>
[1.0, 2.0]
Length: 2, dtype: pint[second]
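
To illustrate the diagnosis above, here is a minimal pure-Python sketch of how the same underlying data can print with or without units depending on which formatting path the container takes. It is a toy model only: ToyQuantity, ToyUnitArray, render_like_series and render_like_index are hypothetical names made up for this example and are not part of pandas, pint or pint-pandas. It assumes, as the i.values output suggests, that the magnitudes are stored as plain floats and only boxed into quantities on element access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyQuantity:
    """Hypothetical stand-in for a scalar quantity (magnitude plus unit)."""
    magnitude: float
    unit: str

    def __repr__(self) -> str:
        return f"{self.magnitude} {self.unit}"


class ToyUnitArray:
    """Hypothetical stand-in for a unit-aware array.

    Stores plain magnitudes and one shared unit, and only boxes a value
    into a ToyQuantity when a single element is accessed.
    """

    def __init__(self, magnitudes, unit):
        self.magnitudes = list(magnitudes)
        self.unit = unit

    def __len__(self):
        return len(self.magnitudes)

    def __getitem__(self, i):
        return ToyQuantity(self.magnitudes[i], self.unit)

    def formatter(self):
        # The array's own formatter prints the magnitude only, because
        # the unit is already shown once in the dtype.
        return lambda q: repr(q.magnitude)


def render_like_series(arr):
    """Format each element with the array's own formatter."""
    fmt = arr.formatter()
    return [fmt(arr[i]) for i in range(len(arr))]


def render_like_index(arr):
    """Ignore the array's formatter and fall back to each element's repr."""
    return [repr(arr[i]) for i in range(len(arr))]


arr = ToyUnitArray([1.0, 2.0], "second")
print(render_like_series(arr))  # magnitudes only
print(render_like_index(arr))   # magnitudes with units
print(arr.magnitudes)           # underlying storage is plain floats
```

In this sketch both renderings read the same stored floats. The units show up only because the second path skips the array-provided formatter, which mirrors the suspicion that the display is at fault and not the stored values.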