Open postelrich opened 7 years ago
This is working as expected, if perhaps a bit confusing and inconvenient.
MultiIndex
is essentially a Categorical - levels
is an index of the unique values contained within the index, excluding missing values, which are 'special' and marked with a -1
in the labels
(integer codes mapping values to levels).
I'm not sure what the best way to fillna on a MultiIndex is - this works, but is very hacky
df = pd.read_csv(io.StringIO("""
a,b,c
1,,2
3,4,5"""), index_col=[0,1])
fill_na_value = 999.0
new_levels = df.index.levels[1].append(pd.Index([fill_na_value]))
new_labels = np.where(df.index.labels[1] == -1, len(df.index.levels[1]), df.index.labels[1])
df.index = df.index.set_levels(new_levels, level=1)
df.index = df.index.set_labels(new_labels, level=1)
df
Out[25]:
c
a b
1 999.0 2
3 4.0 5
Perhaps should implement MultiIndex.fillna
I think the confusing part is that levels
returns normal Index
objects. I would expect those to work as usual.
I think the confusing part is that levels returns normal Index objects. I would expect those to work as usual.
huh? what would you be expecting?
@jreback I would expect the Index
objects returned from levels
to contain the nan
s appropriately so that methods like MultiIndex.levels[0].fillna
work.
@jreback, example:
In [4]: i = pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1]).index.levels[1]
In [5]: i
Out[5]: Float64Index([4.0], dtype='float64', name='b')
In [6]: i.fillna(-1)
Out[6]: Float64Index([4.0], dtype='float64', name='b')
vs.
In [8]: i = pd.Float64Index([1, np.nan, 2])
In [9]: i
Out[9]: Float64Index([1.0, nan, 2.0], dtype='float64')
In [10]: i.fillna(-1)
Out[10]: Float64Index([1.0, -1.0, 2.0], dtype='float64')
In theory levels
are primarily an implementation detail, although in practice a leaky one.
In my opinion the real solution here is support to MultiIndex.fillna
, sig something like this:
pd.MultiIndex.fillna(self, value=None, downcast=None, level=None)
Docstring:
Fill NA/NaN values with the specified value, optionally in the specified level(s)
Parameters
----------
value : scalar
Scalar value to use to fill holes (e.g. 0).
This value cannot be a list-likes.
downcast : dict, default is None
a dict of item->dtype of what to downcast if possible,
or the string 'infer' which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible)
level : int, level name, or sequence of int/level names (default None)
level(s) to fill for (None for all levels)
@chris-b1 if the index stored in levels
are going to have a different behavior than the usual Index
objects, should levels
be a hidden attribute?
Semantics, but in my view they don't have different behavior, but a different meaning. Index
(df.index) is an array-like of row labels, mapping 1-1 to rows in the DataFrame. MultiIndex.levels
is an ordered set of the unique non-null values for a particular level.
Categorical uses an Index
the same way.
In [43]: c = pd.Categorical([1, 2, np.nan])
In [44]: c
Out[44]:
[1.0, 2.0, NaN]
Categories (2, int64): [1, 2]
In [45]: c.categories
Out[45]: Int64Index([1, 2], dtype='int64')
@postelrich your example doesn't make any sense.
In [5]: df = pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1])
In [6]: df
Out[6]:
c
a b
1 NaN 2
3 4.0 5
In [7]: df.index
Out[7]:
MultiIndex(levels=[[1, 3], [4.0]],
labels=[[0, 1], [-1, 0]],
names=['a', 'b'])
In [8]: df.index.levels[0]
Out[8]: Int64Index([1, 3], dtype='int64', name='a')
In [9]: df.index.levels[1]
Out[9]: Float64Index([4.0], dtype='float64', name='b')
Missing values by-definition DO NOT show up in the levels, that is the entire point of the labels.
If you want the level values, then simply ask for them. These are then reconstructed. The .levels
array is simply a holder of a sparse form of the data; it is not marked as public, and we have getters and setters. So to be honest this is an implementation detail. I think we discussed in the past deprecating it entirely (which is not very hard)
In [11]: df.index.get_level_values(1)
Out[11]: Float64Index([nan, 4.0], dtype='float64', name='b')
Ok sure, maybe even just adding @chris-b1 's description to the docs "MultiIndex.levels is an ordered set of the unique non-null values for a particular level" would suffice.
Problem description
I am reading in a csv with
index_col
set to a list of values. Some of the column values are missing leading to nans in the index. If I dodf.index.levels[1]
, theFloat64Index
does not appear to have the nans in the unique value list. This renders nan related functions useless, likeFloat64Index.fillna
. However, the nans are still there if you dodf.index.get_level_values(1)
.Expected Output
In [26]: pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1]).index.levels[1] Out[26]: Float64Index([nan, 4.0], dtype='float64', name='b')
Output of
pd.show_versions()