pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.59k stars 17.9k forks source link

malformed index when reading csv with multiindex #17576

Open postelrich opened 7 years ago

postelrich commented 7 years ago
In [23]: pd.read_csv(io.StringIO("a,b,c\n1,,2"), index_col=1).index
Out[23]: Float64Index([nan], dtype='float64', name='b')

In [24]: pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,,4"), index_col=1).index
Out[24]: Float64Index([nan, nan], dtype='float64', name='b')

In [26]: pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1]).index.levels[1]
Out[26]: Float64Index([4.0], dtype='float64', name='b')

In [27]: pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=1).index
Out[27]: Float64Index([nan, 4.0], dtype='float64', name='b')

Problem description

I am reading in a csv with index_col set to a list of values. Some of the column values are missing leading to nans in the index. If I do df.index.levels[1], the Float64Index does not appear to have the nans in the unique value list. This renders nan related functions useless, like Float64Index.fillna. However, the nans are still there if you do df.index.get_level_values(1).

Expected Output

In [26]: pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1]).index.levels[1] Out[26]: Float64Index([nan, 4.0], dtype='float64', name='b')

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.3.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.20.3 pytest: 3.1.3 pip: 9.0.1 setuptools: 27.2.0 Cython: None numpy: 1.13.1 scipy: 0.19.1 xarray: None IPython: 6.1.0 sphinx: None patsy: 0.4.1 dateutil: 2.6.1 pytz: 2017.2 blosc: None bottleneck: None tables: 3.3.0 numexpr: 2.6.2 feather: None matplotlib: None openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.999 sqlalchemy: 1.1.11 pymysql: None psycopg2: 2.7.1 (dt dec pq3 ext lo64) jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None
chris-b1 commented 7 years ago

This is working as expected, if perhaps a bit confusing and inconvenient.

MultiIndex is essentially a Categorical - levels is an index of the unique values contained within the index, excluding missing values, which are 'special' and marked with a -1 in the labels (integer codes mapping values to levels).

I'm not sure what the best way to fillna on a MultiIndex is - this works, but is very hacky

df = pd.read_csv(io.StringIO("""
a,b,c
1,,2
3,4,5"""), index_col=[0,1])

fill_na_value = 999.0
new_levels = df.index.levels[1].append(pd.Index([fill_na_value]))
new_labels = np.where(df.index.labels[1] == -1, len(df.index.levels[1]), df.index.labels[1])
df.index = df.index.set_levels(new_levels, level=1)
df.index = df.index.set_labels(new_labels, level=1)

df
Out[25]: 
         c
a b       
1 999.0  2
3 4.0    5

Perhaps should implement MultiIndex.fillna

postelrich commented 7 years ago

I think the confusing part is that levels returns normal Index objects. I would expect those to work as usual.

jreback commented 7 years ago

I think the confusing part is that levels returns normal Index objects. I would expect those to work as usual.

huh? what would you be expecting?

postelrich commented 7 years ago

@jreback I would expect the Index objects returned from levels to contain the nans appropriately so that methods like MultiIndex.levels[0].fillna work.

postelrich commented 7 years ago

@jreback, example:

In [4]: i = pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1]).index.levels[1]

In [5]: i
Out[5]: Float64Index([4.0], dtype='float64', name='b')

In [6]: i.fillna(-1)
Out[6]: Float64Index([4.0], dtype='float64', name='b')

vs.

In [8]: i = pd.Float64Index([1, np.nan, 2])

In [9]: i
Out[9]: Float64Index([1.0, nan, 2.0], dtype='float64')

In [10]: i.fillna(-1)
Out[10]: Float64Index([1.0, -1.0, 2.0], dtype='float64')
chris-b1 commented 7 years ago

In theory levels are primarily an implementation detail, although in practice a leaky one.

In my opinion the real solution here is support to MultiIndex.fillna, sig something like this:

pd.MultiIndex.fillna(self, value=None, downcast=None, level=None)

Docstring:
Fill NA/NaN values with the specified value, optionally in the specified level(s)

Parameters
----------
value : scalar
    Scalar value to use to fill holes (e.g. 0).
    This value cannot be a list-likes.
downcast : dict, default is None
    a dict of item->dtype of what to downcast if possible,
    or the string 'infer' which will try to downcast to an appropriate
    equal type (e.g. float64 to int64 if possible)
level : int, level name, or sequence of int/level names (default None)
    level(s) to fill for (None for all levels)
postelrich commented 7 years ago

@chris-b1 if the index stored in levels are going to have a different behavior than the usual Index objects, should levels be a hidden attribute?

chris-b1 commented 7 years ago

Semantics, but in my view they don't have different behavior, but a different meaning. Index (df.index) is an array-like of row labels, mapping 1-1 to rows in the DataFrame. MultiIndex.levels is an ordered set of the unique non-null values for a particular level.

Categorical uses an Index the same way.

In [43]: c = pd.Categorical([1, 2, np.nan])

In [44]: c
Out[44]: 
[1.0, 2.0, NaN]
Categories (2, int64): [1, 2]

In [45]: c.categories
Out[45]: Int64Index([1, 2], dtype='int64')
jreback commented 7 years ago

@postelrich your example doesn't make any sense.

In [5]: df = pd.read_csv(io.StringIO("a,b,c\n1,,2\n3,4,5"), index_col=[0,1])

In [6]: df
Out[6]: 
       c
a b     
1 NaN  2
3 4.0  5

In [7]: df.index
Out[7]: 
MultiIndex(levels=[[1, 3], [4.0]],
           labels=[[0, 1], [-1, 0]],
           names=['a', 'b'])

In [8]: df.index.levels[0]
Out[8]: Int64Index([1, 3], dtype='int64', name='a')

In [9]: df.index.levels[1]
Out[9]: Float64Index([4.0], dtype='float64', name='b')

Missing values by-definition DO NOT show up in the levels, that is the entire point of the labels.

jreback commented 7 years ago

If you want the level values, then simply ask for them. These are then reconstructed. The .levels array is simply a holder of a sparse form of the data; it is not marked as public, and we have getters and setters. So to be honest this is an implementation detail. I think we discussed in the past deprecating it entirely (which is not very hard)

In [11]: df.index.get_level_values(1)
Out[11]: Float64Index([nan, 4.0], dtype='float64', name='b')
postelrich commented 7 years ago

Ok sure, maybe even just adding @chris-b1 's description to the docs "MultiIndex.levels is an ordered set of the unique non-null values for a particular level" would suffice.