pandas-dev / pandas-stubs

Public type stubs for pandas
BSD 3-Clause "New" or "Revised" License

Typing of DataFrame index and columns miss multiindex #994

Closed ldouteau closed 3 weeks ago

ldouteau commented 3 weeks ago

Hi,

Describe the bug: The types returned by DataFrame.index and DataFrame.columns are, respectively, Index and Index[str]. This is incorrect when the DataFrame uses a MultiIndex for its index and/or columns.

To Reproduce 1.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape((3, 2)),
    index=pd.MultiIndex.from_product((("i",), ("j1", "j2", "j3"))),
    columns=pd.MultiIndex.from_product((("a",), ("b1", "b2"))),
)
idx = df.index  # type shown as Index
cols = df.columns  # type shown as Index[str]
  2. pyright version 1.1.373, commit ee424479
  3. None. Just check the types revealed for idx and cols; they don't match the expected MultiIndex.

Additional context: The properties are implemented in https://github.com/pandas-dev/pandas-stubs/blob/b246fcff196b70e995a0c61ffa3ab78a64c07e7d/pandas-stubs/core/frame.pyi#L1519 and https://github.com/pandas-dev/pandas-stubs/blob/b246fcff196b70e995a0c61ffa3ab78a64c07e7d/pandas-stubs/core/frame.pyi#L1533

The getters should be updated too, I guess. I'm not sure whether that should be a separate issue.

Dr-Irv commented 3 weeks ago

MultiIndex is a subclass of Index, so the result for index is technically correct. From a static typing perspective, we can't track the type of the index that is inside of a DataFrame or even the type of Index backing the columns. We've chosen the most common values, so that most people don't have to cast the result. If you know that df.index or df.columns is a MultiIndex, you will have to cast the result.
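For readers landing here, the cast described above could look roughly like the following. This is only a sketch that reuses the DataFrame from the report; typing.cast does nothing at runtime, so it is only safe when you already know both axes are MultiIndex.

```python
from typing import cast

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(6).reshape((3, 2)),
    index=pd.MultiIndex.from_product((("i",), ("j1", "j2", "j3"))),
    columns=pd.MultiIndex.from_product((("a",), ("b1", "b2"))),
)

# cast() only informs the type checker; it performs no runtime check.
idx = cast(pd.MultiIndex, df.index)
cols = cast(pd.MultiIndex, df.columns)

# MultiIndex-specific members can now be used without a typing complaint.
n_index_levels = idx.nlevels
top_level_columns = cols.get_level_values(0)
```

If the assumption is ever wrong, the cast will not catch it; an explicit isinstance check is the runtime-safe alternative.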

Part of the issue here is that if you have a DataFrame, you can call set_index(), and that could make the index of the DF either a regular Index or a MultiIndex. Tracking that with static typing doesn't seem possible.
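A small runtime illustration of the set_index() point, using a made-up three-column frame (the column names are purely illustrative): the same method yields a plain Index or a MultiIndex depending on the argument, while the static return type is DataFrame in both cases.

```python
import pandas as pd

# Hypothetical example data, not taken from the report.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [0.1, 0.2]})

# One key: the resulting index is a regular (non-multi) Index.
single = df.set_index("a")

# Two keys: the resulting index is a MultiIndex with two levels.
multi = df.set_index(["a", "b"])

single_is_multi = isinstance(single.index, pd.MultiIndex)
multi_is_multi = isinstance(multi.index, pd.MultiIndex)
```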

I'm closing this, but am willing to reopen if you have suggestions for handling this.

ldouteau commented 3 weeks ago

> MultiIndex is a subclass of Index

Got it, I missed that. Then the typing for df.index is OK, but df.columns should use Index instead of Index[str].

> If you know that df.index or df.columns is a MultiIndex, you will have to cast the result.

That's what I ended up doing, even though I don't like adding those lines. It's quite sensitive to API changes, IMO.

Dr-Irv commented 3 weeks ago

> Got it, I missed that. Then the typing for df.index is OK, but df.columns should use Index instead of Index[str]

I think we used to have that, but that meant there were certain cases where you'd have to do an inconvenient cast. For example, if you did something like df[df.columns[0]], you'd get a typing error, because df.columns[0] would be typed as a Scalar.
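For concreteness, the pattern in question looks like the sketch below, with a made-up two-column frame. It runs fine at runtime. The static behavior is as described in the comment above and is not verified here: with columns typed as Index[str] the label is a str and the lookup type-checks, whereas with a bare Index it would be a Scalar and need a cast.

```python
import pandas as pd

# Hypothetical example data, not taken from the report.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Take the first column label, then use it to select that column.
first_label = df.columns[0]
first_column = df[first_label]
```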