modin-project / modin

Modin: Scale your Pandas workflows by changing a single line of code
http://modin.readthedocs.io
Apache License 2.0

`loc` on MultiIndex row & column DataFrame returns wrong view of data #4684

Open pyrito opened 2 years ago

pyrito commented 2 years ago

System information

import modin.pandas as pd
import pandas

multi_index = pd.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pd.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
data = [[100, 300, 900, 400], [200, 500, 300, 600]]
df = pd.DataFrame(data, columns=cols, index=multi_index)  # Modin DataFrame
pdf = pandas.DataFrame(data, columns=cols, index=multi_index)  # pandas DataFrame

pdf.loc[("r0"), ("Gasoline", "Toyota")]
df.loc[("r0"), ("Gasoline", "Toyota")]  # Returns wrong value


### Describe the problem
pandas returns this output:

Fee
rA    100
Name: (Gasoline, Toyota), dtype: int64

whereas Modin gives us this:

100


The root cause seems to lie in how we handle `loc` lookups for MultiIndex DataFrames in the first place, and it appears to be different from the other MultiIndex issues we've had to deal with.
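For reference, here is a minimal pandas-only sketch of the behavior Modin is expected to match. It rebuilds the frame from the reproducer above; the variable names `partial` and `full` are only illustrative. Note that `("r0")` is just the string `"r0"` (parentheses without a comma do not make a tuple), so the row key names only the first of the two row levels.

```python
import numpy as np
import pandas

rows = pandas.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pandas.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
pdf = pandas.DataFrame(
    [[100, 300, 900, 400], [200, 500, 300, 600]], index=rows, columns=cols
)

# Partial row key: only the "Courses" level is given, so the "Fee" level
# survives and pandas returns a one-element Series.
partial = pdf.loc["r0", ("Gasoline", "Toyota")]

# Full row key: both row levels are given, so pandas returns a scalar.
full = pdf.loc[("r0", "rA"), ("Gasoline", "Toyota")]

print(type(partial).__name__, np.ndim(partial))
print(type(full).__name__, np.ndim(full))
```

The value reported for Modin above (a bare `100`) is what pandas returns only for the full row key, which suggests the partial key is being treated as if it were a full one.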

pyrito commented 2 years ago

I did a bit of investigation here, and it looks like `ndim` is computed as 0 for this case, which appears to be incorrect: pandas returns a one-element Series for this lookup, so the result should be one-dimensional. We may have the wrong abstraction for the MultiIndex case.
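As a rough illustration of what the dimensionality should be, the sketch below counts how many axes are left only partially specified by a label key and compares that with what pandas actually returns. The helpers `names_every_level` and `expected_loc_ndim` are hypothetical names invented for this example; they are not Modin's implementation. They also assume unique indexes and only scalar or tuple label keys (no lists, slices, or boolean masks).

```python
import numpy as np
import pandas


def names_every_level(key, index):
    """Hypothetical helper: does a scalar/tuple label key name every level of `index`?"""
    n_labels = len(key) if isinstance(key, tuple) else 1
    return n_labels == index.nlevels


def expected_loc_ndim(frame, row_key, col_key):
    """Hypothetical helper: each axis with unspecified levels keeps one dimension."""
    axes = ((row_key, frame.index), (col_key, frame.columns))
    return sum(not names_every_level(key, index) for key, index in axes)


rows = pandas.MultiIndex.from_tuples(
    [("r0", "rA"), ("r1", "rB")], names=["Courses", "Fee"]
)
cols = pandas.MultiIndex.from_tuples(
    [
        ("Gasoline", "Toyota"),
        ("Gasoline", "Ford"),
        ("Electric", "Tesla"),
        ("Electric", "Nio"),
    ]
)
frame = pandas.DataFrame(
    [[100, 300, 900, 400], [200, 500, 300, 600]], index=rows, columns=cols
)

cases = [
    ("r0", ("Gasoline", "Toyota")),          # partial row key, full column key
    (("r0", "rA"), ("Gasoline", "Toyota")),  # full row key, full column key
]
for row_key, col_key in cases:
    predicted = expected_loc_ndim(frame, row_key, col_key)
    actual = np.ndim(frame.loc[row_key, col_key])
    print(row_key, col_key, predicted, actual)
```

The point of the sketch is only that the number of levels named by each key has to enter the calculation; a check that looks at the keys as plain scalars would report 0 for the first case.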

eromoe commented 1 year ago

I ran into the same problem.

import numpy as np
import modin.pandas as pd

df2 = pd.DataFrame(
    np.arange(16).reshape(-1, 4),
    index=pd.MultiIndex.from_tuples(zip(list('zxzx'), [0, 1, 2, 4]), names=['qq', 'ww']),
    columns=list('abcd'),
)
df3 = df2.unstack(0)

# modin
df3.loc[:, ['a']]
# pandas
df3._to_pandas().loc[:, ['a']]

# modin
df3.loc[:, [('a', 'x'), ('a', 'z')]]

# pandas
df3._to_pandas().loc[:, [('a', 'x'), ('a', 'z')]]

# modin
df3[['a']]

# pandas
df3._to_pandas()[['a']]
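
As a textual baseline for the comparisons above, here is a pandas-only sketch of the same three selections. The variable names are illustrative, and the script says nothing about what Modin returns; it only records that in pandas all three forms select the same two columns under the top-level label `a`.

```python
import numpy as np
import pandas
from pandas.testing import assert_frame_equal

index = pandas.MultiIndex.from_tuples(
    list(zip("zxzx", [0, 1, 2, 4])), names=["qq", "ww"]
)
base = pandas.DataFrame(
    np.arange(16).reshape(-1, 4), index=index, columns=list("abcd")
)
wide = base.unstack(0)  # columns become a MultiIndex of (letter, qq)

by_top_level = wide.loc[:, ["a"]]
by_full_keys = wide.loc[:, [("a", "x"), ("a", "z")]]
by_getitem = wide[["a"]]

# In pandas the three selections are identical frames.
assert_frame_equal(by_top_level, by_full_keys)
assert_frame_equal(by_top_level, by_getitem)
print(list(by_top_level.columns))
```

The same `assert_frame_equal` calls could be pointed at `df3.loc[...]._to_pandas()` to check the Modin results without relying on screenshots.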
