pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.73k stars 17.95k forks source link

BUG: inconsistent behavior on multi-dimensional slicing based on type #45303

Open tacaswell opened 2 years ago

tacaswell commented 2 years ago

Pandas version checks

Reproducible Example

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

x, y = np.random.randn(2, 100)
x = pd.Series(x, dtype="Float32")
z = pd.Series(range(5))

x[:, None] # fails with ValueError
z[:, None] # warns but promised to be IndexError

Issue Description

Per discussion around https://github.com/pandas-dev/pandas/issues/35527 / https://github.com/matplotlib/matplotlib/issues/18158and related links there have been a bunch of issues about multi-dimensional indexing in to Series and objects that have inconsistent .ndims and .shape.

In a departure from numpy it was my understanding that the implicit broadcasting to higher dimensions was going to be dropped by pandas at some point in the future (although it seems to still warn for builtin types). However, for the new missing-data types trying to do this 2D slicing raises a ValueError which was reported as a bug to Matplotlib https://github.com/matplotlib/matplotlib/issues/22125

In [15]: x[:, None]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/series.py, in Series._get_values(self, indexer)
   1025 try:
-> 1026     new_mgr = self._mgr.getitem_mgr(indexer)
   1027     return self._constructor(new_mgr).__finalize__(self)

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/internals/managers.py, in SingleBlockManager.getitem_mgr(self, indexer)
   1635 blk = self._block
-> 1636 array = blk._slice(indexer)
   1637 if array.ndim > 1:
   1638     # This will be caught by Series._get_values

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/internals/blocks.py, in ExtensionBlock._slice(self, slicer)
   1553         raise AssertionError(
   1554             "invalid slicing for a 1-ndim ExtensionArray", slicer
   1555         )
-> 1557 return self.values[slicer]

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/arrays/masked.py, in BaseMaskedArray.__getitem__(self, item)
    144 item = check_array_indexer(self, item)
--> 146 return type(self)(self._data[item], self._mask[item])

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/arrays/floating.py, in FloatingArray.__init__(self, values, mask, copy)
    251     raise TypeError(
    252         "values should be floating numpy array. Use "
    253         "the 'pd.array' function instead"
    254     )
--> 255 super().__init__(values, mask, copy=copy)

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/arrays/masked.py, in BaseMaskedArray.__init__(self, values, mask, copy)
    122 if values.ndim != 1:
--> 123     raise ValueError("values must be a 1D array")
    124 if mask.ndim != 1:

ValueError: values must be a 1D array

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Input In [15], in <module>
----> 1 x[:, None]

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/series.py, in Series.__getitem__(self, key)
    963     key = np.asarray(key, dtype=bool)
    964     return self._get_values(key)
--> 966 return self._get_with(key)

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/series.py, in Series._get_with(self, key)
    976     raise TypeError(
    977         "Indexing a Series with DataFrame is not "
    978         "supported, use the appropriate DataFrame column"
    979     )
    980 elif isinstance(key, tuple):
--> 981     return self._get_values_tuple(key)
    983 elif not is_list_like(key):
    984     # e.g. scalars that aren't recognized by lib.is_scalar, GH#32684
    985     return self.loc[key]

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/series.py, in Series._get_values_tuple(self, key)
   1008 def _get_values_tuple(self, key):
   1009     # mpl hackaround
   1010     if com.any_none(*key):
-> 1011         result = self._get_values(key)
   1012         deprecate_ndim_indexing(result, stacklevel=5)
   1013         return result

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/series.py, in Series._get_values(self, indexer)
   1027     return self._constructor(new_mgr).__finalize__(self)
   1028 except ValueError:
   1029     # mpl compat if we look up e.g. ser[:, np.newaxis];
   1030     #  see tests.series.timeseries.test_mpl_compat_hack
   1031     # the asarray is needed to avoid returning a 2D DatetimeArray
-> 1032     return np.asarray(self._values[indexer])

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/arrays/masked.py, in BaseMaskedArray.__getitem__(self, item)
    142     return self._data[item]
    144 item = check_array_indexer(self, item)
--> 146 return type(self)(self._data[item], self._mask[item])

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/arrays/floating.py, in FloatingArray.__init__(self, values, mask, copy)
    250 if not (isinstance(values, np.ndarray) and values.dtype.kind == "f"):
    251     raise TypeError(
    252         "values should be floating numpy array. Use "
    253         "the 'pd.array' function instead"
    254     )
--> 255 super().__init__(values, mask, copy=copy)

File ~/.virtualenvs/sys310/lib/python3.10/site-packages/pandas/core/arrays/masked.py, in BaseMaskedArray.__init__(self, values, mask, copy)
    118     raise TypeError(
    119         "mask should be boolean numpy array. Use "
    120         "the 'pd.array' function instead"
    121     )
    122 if values.ndim != 1:
--> 123     raise ValueError("values must be a 1D array")
    124 if mask.ndim != 1:
    125     raise ValueError("mask must be a 1D array")

ValueError: values must be a 1D array

Given

   1008 def _get_values_tuple(self, key):
   1009     # mpl hackaround

I think we need to revisit this and make sure we are not both grumbling about the other and putting in dueling workarounds ;)

Expected Behavior

Behavior of the container to be independent of the contained type.

Installed Versions

# I have broken distutils and pd.show_versions() fails in a weird way?
jbrockmendel commented 2 years ago
x[:, None] # fails with ValueError

This goes through FloatingArray.__getitem__, which raises bc 2D FloatingArrays are not supported (much to my consternation). (Actually on master this returns a 2D object-dtype ndarray, which i could imagine being worse)

z[:, None]

This goes through ndarray.__getitem__ which works just how you would expect.

IIRC the plan is to enforce the deprecation in 2.0 so these would both raise.

I think we need to revisit this and make sure we are not both grumbling about the other and putting in dueling workarounds ;)

Do you need series[:, None] to work, or can will this deprecation being enforced be OK on your end?

tacaswell commented 2 years ago

We currently have a bunch of complicated warning contexts / try ... excepts to handle this, if both raised we would be perfectly happy! The issue is that we were not catching ValueError so a spurious error made it out to our user. I would want either in the FloatingArray case the series __getitem__ to catch the ValueError and re-raise an InedxingError or to have assurances that when the deprecation go through both raise ValueError.

The context we are using this in is https://github.com/matplotlib/matplotlib/blob/8d7a2b9d2a38f01ee0d6802dd4f9e98aec812322/lib/matplotlib/cbook/__init__.py#L1301-L1343 . We are using the failure of series[:, None] to identify we have been passed a pandas object and make no further use of the result.

We are also open to better suggestions about how to handle this (without importing pandas). I guess we would do some if 'pandas' in sys.modules logic, but that seems a bit messy.

https://github.com/pandas-dev/pandas/pull/30588 is probably also relevant.

jbrockmendel commented 2 years ago

if both raised we would be perfectly happy [...] or to have assurances that when the deprecation go through both raise ValueError

Both will raise in 2.0. I'm not sure if we have pinned down exactly what type of exception will be raised, but ValueError seems like a reasonable option.

We are also open to better suggestions about how to handle this (without importing pandas). I guess we would do some if 'pandas' in sys.modules logic, but that seems a bit messy.

With the appropriate caveats about relying on implementation details: most (non-scalar) pandas classes have a "_typ: str" attribute that are used to bootstrap isinstance checks without circular imports.

Pavanmahaveer7 commented 1 year ago

To add a new dimension to the existing data frame, by using values method we could do it. series will be failed to add dimension, Here everything depends on the series type we use. Float series type has that issue not the Integer.