pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.91k stars 18.03k forks

Ambiguous behaviour when index is bool type? #43194

Open MacroBull opened 3 years ago

MacroBull commented 3 years ago

Code Sample, a copy-pastable example

pandas.Series([True, True, False]).value_counts()[[False]]

Problem description

pandas/core/indexers.py in check_array_indexer(array, indexer)
    468         # GH26658
    469         if len(indexer) != len(array):
--> 470             raise IndexError(
    471                 f"Boolean index has wrong length: "
    472                 f"{len(indexer)} instead of {len(array)}"

IndexError: Boolean index has wrong length: 1 instead of 2

It seems that Series.__getitem__ and check_bool_indexer do not work well for a bool-typed index: the list [False] is treated as a boolean mask instead of a list of labels.
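
A minimal sketch of the ambiguity, assuming a reasonably recent pandas release. A list containing only booleans appears to be interpreted as a positional mask rather than as labels, so a list whose length happens to match the Series selects by position without any error. The variable names here are illustrative only.

```python
import pandas as pd

counts = pd.Series([True, True, False]).value_counts()
# counts is indexed by the booleans themselves: True -> 2, False -> 1

# Same length as counts: treated as a mask (keep row 0, drop row 1),
# not as a request for the labels True and False.
masked = counts[[True, False]]

# Different length: on the versions discussed in this thread the mask
# interpretation raises IndexError; both outcomes are handled here.
try:
    counts[[False]]
    outcome = "no error"
except IndexError as exc:
    outcome = f"IndexError: {exc}"

print(masked)
print(outcome)
```

If the list were read as labels, the first lookup would return both rows; as a mask it returns only the row labelled True.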

Expected Output

<pandas.Series>
False    1
dtype: int64

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.1.4
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

attack68 commented 3 years ago

This is an edge case.

There are far more practical uses for an Indexer with Boolean values than trying to index True/False indexed DataFrame. I am not convinced the codebase warrants a check of this kind, except possibly if it is very simple and doesn't interfere with a lot of logic.

A workaround for a dynamic workflow is: pd.Series([True, True, False]).value_counts().sort_index().iloc[0]

simonjayhawkins commented 3 years ago

There are far more practical uses for an Indexer with Boolean values than trying to index True/False indexed DataFrame.

hmm. this does make the result inconsistent with say pandas.Series([1, 2, 3])[[0]]

A workflow could potentially have the index generated from unknown data. e.g as in the OP, from value_counts

I think we should mark this as a bug, pending further investigation on how feasible it is to distinguish between a fancy indexer and a boolean indexer.
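
For comparison, a small sketch of the consistent cases, assuming current pandas behaviour: a one-element list of integer or string labels is resolved as a label lookup and returns a one-row Series. The two Series below are made up for illustration.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                      # default integer index 0, 1, 2
letters = pd.Series([10, 20], index=["a", "b"])  # string index

# One-element lists of labels return one-row Series in both cases.
by_int_label = ints[[0]]
by_str_label = letters[["a"]]

print(by_int_label)
print(by_str_label)
```

The boolean-indexed case from the original report is the odd one out, because the same list syntax is also the spelling for a mask.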

Rudransh24 commented 3 years ago

Hi! I am new to open source and wanted to contribute. I wanted to know how to proceed and whether this issue is open to a PR?

simonjayhawkins commented 3 years ago

Thanks @Rudransh24. Probably best to look for an issue labelled good first issue. https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22

This one probably requires more discussion due to potential ambiguity that could arise.

kwhkim commented 3 years ago

I am aware of this edge case. As far as I can tell, a list consisting only of True and False values is treated as a boolean mask, not as index labels.

In this case, I would do

x = pandas.Series([True, True, False]).value_counts()
x[ x.index == False]

or you can slip in some value other than True or False

pandas.Series([True, True, False]).value_counts().append(pandas.Series([0], index = ['None']), verify_integrity=False)[[True, 'None']]
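
A variant of the mask-based workaround, sketched under the assumption that the labels to select arrive as a list: Index.isin builds a mask with one entry per row, so its length always matches the Series. The name `wanted` is hypothetical.

```python
import pandas as pd

counts = pd.Series([True, True, False]).value_counts()
wanted = [False]  # labels to look up, not a mask

# Index.isin returns a boolean array with one entry per row of counts,
# so the mask can never have the wrong length.
selected = counts[counts.index.isin(wanted)]

print(selected)
```

Note that the result keeps the row order of the Series rather than the order of `wanted`, and labels that are not present are silently ignored.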

I think there should be something like .ibool for boolean indexing for completeness and consistency even though there is little chance that things like this happen.

I come from R. From a design perspective, Python, including pandas, has so many exceptions. For example, even importing a module has an edge case: import time never imports a local time.py. I think it's a poor design problem.

MacroBull commented 3 years ago

Yes, I encountered this issue while working on some generic code, where I cannot know in advance what the input type will be. The workaround I'm currently using is to call get_indexer manually and combine it with iloc, so that a bool indexer is never considered.

locs = df.index.get_indexer(index)
values = df.iloc[locs]

This is almost the same as what the direct index method does. The workarounds provided by @kwhkim, either forcing bool indexing or appending a 'None' label, could, I think, have some performance impact. In my opinion, when an ambiguous index is given, the direct __getitem__ should give a warning about the potential bool indexing or even raise an exception.
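
The get_indexer plus iloc workaround above can be wrapped in a small helper. This is only a sketch: `select_by_label` is a hypothetical name, not a pandas API. It assumes a unique index and adds a check for missing labels, since get_indexer reports them as -1 and iloc would otherwise quietly return the last row.

```python
import pandas as pd


def select_by_label(obj, labels):
    """Select rows of a Series/DataFrame by label, never as a boolean mask."""
    # get_indexer maps each label to its position; it needs a unique index.
    locs = obj.index.get_indexer(labels)
    if (locs == -1).any():
        # -1 marks a label that is not in the index.
        missing = [lab for lab, loc in zip(labels, locs) if loc == -1]
        raise KeyError(f"labels not found: {missing}")
    return obj.iloc[locs]


counts = pd.Series([True, True, False]).value_counts()
picked = select_by_label(counts, [False])
print(picked)
```

With this helper the list [False] is always read as a label, whatever the dtype of the index.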