pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: DataFrame.query() throws error when df has duplicate column names #59950

Open ddenuyl-bebr opened 2 weeks ago

ddenuyl-bebr commented 2 weeks ago

Reproducible Example

import pandas as pd

df = pd.DataFrame({'a': range(3), 'b': range(3), 'c': range(3)}).rename(columns={'b': 'a'})
print(df.query('c == 1'))

Issue Description

Since pandas 2.2.1 this throws an unexpected error: TypeError: dtype 'a int64 a int64 dtype: object' not understood

This is because DataFrame.query() calls DataFrame.eval() which in turn calls DataFrame._get_cleaned_column_resolvers().

The dict comprehension in DataFrame._get_cleaned_column_resolvers() was changed in version 2.2.1.

version 2.2.0

return {
    clean_column_name(k): Series(
        v, copy=False, index=self.index, name=k
    ).__finalize__(self)
    for k, v in zip(self.columns, self._iter_column_arrays())
    if not isinstance(k, int)
}

version 2.2.1

return {
    clean_column_name(k): Series(
        v, copy=False, index=self.index, name=k, dtype=self.dtypes[k]
    ).__finalize__(self)
    for k, v in zip(self.columns, self._iter_column_arrays())
    if not isinstance(k, int)
}

Because the dtype is now passed explicitly when each Series is created, this introduces the error described above: for a duplicate column name, self.dtypes[k] returns a Series instead of a single dtype.
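The lookup behavior described above can be checked in isolation. This is a minimal sketch that reuses the DataFrame from the reproducible example; the variable names dup_lookup and unique_lookup are illustrative only.

```python
import pandas as pd

# Same frame as in the reproducible example: two columns end up labeled 'a'.
df = pd.DataFrame({'a': range(3), 'b': range(3), 'c': range(3)}).rename(columns={'b': 'a'})

# df.dtypes is a Series indexed by column label, so a label that occurs
# twice selects two entries and the result is itself a Series.
dup_lookup = df.dtypes['a']

# A label that occurs once selects a single dtype object.
unique_lookup = df.dtypes['c']

print(type(dup_lookup).__name__, len(dup_lookup))
print(type(unique_lookup).__name__)
```

Passing dup_lookup as the dtype argument of a Series constructor is what produces the "not understood" TypeError quoted in the issue description.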

Expected Behavior

1) I would expect either the behavior prior to v2.2.1, where the above example would return:

>>> df.query('c == 1')
   a  a  c
1  1  1  1

Moreover, calling query() on column 'a' also works:

>>> df.query('a == 1')
   a  a  c
1  1  1  1

or 2) If the above behavior is unwanted, I would expect better error handling, something like:

>>> df.query('c == 1')
DuplicateColumnError: DataFrame.query() is not supported for DataFrames with duplicate column names
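Option 2 can be approximated on the caller's side. The following is only a sketch: DuplicateColumnError and query_unique_columns are hypothetical names (pandas defines neither), and the check relies on the public Index.is_unique and Index.duplicated() methods.

```python
import pandas as pd


class DuplicateColumnError(ValueError):
    """Hypothetical exception named in the issue; not part of pandas."""


def query_unique_columns(df: pd.DataFrame, expr: str) -> pd.DataFrame:
    """Hypothetical wrapper: run df.query(expr) only if column labels are unique."""
    if not df.columns.is_unique:
        # Collect each repeated label once for the error message.
        duplicated = df.columns[df.columns.duplicated()].unique().tolist()
        raise DuplicateColumnError(
            "DataFrame.query() is not supported for DataFrames with "
            f"duplicate column names: {duplicated}"
        )
    return df.query(expr)


df = pd.DataFrame({'a': range(3), 'b': range(3), 'c': range(3)}).rename(columns={'b': 'a'})

message = None
try:
    query_unique_columns(df, 'c == 1')
except DuplicateColumnError as exc:
    message = str(exc)
    print(message)
```

A wrapper like this fails early with a readable message instead of surfacing the internal TypeError. It does not restore the pre-2.2.1 behavior described in option 1.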

Installed Versions

INSTALLED VERSIONS

commit : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-165-generic
Version : #182-Ubuntu SMP Mon Oct 2 19:43:28 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.1.1
pip : 23.0
Cython : None
pytest : 8.2.0
hypothesis : 6.100.4
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.19.2
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.27
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

rhshadrach commented 2 weeks ago

Thanks for the report; it looks like changing DataFrame._get_cleaned_column_resolvers to be

return {
    clean_column_name(k): Series(
        v, copy=False, index=self.index, name=k, dtype=dtype
    ).__finalize__(self)
    for k, v, dtype in zip(self.columns, self._iter_column_arrays(), dtypes)
    if not isinstance(k, int)
}

will resolve the issue. PRs are welcome!
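The idea behind the suggestion is to pair each column with its dtype by position rather than looking the dtype up by label. The snippet above does not define the name dtypes; the sketch below assumes it stands for the frame's dtypes Series. It is a standalone illustration using only public pandas API, not the internal method itself: column_resolvers is a hypothetical helper, and the clean_column_name step is omitted.

```python
import pandas as pd


def column_resolvers(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for the private resolver builder."""
    resolvers = {}
    # Iterating df.dtypes yields one dtype per column in positional order,
    # so duplicate labels never trigger a label-based lookup.
    for pos, (name, dtype) in enumerate(zip(df.columns, df.dtypes)):
        if isinstance(name, int):
            continue  # mirrors the `if not isinstance(k, int)` filter
        values = df.iloc[:, pos].to_numpy()  # positional column access
        resolvers[name] = pd.Series(values, index=df.index, name=name, dtype=dtype)
    return resolvers


df = pd.DataFrame({'a': range(3), 'b': range(3), 'c': range(3)}).rename(columns={'b': 'a'})
resolvers = column_resolvers(df)
print(sorted(resolvers))
```

As in a dict comprehension, a repeated key keeps the last value, so the resolver for 'a' here refers to the second 'a' column.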

miguelcsx commented 2 weeks ago

Hi @rhshadrach, I'm working on this issue. Can you assign it to me? Thanks :+1:

saldanhad commented 2 weeks ago

Hi @miguelcsx, you can comment take to get this self assigned to you. Please refer to contributing guide here: https://pandas.pydata.org/docs/dev/development/contributing.html#finding-an-issue-to-contribute-to

PS: I am not a maintainer, just tried helping.

miguelcsx commented 2 weeks ago

take

miguelcsx commented 2 weeks ago

@saldanhad I didn't know this, thank you very much :rocket:

Asifussain commented 1 week ago

take

miguelcsx commented 1 week ago

Hi @rhshadrach, I've solved it and added a test to make sure it doesn't keep failing. Thanks.

ddenuyl-bebr commented 1 week ago

Thanks all for the swift response and resolution!

Asifussain commented 1 week ago

Hi, I am new to open source contribution. I was working on this, and the test is in progress. Has this already been closed?

sunlight798 commented 1 week ago

Hello, I am a new contributor to pandas. Is this issue still open? Can I work on it?

tohfas commented 1 week ago

Hi, I am looking at this issue for my university assignment. Can this issue be assigned to me? Please let me know. Thanks.

saldanhad commented 1 week ago

It looks like someone is already working on this. For ongoing PRs, sometimes they might not be linked directly to the issue, so please check the PRs as well to see if someone has already submitted one. If you're looking for another task, I recommend checking out issues labeled with good first issue.

You can refer to the contributing guide here