BUG: "cannot reindex from a duplicate axis" raised with a unique index, duplicated column names, and specific numpy array values

igorluppi commented 4 years ago

Code Sample

import pandas
import numpy as np

a = np.array([[1,2],[3,4]])

# Does NOT work
b = np.array([[0.5,6],[7,8]])
# b = np.array([[.5,6],[7,8]])  # Same problem

# This one works fine:
# b = np.array([[5,6],[7,8]])

dfA = pandas.DataFrame(a)
# This works fine EVEN with .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis=1)

print(df_new[df_new > 5])

Problem description

There is a bug that appears when specific numpy values are combined with duplicated DataFrame column names in a selection operation such as df[df > 5]. An exception is raised saying "cannot reindex from a duplicate axis", but it should not be, because:

The DataFrame has no duplicated index labels (df.index.is_unique is True)
The DataFrame has duplicated column names, but that should not be a problem for a selection operation such as df_new[df_new > 5]
The DataFrame holds float or int numpy values, and that should not change the behavior of the code

However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
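
The two axes in this report can be checked separately. The following is a minimal sketch, assuming the same arrays as the code sample above; the names index_is_unique, columns_are_unique and duplicated_labels are only illustrative.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.array([[1, 2], [3, 4]]))
df_b = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
df_new = pd.concat([df_a, df_b], axis=1)

# Row labels: 0 and 1, each used once.
index_is_unique = df_new.index.is_unique

# Column labels: 0 and 1 each appear twice after the concat.
columns_are_unique = df_new.columns.is_unique
duplicated_labels = df_new.columns[df_new.columns.duplicated()].tolist()

print(index_is_unique, columns_are_unique, duplicated_labels)
# True False [0, 1]
```

So the "duplicate axis" in the message can refer to the columns even when the index is unique.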

Expected Output

    0   1    0  1
0 NaN NaN  NaN  6
1 NaN NaN  7.0  8

Current Output

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-28-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : pt_BR.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

igorluppi commented 4 years ago

Moreover, doing this:

In [177]:  new_df = df.reset_index(drop=True) 
In [178]:  new_df[new_df > 10]

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

So, it's 100% sure that we have no duplicates here, so what is going on?

MarcoGorelli commented 4 years ago

Thanks @igorluppi

I just tried

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(150001, 792))
df[df>10]

and got no error - could you give us some more details about your dataframe? Do you still get the error if you only consider its head, or if you only use (say) its first 5 columns?

igorluppi commented 4 years ago

I have many dataframes, and I put all of them into a single one.

Let df_items be a list of dataframes. I got this error after building the frame with: df_final = pandas.concat(df_items, axis = 1)

However, I verified that df_final = reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)

works fine: I get the same resulting DataFrame, and df_final[df_final>10] works on it. But this method takes a long time to run; concat is at least 10 times faster.

Thanks to https://stackoverflow.com/questions/45885043/pandas-concat-cannot-reindex-from-a-duplicate-axis?rq=1 for this possible solution. But why does the error happen?
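
One possible way to keep the speed of concat while avoiding repeated column labels is the keys argument of pandas.concat, which adds an outer column level. This is only a sketch: the two small frames below are hypothetical stand-ins for the real df_items, and using the position in the list as the key is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in df_items.
df_items = [
    pd.DataFrame({"value": [1, 20]}),
    pd.DataFrame({"value": [0.5, 30.0]}),
]

# keys= adds an outer column level, so the labels become
# (0, "value") and (1, "value") instead of "value" twice.
df_final = pd.concat(df_items, axis=1, keys=list(range(len(df_items))))

# The column labels are unique, so the boolean selection aligns cleanly.
masked = df_final[df_final > 10]
print(masked)
```

The trade-off is that the result has MultiIndex columns, so later code has to select columns by tuple.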

MarcoGorelli commented 4 years ago

Applying df[df>10] I got "cannot reindex from duplicate axis",

I got this error using: df_final = pandas.concat(df_items, axis = 1)

Sorry, I'm a bit confused, which command gave you the error - pd.concat or df[df>10]?

igorluppi commented 4 years ago

pd.concat gave me df_final, and this df_final raises that error when I use df_final[df_final>10]. The interesting part is that when I use the reduce method and get a df_final2, that one works; in other words, df_final2[df_final2>10] works fine.

Moreover,

In [14]: df_final.equals(df_final2)
Out[14]: False

But I didn't find where the difference is.

igorluppi commented 4 years ago

@MarcoGorelli I found why the problem is happening, but it points to another problem regarding the exception I got. Give me a second.

igorluppi commented 4 years ago

@MarcoGorelli "cannot reindex from duplicate axis" should be split into two messages: "cannot reindex from duplicate index" and "cannot reindex from duplicate columns". I will explain why.

Why is that? Because all the messages and solutions I found told me to look at the index, but in my case the duplicates were in the columns.

But why did the second case work? reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items) In this case, when merge finds a duplicated column, it automatically appends the string "_x" to the duplicated name, so it becomes "duplicated_column_x". That is not the case for concat, which keeps the duplicated column name "duplicated_column".
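
The difference between the two approaches can be seen on a pair of one-column frames. This is a small sketch; the frames left and right are made up for illustration, and only two frames are merged here.

```python
import pandas as pd

left = pd.DataFrame({"duplicated_column": [1, 2]})
right = pd.DataFrame({"duplicated_column": [3, 4]})

# merge renames overlapping column names with its default suffixes.
merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")

# concat keeps the labels exactly as they are.
concatenated = pd.concat([left, right], axis=1)

print(list(merged.columns))
# ['duplicated_column_x', 'duplicated_column_y']
print(list(concatenated.columns))
# ['duplicated_column', 'duplicated_column']
```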

My suggestion

Please change the exception to state specifically whether the problem is in the columns or in the index. Just saying "duplicate axis" made it confusing to find the solution.

MarcoGorelli commented 4 years ago

Thanks @igorluppi

tbh I still can't reproduce the error:

import pandas as pd

df = pd.DataFrame([0, 1], columns=['a'])
new_df = pd.concat([df, df], axis=1)
new_df[new_df>0]  # works

could you try coming up with a minimal reproducible example?

igorluppi commented 4 years ago

Ok, I will create a simple example

igorluppi commented 4 years ago

@MarcoGorelli

import pandas
import numpy as np

a = np.array([[1,2],[3,4]])

# Does NOT work
b = np.array([[0.5,6],[7,8]])
# OR
# b = np.array([[.5,6],[7,8]])

# This one works fine:
# b = np.array([[5,6],[7,8]])

dfA = pandas.DataFrame(a)
# This works fine EVEN with .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)

df_new = pandas.concat([dfA, dfB], axis=1)

df_new[df_new>3]

~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3097         # trying to reindex on an axis with duplicates
   3098         if not self.is_unique and len(indexer):
-> 3099             raise ValueError("cannot reindex from a duplicate axis")
   3100 
   3101     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Basically, using .5 or 0.5 in the numpy array breaks the DataFrame operation. This might be a problem in the interaction between pandas and numpy.

The interesting part is that the numpy float values only break the code if the column names are duplicated.
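
One observable difference between the failing and the working variant is the dtype of the concatenated columns. The sketch below only shows that difference; whether mixed dtypes are the actual trigger is an assumption, not something confirmed in this thread. The names failing and working are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b_float = np.array([[0.5, 6], [7, 8]])  # variant reported as failing
b_int = np.array([[5, 6], [7, 8]])      # variant reported as working

failing = pd.concat([pd.DataFrame(a), pd.DataFrame(b_float)], axis=1)
working = pd.concat([pd.DataFrame(a), pd.DataFrame(b_int)], axis=1)

# dtype.kind is 'i' for signed integers and 'f' for floats.
failing_kinds = [dtype.kind for dtype in failing.dtypes]
working_kinds = [dtype.kind for dtype in working.dtypes]

print(failing_kinds)  # ['i', 'i', 'f', 'f']
print(working_kinds)  # ['i', 'i', 'i', 'i']
```

A single 0.5 makes numpy store the whole array b as floats, so the failing frame mixes integer and float columns under the same labels.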

MarcoGorelli commented 4 years ago

@igorluppi great, thanks! Could you edit this example into the original post?

igorluppi commented 4 years ago

@MarcoGorelli

For sure, it's done my friend!

igorluppi commented 4 years ago

@MarcoGorelli is it a bug ? Anything new ?

MarcoGorelli commented 4 years ago

cc @jorisvandenbossche

igorluppi commented 4 years ago

should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli

MarcoGorelli commented 4 years ago

should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli

I don't think so - I presume the core team is prioritising what'll be in the v1.0.2 release. I'm working on another issue at the moment but I plan to get back to this

igorluppi commented 4 years ago

Any news ? @MarcoGorelli @jorisvandenbossche

MarcoGorelli commented 4 years ago

I've not (yet) looked into this more, but you're welcome to submit a pull request if you like https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

Dr-Irv commented 4 years ago

This works fine in pandas 1.1.1

MarcoGorelli commented 4 years ago

This works fine in pandas 1.1.1

Any idea when it was fixed? It's probably good to make sure this was intentional and that there's a test for it...I'll do a git bisect

MarcoGorelli commented 4 years ago

If I've done git bisect correctly (which I'm not sure I have, see below), it looks like this was fixed in #33616

Could do with a test, so am reopening.

Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ git checkout 1c0cc62e30a3077476e97f8e7e6ba17b4ac754b6
Previous HEAD position was ad8ce0be9 CLN: Clean missing.py (#33631)
HEAD is now at 1c0cc62e3 REF: get .items out of BlockManager.apply (#33616)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ python setup.py build_ext -i -j 8
running build_ext
building 'pandas._libs.tslibs.nattype' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/nattype.c -o build/temp.linux-x86_64-3.8/pandas/_libs/tslibs/nattype.o -Werror
building 'pandas._libs.interval' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs -Ipandas/_libs/src/klib -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/interval.c -o build/temp.linux-x86_64-3.8/pandas/_libs/interval.o -Werror
pandas/_libs/tslibs/nattype.c:5108:18: error: ‘__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__’ defined but not used [-Werror=unused-function]
 5108 | static PyObject *__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_other) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandas/_libs/interval.c:8278:18: error: ‘__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__’ defined but not used [-Werror=unused-function]
 8278 | static PyObject *__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_y) {
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
cc1: all warnings being treated as errors
error: command 'gcc' failed with exit status 1

EDIT

@dsaxton I saw you've brought something similar up in the Gitter chat, were you able to resolve it?

jorisvandenbossche commented 4 years ago

@MarcoGorelli thanks for the analysis!

dsaxton commented 4 years ago

@MarcoGorelli I found that building instead with the command CFLAGS='-Wno-error=deprecated-declarations' python setup.py build_ext -i generally fixes things, although I'm not sure if it'll work in this case. There's a thread about these problems here: https://github.com/pandas-dev/pandas/issues/33315

simonjayhawkins commented 4 years ago

Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)

I've set up a workflow for bisecting. I didn't see that error, but I added || exit 125 to the runner script to skip failed builds.

https://github.com/simonjayhawkins/pandas/runs/1078479989?check_suite_focus=true

It agrees that #33616 fixed this.
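
For readers unfamiliar with the exit 125 convention: git bisect run skips a commit when the runner exits with code 125, treats 0 as the old state, and treats other codes from 1 to 127 as the new state. The following is a rough Python sketch of such a runner, not the actual workflow linked above. It assumes the bisect was started with the failing behaviour as the old term (for example git bisect start --term-old=broken --term-new=fixed), and repro.py is a hypothetical script that exits non-zero when the error is raised.

```python
import subprocess
import sys

# git bisect run treats exit code 125 as "cannot test this commit, skip it".
SKIP = 125


def bisect_exit_code(build_ok: bool, error_raised: bool) -> int:
    """Map one commit's outcome to an exit code for git bisect run.

    Assumes the old term marks commits where the error is still raised,
    because the search is for the commit that fixed the bug.
    """
    if not build_ok:
        return SKIP
    # 0 = old behaviour (still raises), 1 = new behaviour (fixed).
    return 0 if error_raised else 1


def main() -> None:
    # Build command copied from the build log earlier in this thread.
    build = subprocess.run(
        [sys.executable, "setup.py", "build_ext", "-i", "-j", "8"]
    )
    error_raised = False
    if build.returncode == 0:
        # repro.py is hypothetical: it should exit non-zero when
        # "cannot reindex from a duplicate axis" is raised.
        repro = subprocess.run([sys.executable, "repro.py"])
        error_raised = repro.returncode != 0
    sys.exit(bisect_exit_code(build.returncode == 0, error_raised))
```

To use it, call main() at the bottom of the script and pass the script to git bisect run. Commits that fail to compile are then skipped instead of being misclassified.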

MarcoGorelli commented 4 years ago

I've set up a workflow for bisecting.

wow, nice!!

GabrielSimonetto commented 4 years ago

take

GabrielSimonetto commented 4 years ago

Ok, I'm stuck.

After investigating PR 33616, I see that two files were changed: pandas/core/generic.py and pandas/core/internals/managers.py. Although they seem tightly related, and although generic.py directly reindexes things, after setting some breakpoints with the following code I noticed that the generic.py portion is never called.

import pandas as pd

df = pd.DataFrame([[1,2,5,6],
                    [3,4,7,8]])
df.columns=[0,1,0,1]
df[df>5]

Besides that, by grepping I found that the exception mentioned in this issue is only raised in the function _can_reindex(), and this function is only used in reindex_indexer(), which should make it easy to debug how the error happens

(venv) [bigode@coala pandas]$ grep -r _can_reindex
core/indexes/base.py:    def _can_reindex(self, indexer):
core/internals/managers.py:            self.axes[axis]._can_reindex(indexer)

The problem is that, after setting breakpoints in both functions, neither is ever called during this operation! That means the fix in pandas/core/internals/managers.py made the code avoid a section it should never have entered. This is supported by the comments @jreblack inserted:

# The caller is responsible for ensuring that
#  obj.axes[-1].equals(self.items)

I was already a bit stuck on which should be the specific test before...

(I was rehearsing something with pandas.core.internals.managers.BlockManager.{reindex_indexer, reindex_axis}, but I could not confirm they are being used since the only entrypoint I could confirm was the aforementioned internals.managers.apply(), and actually, inserting a breakpoint on reindex_indexer and reindex_axis didn't work on the test code. Which makes me think they are not being called, as absurd as that sounds),

...but now I'm completely lost. If someone could shed some light on the issue that would be awesome. Besides that, if I have some spare time I will try to use a pandas version prior to PR 33616 to see if I can pinpoint what exact interaction fixed this issue.

Dr-Irv commented 4 years ago

@GabrielSimonetto To address this issue, you only need to add a test that demonstrates that the bug was fixed. Don't worry about the internals. What happened here is that I saw the issue was fixed, and closed it, then @MarcoGorelli wanted to figure out where it was fixed, and we reopened it deciding we just needed a test to make sure that the issue is truly addressed.

MarcoGorelli commented 4 years ago

Yup :smile: @GabrielSimonetto if you wanted to submit a test to make sure this doesn't break again in the future, that would be welcome!

GabrielSimonetto commented 4 years ago

@Dr-Irv do you know which module would be the right place for this test? If I understood correctly, just a high-level check will be enough?

MarcoGorelli commented 4 years ago

@GabrielSimonetto You can use the example provided in https://github.com/pandas-dev/pandas/issues/31954#issuecomment-585940827 as a test

If you open a pull request you can put it in where you think a sensible location is and if necessary we'll ask you to put it somewhere else
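
A high-level regression test along these lines might look like the sketch below. It is not the test that was eventually merged; the function name is hypothetical, and it assumes a pandas version where the fix is present (1.1.1 is reported above as working). The expected values follow the "Expected Output" in the original post.

```python
import numpy as np
import pandas as pd


def test_mask_duplicate_columns_mixed_dtypes():
    # Same data as the example in the original post.
    df_a = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    df_b = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df_new = pd.concat([df_a, df_b], axis=1)

    # On affected versions this line raised
    # "cannot reindex from a duplicate axis".
    result = df_new[df_new > 5]

    expected = np.array(
        [
            [np.nan, np.nan, np.nan, 6.0],
            [np.nan, np.nan, 7.0, 8.0],
        ]
    )
    # The duplicated labels must survive the selection.
    assert list(result.columns) == [0, 1, 0, 1]
    # assert_array_equal treats NaNs in the same position as equal.
    np.testing.assert_array_equal(result.to_numpy(dtype=float), expected)
```

Comparing the underlying array keeps the check independent of how a given pandas version reports the dtypes of the masked columns.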

GabrielSimonetto commented 4 years ago

Great @MarcoGorelli! I'm on it, thanks!
