Closed igorluppi closed 3 years ago
Moreover, doing this:
In [177]: new_df = df.reset_index(drop=True)
In [178]: new_df[new_df > 10]
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
So, it's 100% sure that we have no duplicates here, so what is going on?
Thanks @igorluppi
I just tried
df = pd.DataFrame(np.random.randn(150001, 792))
df[df>10]
and got no error - could you give us some more details about your dataframe? Do you still get the error if you only consider its head, or if you only use (say) its first 5 columns?
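One way to follow that suggestion is to wrap the selection in a small check and run it on progressively smaller slices. This is only a sketch: `mask_works` is a hypothetical helper name, and the random frame stands in for the real data, which is not shown in this thread.

```python
import numpy as np
import pandas as pd


def mask_works(frame, threshold=10):
    """Return True if frame[frame > threshold] runs without a ValueError."""
    try:
        frame[frame > threshold]
    except ValueError:
        return False
    return True


# Stand-in data; replace with the DataFrame that actually fails.
df = pd.DataFrame(np.random.randn(1000, 20))

head_ok = mask_works(df.head())              # only the first rows
first_cols_ok = mask_works(df.iloc[:, :5])   # only the first 5 columns
print(head_ok, first_cols_ok)
```

If the check passes on the head but fails on a column subset (or the reverse), that hints at which axis is responsible.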
I have many dataframes, and I put all of them into a single one:
let df_items be a list of dataframes.
I got this error using:
df_final = pandas.concat(df_items, axis = 1)
However I verified that
df_final = reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)
works fine: I get the same resulting DataFrame, and df_final[df_final>10] works on it. But this method takes a long time to run; concat is at least 10 times faster.
Thanks to https://stackoverflow.com/questions/45885043/pandas-concat-cannot-reindex-from-a-duplicate-axis?rq=1 for this possible solution. But why does the error happen?
Applying df[df>10] I got "cannot reindex from duplicate axis",
I got this error using: df_final = pandas.concat(df_items, axis = 1)
Sorry, I'm a bit confused, which command gave you the error: pd.concat or df[df>10]?
pd.concat gave me df_final, and this df_final raised that error when I used df_final[df_final>10].
The interesting part is that when I use the reduce method and get a df_final2, this one works; in other words, df_final2[df_final2>10] works fine.
Moreover,
In [14]: df_final.equals(df_final2)
Out[14]: False
But I didn't find where the difference is.
@MarcoGorelli I found why the problem is happening, but it points to another problem regarding the exception I got. Give me a second.
@MarcoGorelli "cannot reindex from duplicate axis" should be split into two messages: "cannot reindex from duplicate index" and "cannot reindex from duplicate columns". I will explain why.
Why is that? Because all the messages and solutions I found told me to look at the index, but in my case the duplicates were in the columns.
But why did the second case work? reduce(lambda x, y: pandas.merge(x, y, left_index=True, right_index=True, how='outer'), df_items)
In this case, when merge finds a duplicated column, it automatically appends the string "_x" to the duplicated name, so it becomes "duplicated_column_x".
That is not the case for concat, which keeps the duplicated column name "duplicated_column".
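The difference between the two approaches can be seen on a pair of tiny frames. This is a minimal sketch with made-up frames and column names ('a', 'b'); it only shows how each function labels overlapping columns.

```python
import pandas as pd

# Two small frames that share the same column names (illustrative data).
left = pd.DataFrame({"a": [1, 3], "b": [2, 4]})
right = pd.DataFrame({"a": [0.5, 7.0], "b": [6.0, 8.0]})

# concat keeps the column labels as they are, so they end up duplicated.
concatenated = pd.concat([left, right], axis=1)

# merge on the index renames overlapping columns with its default
# suffixes, "_x" for the left frame and "_y" for the right frame.
merged = pd.merge(left, right, left_index=True, right_index=True, how="outer")

print(list(concatenated.columns), concatenated.columns.is_unique)
print(list(merged.columns), merged.columns.is_unique)
```

So the merged frame never has duplicated column labels, which is why the selection worked on it.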
My suggestion
Please change the exception so that it says specifically whether the problem is in the columns or in the index. Just saying "duplicate axis" made it confusing to find the solution.
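Until the message is more specific, both axes can be checked by hand. The helper below is a hypothetical sketch (`duplicate_axis_report` is not a pandas function); it relies only on `Index.duplicated()`.

```python
import pandas as pd


def duplicate_axis_report(frame):
    """Return the duplicated labels found on each axis of `frame`."""
    return {
        "index": frame.index[frame.index.duplicated()].unique().tolist(),
        "columns": frame.columns[frame.columns.duplicated()].unique().tolist(),
    }


# Same shape of data as in this thread: default column labels 0 and 1 twice.
left = pd.DataFrame([[1, 2], [3, 4]])
right = pd.DataFrame([[0.5, 6], [7, 8]])
combined = pd.concat([left, right], axis=1)

report = duplicate_axis_report(combined)
print(report)
```

An empty list for "index" together with a non-empty list for "columns" is the situation described above.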
Thanks @igorluppi
tbh I still can't reproduce the error:
df = pd.DataFrame([0, 1], columns=['a'])
new_df = pd.concat([df, df], axis=1)
new_df[new_df>0] # works
could you try coming up with a minimal reproducible example?
Ok, I will create a simple example
@MarcoGorelli
import pandas
import numpy as np
a = np.array([[1,2],[3,4]])
# DOES NOT WORK
b = np.array([[0.5,6],[7,8]])
# OR
# b = np.array([[.5,6],[7,8]])
# This one works fine:
# b = np.array([[5,6],[7,8]])
dfA = pandas.DataFrame(a)
# This works fine EVEN using .5, because the column names are different
# dfA = pandas.DataFrame(a, columns=['a','b'])
dfB = pandas.DataFrame(b)
df_new = pandas.concat([dfA, dfB], axis = 1)
df_new[df_new>3]
~/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
Basically, using .5 or 0.5 in the numpy array breaks the dataframe operation. This might be a problem with pandas + numpy.
The interesting part is that numpy float values only break the code when the column names are duplicated.
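On a pandas version where the error occurs, one possible workaround is to make the column labels unique before masking. This is a sketch, not something proposed in the thread: `make_unique` is a hypothetical helper, and the "label_counter" naming scheme is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def make_unique(labels):
    """Append an occurrence counter to every label: 0, 1, 0 -> 0_0, 1_0, 0_1."""
    seen = {}
    unique = []
    for label in labels:
        count = seen.get(label, 0)
        unique.append(f"{label}_{count}")
        seen[label] = count + 1
    return unique


dfA = pd.DataFrame(np.array([[1, 2], [3, 4]]))
dfB = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
df_new = pd.concat([dfA, dfB], axis=1)

# Relabel the columns so that no label is repeated, then select.
df_new.columns = make_unique(df_new.columns)
masked = df_new[df_new > 3]
print(masked)
```

The original labels can be recovered afterwards by stripping the counter, if they are needed.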
@igorluppi great, thanks! Could you edit this example into the original post?
@MarcoGorelli
For sure, it's done my friend!
@MarcoGorelli is it a bug ? Anything new ?
cc @jorisvandenbossche
should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli
should someone from numpy-dev look at this? @jorisvandenbossche @MarcoGorelli
I don't think so - I presume the core team is prioritising what'll be in the v1.0.2 release. I'm working on another issue at the moment but I plan to get back to this
Any news ? @MarcoGorelli @jorisvandenbossche
I've not (yet) looked into this more, but you're welcome to submit a pull request if you like https://pandas.pydata.org/pandas-docs/stable/development/contributing.html
This works fine in pandas 1.1.1
This works fine in pandas 1.1.1
Any idea when it was fixed? It's probably good to make sure this was intentional and that there's a test for it...I'll do a git bisect
If I've done git bisect correctly (which I'm not sure I have, see below) it looks like this was fixed in #33616
Could do with a test, so am reopening.
Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ git checkout 1c0cc62e30a3077476e97f8e7e6ba17b4ac754b6
Previous HEAD position was ad8ce0be9 CLN: Clean missing.py (#33631)
HEAD is now at 1c0cc62e3 REF: get .items out of BlockManager.apply (#33616)
(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev$ python setup.py build_ext -i -j 8
running build_ext
building 'pandas._libs.tslibs.nattype' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs/tslibs -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/tslibs/nattype.c -o build/temp.linux-x86_64-3.8/pandas/_libs/tslibs/nattype.o -Werror
building 'pandas._libs.interval' extension
gcc -pthread -B /home/marco/miniconda/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -I./pandas/_libs -Ipandas/_libs/src/klib -I/home/marco/miniconda/envs/pandas-dev/lib/python3.8/site-packages/numpy/core/include -I/home/marco/miniconda/envs/pandas-dev/include/python3.8 -c pandas/_libs/interval.c -o build/temp.linux-x86_64-3.8/pandas/_libs/interval.o -Werror
pandas/_libs/tslibs/nattype.c:5108:18: error: ‘__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__’ defined but not used [-Werror=unused-function]
5108 | static PyObject *__pyx_pw_6pandas_5_libs_6tslibs_7nattype_4_NaT_11__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_other) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandas/_libs/interval.c:8278:18: error: ‘__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__’ defined but not used [-Werror=unused-function]
8278 | static PyObject *__pyx_pw_6pandas_5_libs_8interval_8Interval_25__div__(PyObject *__pyx_v_self, PyObject *__pyx_v_y) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
cc1: all warnings being treated as errors
error: command 'gcc' failed with exit status 1
@dsaxton I saw you've brought something similar up in the Gitter chat, were you able to resolve it?
@MarcoGorelli thanks for the analysis!
@MarcoGorelli I found that building instead with the command CFLAGS='-Wno-error=deprecated-declarations' python setup.py build_ext -i generally fixes things, although I'm not sure if it'll work in this case. There's a thread about these problems here: https://github.com/pandas-dev/pandas/issues/33315
Why I'm not sure I've done git bisect correctly: there's a gcc error (@simonjayhawkins is this something you've come across?)
I've set up a workflow for bisecting. I didn't see that error, but added || exit 125 to the runner script to skip failed builds.
https://github.com/simonjayhawkins/pandas/runs/1078479989?check_suite_focus=true
agrees that #33616 fixed it.
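For anyone repeating the bisect locally, the same idea can be written as a Python runner for git bisect run. This is a sketch under assumptions, not the workflow linked above: it assumes the bisect was started with custom terms so that exit code 0 marks a commit as "old" (bug still present) and exit code 1 as "new" (bug fixed), and the function names are hypothetical. Exit code 125 tells git bisect run to skip a commit that cannot be tested.

```python
import subprocess
import sys

SKIP = 125  # git bisect run: this commit cannot be tested, skip it

# Reproduction from this thread, run in a fresh interpreter after the build.
REPRO = """
import numpy as np, pandas as pd
a = pd.DataFrame(np.array([[1, 2], [3, 4]]))
b = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
df = pd.concat([a, b], axis=1)
df[df > 3]
"""


def bisect_exit_code(build_ok, bug_present):
    """Map the outcome of one commit to an exit code for git bisect run."""
    if not build_ok:
        return SKIP
    # Assumption: 0 means "old" (still broken), 1 means "new" (fixed).
    return 0 if bug_present else 1


def main():
    """Build the extensions, run the reproduction, and return the exit code."""
    build = subprocess.run(
        [sys.executable, "setup.py", "build_ext", "-i", "-j", "8"]
    )
    if build.returncode != 0:
        return bisect_exit_code(build_ok=False, bug_present=False)
    repro = subprocess.run([sys.executable, "-c", REPRO])
    return bisect_exit_code(build_ok=True, bug_present=repro.returncode != 0)
```

To use it as a runner, the script would end with sys.exit(main()) and be passed to git bisect run from the root of the pandas checkout.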
I've set up a workflow for bisecting.
wow, nice!!
take
Ok, I'm stuck.
After investigating PR 33616, I see that 2 files were changed: pandas/core/generic.py and pandas/core/internals/managers.py. Although they seem tightly correlated, and although generic.py is directly reindexing stuff, after setting some breakpoints with the following code I've noted that the generic.py portion is not called.
df = pd.DataFrame([[1,2,5,6],
[3,4,7,8]])
df.columns=[0,1,0,1]
df[df>5]
Besides that, by grepping I found that the exception mentioned in this issue is only raised in the function _can_reindex(), and this function is only used in reindex_indexer(), which should make it easy to debug how the error happens:
(venv) [bigode@coala pandas]$ grep -r _can_reindex
core/indexes/base.py: def _can_reindex(self, indexer):
core/internals/managers.py: self.axes[axis]._can_reindex(indexer)
The problem is that, after setting breakpoints in both functions, they are never called during this operation! This means that the fix in pandas/core/internals/managers.py actively made the code avoid a section it should never have entered, which is supported by the comments @jreblack inserted:
# The caller is responsible for ensuring that
# obj.axes[-1].equals(self.items)
I was already a bit stuck on what the specific test should be before...
(I was trying something with pandas.core.internals.managers.BlockManager.{reindex_indexer, reindex_axis}, but I could not confirm they are being used, since the only entry point I could confirm was the aforementioned internals.managers.apply(). Actually, inserting a breakpoint in reindex_indexer and reindex_axis didn't trigger on the test code, which makes me think they are not being called, as absurd as that sounds.)
...but now I'm completely lost. If someone could shed some light on the issue that would be awesome. Besides that, if I have some spare time I will try to use a pandas version prior to PR 33616 to see if I can pinpoint what exact interaction fixed this issue.
@GabrielSimonetto To address this issue, you only need to add a test that demonstrates that the bug was fixed. Don't worry about the internals. What happened here is that I saw the issue was fixed, and closed it, then @MarcoGorelli wanted to figure out where it was fixed, and we reopened it deciding we just needed a test to make sure that the issue is truly addressed.
Yup :smile: @GabrielSimonetto if you wanted to submit a test to make sure this doesn't break again in the future, that would be welcome!
@Dr-Irv would you know where would be the right module to insert this test? If I understood correctly just a high level check will be enough?
@GabrielSimonetto You can use the example provided in https://github.com/pandas-dev/pandas/issues/31954#issuecomment-585940827 as a test
If you open a pull request you can put it in where you think a sensible location is and if necessary we'll ask you to put it somewhere else
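A high-level test along those lines might look like the sketch below. The test name is hypothetical and this is not necessarily the test that was eventually added to pandas; it reuses the reproduction from this thread and is expected to pass only on versions where the bug is fixed (1.1.1 or later, according to the reports above).

```python
import numpy as np
import pandas as pd


def test_boolean_mask_with_duplicate_columns():
    # Hypothetical test name; the frames mirror the reproduction above.
    df_a = pd.DataFrame(np.array([[1, 2], [3, 4]]))
    df_b = pd.DataFrame(np.array([[0.5, 6], [7, 8]]))
    df_new = pd.concat([df_a, df_b], axis=1)  # column labels are 0, 1, 0, 1

    # On affected versions this line raised
    # "ValueError: cannot reindex from a duplicate axis".
    result = df_new[df_new > 3]

    # Values that are not greater than 3 become NaN.
    expected = np.array(
        [[np.nan, np.nan, np.nan, 6.0],
         [np.nan, 4.0, 7.0, 8.0]]
    )
    assert list(result.columns) == [0, 1, 0, 1]
    np.testing.assert_array_equal(result.to_numpy(dtype=float), expected)
```

Comparing raw values with numpy keeps the sketch independent of how pandas chooses the result dtypes; a test inside the pandas test suite would more likely compare whole frames.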
Great @MarcoGorelli! I'm on it, thanks!
Code Sample
Problem description
There is a bug involving specific numpy values combined with duplicated DataFrame column names when a select operation such as df[df > 5] is used. An exception is thrown saying "cannot reindex from duplicate axis"; however, it should not be, because:
the index has no duplicates (df.index.is_unique is True)
the selection df_new[df_new > 5] is the same whether the array holds float or int numpy values, so it should not change the behavior of the code
However, the values in the numpy array DO change the behavior of the DataFrame selection if the DataFrame has duplicated column names.
Expected Output
Current Output
Output of pd.show_versions()