pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: concat along the index (axis=0) of two dataframes with duplicate column name fails #35240

Open ghost opened 4 years ago

ghost commented 4 years ago

Question about pandas

Hi, I have a persistent problem with concatenating multiple DataFrames with shapes:

  1. (48, 5674)
  2. (48, 9022)
  3. (48, 7340)
  4. (47, 6539)
  5. (47, 10369)
  6. (47, 17242)
  7. (47, 19248)
  8. (47, 14282)

If I try to concatenate these, or even any subset of them, with

pd.concat(df_list)

I get the following error:

Traceback (most recent call last):
  File "E:/OneDrive/Informatik Studium/KIT Master/SS20/AGD Praktikum/phase-2/1_code/MyTest.py", line 46, in <module>
    df_result = __parallelize_dataframe(func=apply_functions, df_data=df_train.copy(), config_tupels=config_tupels)
  File "E:/OneDrive/Informatik Studium/KIT Master/SS20/AGD Praktikum/phase-2/1_code/MyTest.py", line 22, in __parallelize_dataframe
    df_pool_result = pd.concat(pool_result[0:2])
  File "E:\venv\lib\site-packages\pandas\core\reshape\concat.py", line 284, in concat
    return op.get_result()
  File "E:\venv\lib\site-packages\pandas\core\reshape\concat.py", line 497, in get_result
    mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy
  File "E:\venv\lib\site-packages\pandas\core\internals\managers.py", line 2016, in concatenate_block_managers
    elif is_uniform_join_units(join_units):
  File "E:\venv\lib\site-packages\pandas\core\internals\concat.py", line 388, in is_uniform_join_units
    all(not ju.is_na or ju.block.is_extension for ju in join_units)
  File "E:\venv\lib\site-packages\pandas\core\internals\concat.py", line 388, in <genexpr>
    all(not ju.is_na or ju.block.is_extension for ju in join_units)
AttributeError: 'NoneType' object has no attribute 'is_extension'

In my research I found that the blocks in the join_units are sometimes None, but I don't understand why. None of the entries in my DataFrames are None/NaN. Unfortunately I can't post the data here, because it is very extensive. Maybe it helps to know that I split the rows of my original dataframe for multiprocessing and afterwards concatenate them again, see above. Thanks a lot!
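The split-then-recombine workflow described here can be sketched roughly like this (a minimal, hypothetical reconstruction — the real worker function, chunk count, and data are unknown):

```python
import pandas as pd

# Illustrative stand-in data; the real frames have thousands of columns.
df = pd.DataFrame({"a": range(10), "b": range(10)})

# Split the rows into chunks for the worker processes.
n_chunks = 4
size = -(-len(df) // n_chunks)  # ceiling division
chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]

# Stand-in for the real per-chunk processing function.
processed = [chunk * 2 for chunk in chunks]

# Recombine along the index (axis=0), as in the issue.
result = pd.concat(processed)

assert result.shape == df.shape
assert list(result.index) == list(df.index)
```

When every chunk keeps the same columns, this round trip is lossless; the failure in this issue only appears once the chunks end up with duplicate column labels.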

jreback commented 4 years ago

please make a reproducible example

ghost commented 4 years ago

Hi @jreback, I have tried all kinds of smaller examples, but unfortunately I could never reproduce the error. I also tried saving the intermediate results as *.pkl files; however, after loading, these concatenate correctly without any problems. I'm sorry, but maybe someone has an idea of what it might be. It seems to occur after a calculation with sklearn's CountVectorizer, which I do like this:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


def sklearn_count_vectorizer(df_data, column, **kwargs):
    """Apply the CountVectorizer from the sklearn library.

    :param df_data: pandas.DataFrame to which the function is to be applied.
    :param column: Name of the column to which the function is to be applied.

    :return: A pandas.DataFrame extended with one count column per word.
    """
    df_tmp = df_data.copy()

    # If a stemmer was passed via the 'analyzer' kwarg, wrap the default
    # analyzer so that every token is stemmed before counting.
    stemmer = kwargs.get('analyzer', None)
    if stemmer is not None:
        base_analyzer = CountVectorizer().build_analyzer()

        def stemmed_words(doc):
            return (stemmer.stem(w) for w in base_analyzer(doc))

        kwargs['analyzer'] = stemmed_words

    count_vectorizer = CountVectorizer(**kwargs)
    data_transformed = count_vectorizer.fit_transform(df_tmp[column])
    df_result = pd.DataFrame(data=data_transformed.toarray(),
                             columns=count_vectorizer.get_feature_names())
    df_data_result = pd.concat([df_tmp, df_result], axis=1, join='inner')
    return df_data_result
jreback commented 4 years ago

if you don't have a reproducible example then this issue will be closed

ghost commented 4 years ago

Unfortunately I cannot find one at the moment.

jorisvandenbossche commented 4 years ago

@Foxly-beep you might already have done this, but here is what I would try: starting from the original data that gives the error, reduce the number of rows/columns stepwise and see at which point it no longer fails. That point might also be interesting: is it a certain number of rows/columns that triggers the error, or did you remove a certain kind (e.g. dtype) of rows/columns in that step?

ghost commented 4 years ago

@jorisvandenbossche Thank you very much. I found the mistake. Here is an example:

import pandas as pd
from feature_engineering import sklearn_count_vectorizer

df3 = pd.DataFrame([['He believes that CoESS and UNI-Europe also have to include individual chambers and trade unions at the national level in'],
                    ['f different models of democracy; that we have different rules on electing representatives to the parliaments; that the United Kingdom\'s political system differs from the Czech; that financing political parties in Germany is different from that in Sweden etc.']],
                   columns=['text'],
                   index=[1,2]).reset_index()

df4 = pd.DataFrame([['It is my expectation that the critical examination underway between the EU institutions, national organizations and non-governmental organizations will produce an actionable Commission proposal by the beginning of 2011 for a comprehensive instrument providing common minimum standards for international solidarity with the victims of terrorism, including measures to address the compensation to EU citizens who suffer a terrorist attack outside of the EU. '],
                    ['As opposed to the Church of Scientology, Index we don\'t consider Scientology a religion, but rather a philosophy of cognition which enables man to find answers to questions that of course reach into the realm of religion, such as, Where do I come from']],
                   columns=['text'],
                   index=[3,4]).reset_index()

sklearn_count_result_1 = sklearn_count_vectorizer(df3,
                                                  column='text',
                                                  stop_words='english')
sklearn_count_result_2 = sklearn_count_vectorizer(df4,
                                                  column='text',
                                                  stop_words='english')

pd.concat([sklearn_count_result_1, sklearn_count_result_2])

using the method I posted above. Thank you very much!

By the way: These are completely random texts in the columns!

ghost commented 4 years ago

The error can be eliminated by passing drop=True to reset_index().
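A minimal illustration of why that helps: reset_index() materializes the old index as a column literally named "index", which can later collide with a CountVectorizer feature of the same name, producing the duplicate column labels that trigger the bug.

```python
import pandas as pd

df = pd.DataFrame({"text": ["some text", "more text"]}, index=[1, 2])

# reset_index() keeps the old index as a new column named "index" ...
with_col = df.reset_index()
assert list(with_col.columns) == ["index", "text"]

# ... while reset_index(drop=True) discards it, so no column named
# "index" is around to collide with a vectorizer feature of that name.
no_col = df.reset_index(drop=True)
assert list(no_col.columns) == ["text"]
```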

jorisvandenbossche commented 4 years ago

@Foxly-beep thanks, I can reproduce that now. Looking into the dataframes in question, I think it is caused by one of the words also being "index", and thus leading to a duplicate column (and this is then apparently a buggy case).

From that observation, trying to create a smaller, pandas-only reproducible example:

df1 = pd.DataFrame([[1, 'some text', 0, 0, 0], [2, 'more text', 0, 0, 0]], 
                   columns=['index', 'text', 'word1', 'word3', 'word4'])
df2 = pd.DataFrame([[3, 'some text', 0, 0, 0], [4, 'more text', 0, 0, 0]],
                   columns=['index', 'text', 'word2', 'word3', 'index']) 

In [56]: pd.concat([df1, df2])                                                                                                                                                                                     
...
AttributeError: 'NoneType' object has no attribute 'is_extension'
jorisvandenbossche commented 4 years ago

An even simpler example (without the object column in between):

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'D'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'C', 'A'])
pd.concat([df1, df2])

also raises "AttributeError: 'NoneType' object has no attribute 'is_extension'"

With some other alignment (e.g. df2 now has an existing column B instead of a new column C), you can also get a different error:

df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'D']) 
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'A'])  
pd.concat([df1, df2])

which gives "ValueError: Plan shapes are not aligned"
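Until the underlying bug is fixed, one possible workaround (a sketch of my own, not part of pandas) is to make each frame's column labels unique before concatenating, so the duplicate-label code path is never hit:

```python
import numpy as np
import pandas as pd

def dedup_columns(df):
    """Return a copy of df with repeated column labels suffixed (.1, .2, ...)."""
    counts = {}
    new_cols = []
    for col in df.columns:
        n = counts.get(col, 0)
        new_cols.append(col if n == 0 else f"{col}.{n}")
        counts[col] = n + 1
    out = df.copy()
    out.columns = new_cols
    return out

df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'D'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'C', 'A'])

# df2's second 'A' becomes 'A.1', so the concat sees only unique labels.
result = pd.concat([dedup_columns(df1), dedup_columns(df2)])

assert result.shape == (4, 5)  # columns A, B, D, C, A.1
assert result.columns.is_unique
```

The downside is that the renamed columns no longer align with their same-named counterparts, so this only makes sense when the duplicates really are distinct features that happen to share a name (as with the "index" column here).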

ghost commented 4 years ago

@jorisvandenbossche thank you! Yeah, the double column thing makes sense. Thanks!

gimseng commented 4 years ago

@jorisvandenbossche Just to check that I understood what the desired output is.

In the first example you gave in the latest post, you'd want the output to be something like this:

          A         B         D         C         A
0  0.388762  0.710653  0.474394       NaN       NaN
1  0.345786 -0.976774  0.533432       NaN       NaN
0 -0.108304       NaN       NaN -0.688178 -0.194421
1 -0.587845       NaN       NaN  0.176223  0.159187

and for the second example, something like this:

          A         B         D         A
0  0.388762  0.710653  0.474394       NaN
1  0.345786 -0.976774  0.533432       NaN
0  2.481166 -1.395860       NaN  1.105540
1 -0.586868  0.803194       NaN  0.416715
Lip651 commented 4 years ago

I have exactly the same problem when some columns are contained in both dataframes being concatenated.
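For anyone trying to diagnose this: columns shared *between* frames are handled fine by an axis=0 outer concat; it is repeated labels *within* a single frame that hit this bug. A quick check using pandas' Index.duplicated:

```python
import pandas as pd

# A frame with a within-frame duplicate label, like the reproducers above.
df = pd.DataFrame([[1, 2, 3]], columns=["A", "C", "A"])

# Index.duplicated flags repeated labels; .any() tells you whether this
# frame carries the duplicate columns that trigger the concat error.
assert df.columns.duplicated().tolist() == [False, False, True]
assert df.columns.duplicated().any()
```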

onshek commented 4 years ago

@jorisvandenbossche Hi, I'd like to have a try at fixing this bug. I want to figure out what the expected result of the example you gave above is. And what about the one below:

df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'A'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'C', 'A'])
pd.concat([df1, df2])