Open ghost opened 4 years ago
Please make a reproducible example.
Hi @jreback , I have tried all kinds of smaller examples, but unfortunately I could never reproduce the error. I also tried saving the calculations as *.pkl files; however, after loading, these are concatenated correctly without any problems. I'm sorry, but maybe someone has an idea of what it might be. It seems to occur after the calculation from a sklearn CountVectorizer, which I do like this:
```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def sklearn_count_vectorizer(df_data, column, **kwargs):
    """Apply the CountVectorizer function from the sklearn library.

    :param df_data: pandas.DataFrame to which the function is to be applied.
    :param column: Column name to which the function is to be applied.
    :return: A pandas.DataFrame with the names of the counted words.
    """
    analyzer = kwargs.get('analyzer', None)
    df_tmp = df_data.copy()
    if analyzer is not None:
        # Treat the passed object as a stemmer and wrap the default
        # analyzer so every token is stemmed first.
        stemmer = analyzer

        def stemmed_words(doc):
            return (stemmer.stem(w) for w in analyzer(doc))

        kwargs['analyzer'] = stemmed_words
        analyzer = CountVectorizer().build_analyzer()
    count_vectorizer = CountVectorizer(**kwargs)
    data_transformed = count_vectorizer.fit_transform(df_tmp[column])
    df_result = pd.DataFrame(data=data_transformed.toarray(),
                             columns=count_vectorizer.get_feature_names())
    df_data_result = pd.concat([df_tmp, df_result], axis=1, join='inner')
    return df_data_result
```
If you don't have a reproducible example, then this issue will be closed.
Unfortunately I cannot find one at the moment.
@Foxly-beep you might already have done this, but here is what I would try: starting from your original data that gives the error, reduce the number of rows/columns stepwise and see at which point it no longer fails. That point might also be interesting: is it a certain number of rows/columns that triggers the error, or did you remove a certain kind (e.g. dtype) of rows/columns in that step?
@jorisvandenbossche Thank you very much. I found the mistake. Here is an example:
```python
import pandas as pd
from feature_engineering import sklearn_count_vectorizer

df3 = pd.DataFrame([['He believes that CoESS and UNI-Europe also have to include individual chambers and trade unions at the national level in'],
                    ['f different models of democracy; that we have different rules on electing representatives to the parliaments; that the United Kingdom\'s political system differs from the Czech; that financing political parties in Germany is different from that in Sweden etc.']],
                   columns=['text'],
                   index=[1, 2]).reset_index()
df4 = pd.DataFrame([['It is my expectation that the critical examination underway between the EU institutions, national organizations and non-governmental organizations will produce an actionable Commission proposal by the beginning of 2011 for a comprehensive instrument providing common minimum standards for international solidarity with the victims of terrorism, including measures to address the compensation to EU citizens who suffer a terrorist attack outside of the EU. '],
                    ['As opposed to the Church of Scientology, Index we don\'t consider Scientology a religion, but rather a philosophy of cognition which enables man to find answers to questions that of course reach into the realm of religion, such as, Where do I come from']],
                   columns=['text'],
                   index=[3, 4]).reset_index()

sklearn_count_result_1 = sklearn_count_vectorizer(df3,
                                                  column='text',
                                                  stop_words='english')
sklearn_count_result_2 = sklearn_count_vectorizer(df4,
                                                  column='text',
                                                  stop_words='english')
pd.concat([sklearn_count_result_1, sklearn_count_result_2])
```
using the method I posted above. Thank you very much!
By the way: These are completely random texts in the columns!
The error can be eliminated by setting drop=True in reset_index().
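A minimal pandas-only sketch of that fix (the frames here are illustrative, not the original data): reset_index() without drop=True materializes the old index as a new 'index' column, which can later collide with a vectorizer feature that happens to be named 'index'; drop=True discards the old index instead.

```python
import pandas as pd

# Illustrative frames, not the original data.
kept = pd.DataFrame({'text': ['a', 'b']}, index=[1, 2]).reset_index()
dropped = pd.DataFrame({'text': ['c', 'd']}, index=[3, 4]).reset_index(drop=True)

print(kept.columns.tolist())     # ['index', 'text'] -> possible duplicate later
print(dropped.columns.tolist())  # ['text'] -> no 'index' column to collide with
```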
@Foxly-beep thanks, I can reproduce that now. Looking into the dataframes in question, I think it is caused by one of the words also being "index", thus leading to a duplicate column (and this is then apparently a buggy case).
From that observation, trying to create a smaller, pandas-only reproducible example:
```python
df1 = pd.DataFrame([[1, 'some text', 0, 0, 0], [2, 'more text', 0, 0, 0]],
                   columns=['index', 'text', 'word1', 'word3', 'word4'])
df2 = pd.DataFrame([[3, 'some text', 0, 0, 0], [4, 'more text', 0, 0, 0]],
                   columns=['index', 'text', 'word2', 'word3', 'index'])

In [56]: pd.concat([df1, df2])
...
AttributeError: 'NoneType' object has no attribute 'is_extension'
```
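As a quick diagnostic (a sketch added for illustration, not part of the original report), Index.duplicated confirms which label is repeated in a frame like df2 above:

```python
import pandas as pd

df2 = pd.DataFrame([[3, 'some text', 0, 0, 0], [4, 'more text', 0, 0, 0]],
                   columns=['index', 'text', 'word2', 'word3', 'index'])

# Index.duplicated flags the second and later occurrences of a label.
dupes = df2.columns[df2.columns.duplicated()].tolist()
print(dupes)  # ['index']
```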
An even simpler example (without the object column in between):
```python
df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'D'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'C', 'A'])
pd.concat([df1, df2])
```
also raises "AttributeError: 'NoneType' object has no attribute 'is_extension'"
With some other alignment (e.g. when df2 has the existing column B instead of a new column C), you can also get a different error:
```python
df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'D'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'A'])
pd.concat([df1, df2])
```
which gives "ValueError: Plan shapes are not aligned"
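Until this is fixed, one pandas-only workaround (a sketch; `dedupe_columns` is a hypothetical helper written here, not a pandas API) is to make the column labels unique before concatenating, so the ordinary outer alignment applies:

```python
import numpy as np
import pandas as pd


def dedupe_columns(df):
    """Suffix repeated column labels so concat can align them (A, B, A -> A, B, A.1)."""
    seen = {}
    new_cols = []
    for col in df.columns:
        if col in seen:
            seen[col] += 1
            new_cols.append(f"{col}.{seen[col]}")
        else:
            seen[col] = 0
            new_cols.append(col)
    out = df.copy()
    out.columns = new_cols
    return out


df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'D'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'A'])
result = pd.concat([dedupe_columns(df1), dedupe_columns(df2)], sort=False)
print(result.columns.tolist())  # ['A', 'B', 'D', 'A.1']
```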
@jorisvandenbossche thank you! Yeah, the double column thing makes sense. Thanks!
@jorisvandenbossche Just to check whether I understood what the desired output is.
For the first example you gave in your latest post, would you want the output to be something like this:
| A | B | D | C | A |
|---|---|---|---|---|
| 0.388762 | 0.710653 | 0.474394 | NaN | NaN |
| 0.345786 | -0.976774 | 0.533432 | NaN | NaN |
| -0.108304 | NaN | NaN | -0.688178 | -0.194421 |
| -0.587845 | NaN | NaN | 0.176223 | 0.159187 |
and for the second example, something like this:
| A | B | D | A |
|---|---|---|---|
| 0.388762 | 0.710653 | 0.474394 | NaN |
| 0.345786 | -0.976774 | 0.533432 | NaN |
| 2.481166 | -1.395860 | NaN | 1.105540 |
| -0.586868 | 0.803194 | NaN | 0.416715 |
I have exactly the same problem when some columns are contained in both dataframes being concatenated.
@jorisvandenbossche Hi, I'd like to have a try at fixing this bug. I want to figure out what the expected result of the example you gave above is. And what about the one below:
```python
df1 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'B', 'A'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['A', 'C', 'A'])
pd.concat([df1, df2])
```
- [x] I have searched the [pandas] tag on StackOverflow for similar questions.
- [x] I have asked my usage-related question on StackOverflow.
Question about pandas
Hi, I have a persistent problem with concatenating multiple DataFrames with shapes:
If I want to concatenate this, or even any part of it with
I get the following error:
I found out in my research that blocks in the join_units are sometimes None, but I don't understand why this is so. None of the table entries in my DataFrames are None/NaN. Unfortunately I can't post the data here, because it is very extensive. Maybe it helps to know that I split the rows of my original dataframe for multiprocessing; afterwards I concatenate them again, see above. Thanks a lot!
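For context, the split-then-concatenate round trip described above can be sketched like this (illustrative data, not the original; with unique column labels the round trip is lossless):

```python
import pandas as pd

df = pd.DataFrame({'text': list('abcdef')})

# Split the rows into chunks (e.g. one per worker process) ...
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]
# ... and re-assemble them afterwards.
roundtrip = pd.concat(chunks)

print(roundtrip.equals(df))  # True
```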