pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

str.cat should return categorical data for categorical caller #20845

Open h-vetinari opened 6 years ago

h-vetinari commented 6 years ago

The str.cat method of the .str accessor works for both Series and Index, and returns an object of the caller's type:

import pandas as pd

s = pd.Series(['a', 'b', 'a'])
t = pd.Index(['a', 'b', 'a'])
## all of the following return the same Series
s.str.cat(s)
s.str.cat(t)
s.str.cat(s.values)
s.str.cat(list(s))
# 0    aa
# 1    bb
# 2    aa
# dtype: object

## all of the following return the same Index
t.str.cat(s)
t.str.cat(t)
t.str.cat(s.values)
t.str.cat(list(s))
# Index(['aa', 'bb', 'aa'], dtype='object')

But a categorical caller loses its categorical dtype after str.cat, which is inconsistent, IMO:

sc = s.astype('category')
tc = pd.Index(['a', 'b', 'a'], dtype='category') # conversion does not work, see #20843
sc.str.cat(s)
# 0    aa
# 1    bb
# 2    aa
# dtype: object
## as opposed to:
sc.str.cat(s).astype('category')
# 0    aa
# 1    bb
# 2    aa
# dtype: category
# Categories (2, object): [aa, bb]
tc.str.cat(s) # crashes, see #20842

xref #20842 #20843

WillAyd commented 6 years ago

The return type here is documented (though the documentation could perhaps be improved):

https://pandas.pydata.org/pandas-docs/stable/categorical.html#string-and-datetime-accessors

FWIW I don't really see how you could return a Categorical after a concatenation and make guarantees about the returned metadata (ordering comes to mind here). IMO doing concat on a large array of values would in most cases generate a ton of unique values and defeat the purpose of a Categorical in the first place.
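
To make the concern about unique values concrete, here is a small illustration with made-up data (the variable names are purely illustrative). It assumes a categorical Series with four categories concatenated element-wise with a plain string Series; in this constructed worst case every combination occurs once, so the result has as many distinct values as rows.

```python
import itertools

import pandas as pd

# Made-up data: every ordered pair of four letters, one pair per row.
letters = list("abcd")
pairs = list(itertools.product(letters, letters))

left = pd.Series([p[0] for p in pairs], dtype="category")  # 4 categories
right = pd.Series([p[1] for p in pairs])                   # plain strings

# Element-wise concatenation; the result is not categorical.
combined = left.str.cat(right)

n_categories_before = len(left.cat.categories)
n_unique_after = combined.nunique()
print(n_categories_before, n_unique_after)
```

Real data will often be less extreme than this, so whether a categorical result pays off depends on how many distinct combinations actually occur.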

h-vetinari commented 6 years ago

@WillAyd Thanks for that reference in the docs (I had only seen it in individual docstrings). However, I don't think it's fair to assume what kind of data would result; I can imagine several cases where a categorical result would be sensible. I still find it worth considering, but at least there's an easy solution with .astype('category').
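
The .astype('category') workaround can be wrapped in a small helper. This is only a sketch: str_cat_keep_category is a hypothetical name, not part of pandas, and it assumes the caller is a Series and that others is always passed (without others, str.cat returns a single string rather than a Series).

```python
import pandas as pd


def str_cat_keep_category(caller, others, sep=""):
    """Hypothetical helper (not a pandas API): concatenate strings and
    re-cast the result to category if the caller was categorical."""
    result = caller.str.cat(others, sep=sep)
    if isinstance(caller.dtype, pd.CategoricalDtype):
        # The categories are inferred afresh from the concatenated values;
        # nothing about the caller's categories or ordering is carried over.
        result = result.astype("category")
    return result


s = pd.Series(["a", "b", "a"])
sc = s.astype("category")

from_categorical = str_cat_keep_category(sc, s)  # categorical result
from_plain = str_cat_keep_category(s, s)         # left non-categorical
```

Note that the re-cast builds a new, unordered set of categories, so this does not address the question of what ordering metadata a categorical result should carry.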