pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DataFrame.apply behavior changed between 0.20.0 and 0.20.3 #20625

Closed dwjang closed 6 years ago

dwjang commented 6 years ago
import pandas as pd
from pyspark.ml.linalg import Vectors

df = pd.DataFrame({'A': [1,2,3,4], 
                   'B': [1,2,3,4], 
                   'C': [1,2,3,4],
                   'D': [1,2,3,4]},
                   index=[0, 1, 2, 3])
df.apply(lambda x: pd.Series(Vectors.dense([x["A"], x["B"]])), axis=1)

Problem description

In pandas 0.20.0 this produces:

            0
0  [1.0, 1.0]
1  [2.0, 2.0]
2  [3.0, 3.0]
3  [4.0, 4.0]

but in pandas 0.20.3 the result is different:

     0    1
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0
3  4.0  4.0

How can I get the first behavior in 0.20.3? I am not saying the current behavior is worse than the old one, but I don't see the change described in "What's New". I need the old behavior in order to work with DenseVector in PySpark MLlib. Thanks.

jreback commented 6 years ago

This would need an example that uses only pandas. FYI, this behavior is revamped in the forthcoming 0.23.

dwjang commented 6 years ago

Is it achievable in 0.20.3?

jreback commented 6 years ago

You can do this in 0.23.0.

On prior versions, don't use a pd.Series wrapper.

In [13]: df = pd.DataFrame({'A': [1,2,3,4], 
    ...:                    'B': [1,2,3,4], 
    ...:                    'C': [1,2,3,4],
    ...:                    'D': [1,2,3,4]},
    ...:                    index=[0, 1, 2, 3])
    ...: df.apply(lambda x: [x["A"], x["B"]], axis=1, result_type='reduce')
Out[13]: 
0    [1, 1]
1    [2, 2]
2    [3, 3]
3    [4, 4]
dtype: object

dwjang commented 6 years ago

I need to preserve the "DenseVector" object, which is the required input type for MLlib. Your prescription won't work.
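
One possible way to keep one object per row without depending on how apply expands its results is to build the column outside apply. The sketch below is only an illustration: `DenseVec` is a hypothetical stand-in for pyspark's DenseVector (so the example runs with pandas and NumPy alone), and whether the same approach carries over unchanged to the real DenseVector on 0.20.3 is an assumption that was not verified in this thread.

```python
import numpy as np
import pandas as pd


class DenseVec:
    """Hypothetical stand-in for pyspark.ml.linalg.DenseVector (sequence-like)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        return self.values[i]

    def __repr__(self):
        return "DenseVec({})".format(self.values)


df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [1, 2, 3, 4],
                   'C': [1, 2, 3, 4],
                   'D': [1, 2, 3, 4]},
                  index=[0, 1, 2, 3])

# Pre-allocate a 1-D object array and fill it cell by cell, so neither
# NumPy nor pandas gets a chance to unpack the sequence-like objects.
cells = np.empty(len(df), dtype=object)
for i, (a, b) in enumerate(zip(df['A'], df['B'])):
    cells[i] = DenseVec([a, b])

# Each row of the resulting Series holds one intact DenseVec object.
vectors = pd.Series(cells, index=df.index)
```

With pyspark installed, `DenseVec([a, b])` would presumably be replaced by `Vectors.dense([a, b])`; the loop avoids the `pd.Series(...)` wrapper inside apply, which is where the behavior differed between the two versions.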