pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.66k stars 17.91k forks

ENH: Warning with only one column used in `apply` causing performance issues #44490

Open crosspolar opened 2 years ago

crosspolar commented 2 years ago

Is your feature request related to a problem?

From what I've seen in Python classes: people working with huge data frames, especially beginners, often don't know that calling apply on the whole frame loads the whole frame while looping, even when only one column is needed.

Example: Consider the following function and data frame

  import pandas as pd

  def complex_function(x, y=0):
      if x > 5 and x > y:
          return 1
      else:
          return 2

  df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

For a larger data frame, compare the performance of

df['col1'] = df['col1'].apply(complex_function)

and the much less efficient

df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)

Describe the solution you'd like

Just a small warning if only one column is accessed within the apply body. Something like:

Warning: Only one column seems to be accessed, causing performance issues

Describe alternatives you've considered

Leave it as it is; the difference is not significant for most users.

mzeitlin11 commented 2 years ago

Thanks for the request @crosspolar! Do you have a suggestion for how this case could be checked for efficiently?
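One possible way to check this, offered only as a rough sketch and not as anything pandas currently does: call the user function once on a proxy for the first row and record which labels it reads. The names `RecordingRow` and `columns_accessed` are hypothetical and exist only for this illustration.

```python
import pandas as pd


class RecordingRow:
    """Hypothetical proxy that records which labels are read from a row."""

    def __init__(self, row):
        self._row = row
        self.accessed = set()

    def __getitem__(self, key):
        # Remember the label, then delegate to the real row Series.
        self.accessed.add(key)
        return self._row[key]


def columns_accessed(func, df):
    """Call func once on a proxy for the first row; return the labels it read."""
    probe = RecordingRow(df.iloc[0])
    func(probe)
    return probe.accessed


df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
one_column = columns_accessed(lambda row: row['col1'] > 5, df)
two_columns = columns_accessed(lambda row: row['col1'] > row['col2'], df)
```

This probe only sees `row[...]` lookups. It misses attribute access such as `row.col1`, misses columns read only in branches the first row does not reach, and runs the user function one extra time, so any real check would need to handle those cases.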

iansheng commented 2 years ago

Hi! Interesting idea. While learning pandas, I often run into performance problems with apply. Can you describe it in more detail? Also, I can't run the second example, nor can I figure out what it means.

and the much less efficient

df['col1'] = df.apply(function(x) {complex_function(x['col1'])})
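The quoted snippet uses `function(x) {...}` syntax, which is not valid Python. A minimal runnable reading is sketched here, assuming the intent was a row-wise lambda with `axis=1`. That is an assumption, as the post does not say so. The names `fast` and `slow` are only for illustration.

```python
import pandas as pd


def complex_function(x, y=0):
    if x > 5 and x > y:
        return 1
    else:
        return 2


df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})

# Series.apply: only the values of 'col1' are iterated.
fast = df['col1'].apply(complex_function)

# DataFrame.apply with axis=1 (assumed intent): a Series is built for every
# row, even though the lambda reads only 'col1' from it.
slow = df.apply(lambda row: complex_function(row['col1']), axis=1)
```

Both calls produce the same values. The difference is that the second one constructs a full row object for each row, which is where the extra cost on large frames comes from.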