pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.57k stars 17.9k forks source link

DOC: Methods of conditional selection #38868

Open alindman opened 3 years ago

alindman commented 3 years ago

Location of the documentation

https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query-python-versus-pandas-syntax-comparison

Documentation problem and suggested fixes

Selection with df[df.A > 10]

Since I started using Pandas, I have exclusively used syntax like df[df.A > 10] to select rows of a DataFrame, or elements therein (df[df.A > 10].B) matching some condition. (By the way, is there a name for this style of selection?) I probably picked up this technique from "10 minutes to pandas" (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting), specifically

A where operation with setting.
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2

While that section intends to describe how to set values, it also by chance shows selection with the df[df.A > 10] method. As far as I can tell, this is the only time this page shows how to perform conditional selection. Since selecting part of a DataFrame is such a common operation, I suggest adding a specific subsection under https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection for Conditional selection or similar. As discussed below, it should illustrate both df[df.A > 10] and query.

How to select conditionally

From the "10 minutes" page, one might wish to learn more about how to select conditionally. The obvious place to look is https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html, since the "10 minutes" page specifically mentions where just before df2[df2 > 0]. However, where is intended for replacement, not selection. I suggest adding a note to the where page that points to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html, since query is the appropriate tool for conditional selection.

Dynamic construction of column names

I would also like to suggest an addition to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query-python-versus-pandas-syntax-comparison and the other places where df[df.A > 10] selection appears. Suppose one has a DataFrame with columns including 'A1', 'A2', 'A3', ..., B, and that one wishes to run some process on subsets of each column. Rather than building the list of columns and iterating over them, like with for col in ['A1', 'A2', 'A3']:, I prefer to use

for col_suffix in [1,2,3]:
    these_data = df[df[f'A{col_suffix}'] > 0]
    x = these_data.B
    y = these_data[f'A{col_suffix}']

I have not seen this particular idiom in the pandas docs. While the standard advice (https://stackoverflow.com/questions/2259224/python-inserting-a-variable-value-into-a-variable-name) is to avoid such constructions, DataFrames often come with collections of column names like ['A1', 'A2', 'A3'], so I have found this "dynamic" approach quite useful—particularly when translating code that I wrote and tested on a single column name to operate on many columns.

Differences in the output of df[df.A > 10] and query

Finally, in selecting with df[df.A > 10], I have encountered a warning that does not appear with query. I am running this code in a Jupyter notebook. When I perform the selection with

nonzero_data = df[df[f'A{col_suffix} %'] > 0]

and then plot two columns from nonzero_data against one another, I get the following error:

.../lib/python3.6/site-packages/matplotlib/cbook/__init__.py:2062: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
  x[:, None]

If I run the same code but replace the selection line with

nonzero_data = df.query(f"`A{col_suffix} %` > 0")

no error appears. I can see that matplotlib is throwing the error, but the fact that it depends on how I select the data in pandas suggests the latter is returning objects that differ in some way with each selection method. While I suspect this behavior is documented, I was not able to locate that information.

In both cases, the x and y values being plotted appear to be identical. Calling display(<var>) just before plt.plot(x,y), I find that x is

datetime
2020-12-20 21:48:00    106.140
2020-12-20 22:29:00    108.690
...
Name: B, dtype: float64

and y is

datetime
2020-12-20 21:48:00     6.980
2020-12-20 22:29:00     7.325
...
Name: A1 %, dtype: float64

The error disappears, even selecting with nonzero_data = df[df[f'A{col_suffix} %'] > 0], if I recast x and y as numpy arrays. What, then, is the difference in the DataFrames that query or df[df[f'A{col_suffix} %'] > 0] return?

jreback commented 3 years ago

i am really not even sure what you are asking

can u edit and boil this down

rhshadrach commented 3 years ago

@alindman I think you're referring to https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#boolean-indexing