Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Since I started using Pandas, I have exclusively used syntax like df[df.A > 10] to select rows of a DataFrame, or elements therein (df[df.A > 10].B) matching some condition. (By the way, is there a name for this style of selection?) I probably picked up this technique from "10 minutes to pandas" (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting), specifically
A where operation with setting.
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
While that section intends to describe how to set values, it also by chance shows selection with the df[df.A > 10] method. As far as I can tell, this is the only time this page shows how to perform conditional selection. Since selecting part of a DataFrame is such a common operation, I suggest adding a specific subsection under https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection for Conditional selection or similar. As discussed below, it should illustrate both df[df.A > 10] and query.
I would also like to suggest an addition to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query-python-versus-pandas-syntax-comparison and the other places where df[df.A > 10] selection appears. Suppose one has a DataFrame with columns including 'A1', 'A2', 'A3', ..., B, and that one wishes to run some process on subsets of each column. Rather than building the list of columns and iterating over them, like with for col in ['A1', 'A2', 'A3']:, I prefer to use
for col_suffix in [1,2,3]:
these_data = df[df[f'A{col_suffix}'] > 0]
x = these_data.B
y = these_data[f'A{col_suffix}']
I have not seen this particular idiom in the pandas docs. While the standard advice (https://stackoverflow.com/questions/2259224/python-inserting-a-variable-value-into-a-variable-name) is to avoid such constructions, DataFrames often come with collections of column names like ['A1', 'A2', 'A3'], so I have found this "dynamic" approach quite useful—particularly when translating code that I wrote and tested on a single column name to operate on many columns.
Differences in the output of df[df.A > 10] and query
Finally, in selecting with df[df.A > 10], I have encountered a warning that does not appear with query. I am running this code in a Jupyter notebook. When I perform the selection with
nonzero_data = df[df[f'A{col_suffix} %'] > 0]
and then plot two columns from nonzero_data against one another, I get the following error:
.../lib/python3.6/site-packages/matplotlib/cbook/__init__.py:2062: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
x[:, None]
If I run the same code but replace the selection line with
nonzero_data = df.query(f"`A{col_suffix} %` > 0")
no error appears. I can see that matplotlib is throwing the error, but the fact that it depends on how I select the data in pandas suggests the latter is returning objects that differ in some way with each selection method. While I suspect this behavior is documented, I was not able to locate that information.
In both cases, the x and y values being plotted appear to be identical. Calling display(<var>) just before plt.plot(x,y), I find that x is
The error disappears, even selecting with nonzero_data = df[df[f'A{col_suffix} %'] > 0], if I recast x and y as numpy arrays. What, then, is the difference in the DataFrames that query or df[df[f'A{col_suffix} %'] > 0] return?
Location of the documentation
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query-python-versus-pandas-syntax-comparison
Documentation problem and suggested fixes
Selection with
df[df.A > 10]
Since I started using Pandas, I have exclusively used syntax like
df[df.A > 10]
to select rows of a DataFrame, or elements therein (df[df.A > 10].B
) matching some condition. (By the way, is there a name for this style of selection?) I probably picked up this technique from "10 minutes to pandas" (https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting), specificallyWhile that section intends to describe how to set values, it also by chance shows selection with the
df[df.A > 10]
method. As far as I can tell, this is the only time this page shows how to perform conditional selection. Since selecting part of a DataFrame is such a common operation, I suggest adding a specific subsection under https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection forConditional selection
or similar. As discussed below, it should illustrate bothdf[df.A > 10]
andquery
.How to select conditionally
From the "10 minutes" page, one might wish to learn more about how to select conditionally. The obvious place to look is https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html, since the "10 minutes" page specifically mentions
where
just beforedf2[df2 > 0]
. However,where
is intended for replacement, not selection. I suggest adding a note to thewhere
page that points to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html, sincequery
is the appropriate tool for conditional selection.Dynamic construction of column names
I would also like to suggest an addition to https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#query-python-versus-pandas-syntax-comparison and the other places where
df[df.A > 10]
selection appears. Suppose one has a DataFrame with columns including'A1', 'A2', 'A3', ..., B
, and that one wishes to run some process on subsets of each column. Rather than building the list of columns and iterating over them, like withfor col in ['A1', 'A2', 'A3']:
, I prefer to useI have not seen this particular idiom in the pandas docs. While the standard advice (https://stackoverflow.com/questions/2259224/python-inserting-a-variable-value-into-a-variable-name) is to avoid such constructions, DataFrames often come with collections of column names like
['A1', 'A2', 'A3']
, so I have found this "dynamic" approach quite useful—particularly when translating code that I wrote and tested on a single column name to operate on many columns.Differences in the output of
df[df.A > 10]
andquery
Finally, in selecting with
df[df.A > 10]
, I have encountered a warning that does not appear withquery
. I am running this code in a Jupyter notebook. When I perform the selection withand then plot two columns from
nonzero_data
against one another, I get the following error:If I run the same code but replace the selection line with
no error appears. I can see that matplotlib is throwing the error, but the fact that it depends on how I select the data in pandas suggests the latter is returning objects that differ in some way with each selection method. While I suspect this behavior is documented, I was not able to locate that information.
In both cases, the
x
andy
values being plotted appear to be identical. Callingdisplay(<var>)
just beforeplt.plot(x,y)
, I find thatx
isand
y
isThe error disappears, even selecting with
nonzero_data = df[df[f'A{col_suffix} %'] > 0]
, if I recastx
andy
as numpy arrays. What, then, is the difference in the DataFrames thatquery
ordf[df[f'A{col_suffix} %'] > 0]
return?