Open nathanjmcdougall opened 1 month ago
I don't think we should create a function for something that can be achieved in one line of code just because that line is hard to read. Code readability is subjective, but you can use an if statement to make it more readable (although it's a bit redundant):

```python
if s.shape[0] != 0:
    is_constant = (s[0] == s).all()
else:
    is_constant = True
```
There was an issue suggesting the same feature (#54033) but got closed without any discussion, we can continue the discussion here. I'm ok with adding this after reading @sbrugman's valid points in the original issue.
This will be a great addition
The `is_unique` function is also concise, consisting of just one line of code. Adding this as an official method, rather than leaving it as a recipe, may enhance code consistency (having both `is_unique` and `is_constant` methods) and guide users towards a more performant option.
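For reference, `is_unique` is already exposed as a `Series` property in pandas today, so the consistency argument amounts to pairing it with an `is_constant` counterpart:

```python
import pandas as pd

# Series.is_unique is an existing pandas property returning a bool
print(pd.Series([1, 2, 3]).is_unique)  # True
print(pd.Series([1, 1, 2]).is_unique)  # False
```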
The proposed `(s[0] == s).all()` is error-prone in edge cases (what if `s[0]` is NaN? What if `s` is empty?), and it is actually slower for small Series. Going this route, one should do a `.dropna()` and `.values`/`.array` first:

```python
array = s.dropna().values
is_constant = array.shape[0] == 0 or (array[0] == array).all()
```
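As a quick sanity check of the recipe above on the edge cases mentioned (a sketch; wrapping it in a helper function of my own naming):

```python
import numpy as np
import pandas as pd

def is_constant(s: pd.Series) -> bool:
    # Hypothetical helper wrapping the dropna-based recipe above
    array = s.dropna().values
    return bool(array.shape[0] == 0 or (array[0] == array).all())

print(is_constant(pd.Series([], dtype=float)))     # True: empty Series
print(is_constant(pd.Series([np.nan, np.nan])))    # True: all-NaN drops to empty
print(is_constant(pd.Series([1.0, np.nan, 1.0])))  # True: NaNs ignored
print(is_constant(pd.Series([1.0, 2.0])))          # False
```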
I posted my findings here: https://github.com/astral-sh/ruff/issues/11910. However, this solution is still O(N) and does not short-circuit. For large Series that are non-constant with high likelihood, naive Python code can be orders of magnitude faster:
```python
import pandas as pd
import numpy as np

def is_constant(array):
    if len(array) <= 1:
        return True
    first = array[0]
    return all(item == first for item in array)

const = pd.Series(np.ones(1_000_000)).values
irreg = pd.Series(np.random.randn(1_000_000)).values

%timeit is_constant(const)         # 72.7 ms ± 1.66 ms
%timeit (const[0] == const).all()  # 144 µs ± 2.3 µs
%timeit is_constant(irreg)         # 968 ns ± 6.7 ns
%timeit (irreg[0] == irreg).all()  # 129 µs ± 132 ns
```
With a numba JIT we can improve the performance drastically further:
```python
import pandas as pd
import numpy as np
import numba

@numba.njit
def is_constant(array):
    if len(array) <= 1:
        return True
    first = array[0]
    for item in array:
        if item != first:
            return False
    return True

const = pd.Series(np.ones(1_000_000)).values
irreg = pd.Series(np.random.randn(1_000_000)).values

%timeit is_constant(const)         # 457 µs ± 5.42 µs (instead of 72 ms)
%timeit (const[0] == const).all()  # 136 µs ± 311 ns
%timeit is_constant(irreg)         # 242 ns ± 1.52 ns (instead of 968 ns)
%timeit (irreg[0] == irreg).all()  # 128 µs ± 2.15 µs
```
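For completeness, a numba-free middle ground is to compare in fixed-size chunks: each chunk comparison is vectorized NumPy, but the loop can still bail out early on non-constant input. This is a sketch; the helper name and chunk size are my own choices, not an existing API:

```python
import numpy as np

def is_constant_chunked(array, chunk=4096):
    # Hypothetical helper: vectorized comparison per chunk, with early exit
    if len(array) <= 1:
        return True
    first = array[0]
    for start in range(0, len(array), chunk):
        if not (array[start:start + chunk] == first).all():
            return False  # bail out as soon as one chunk mismatches
    return True

print(is_constant_chunked(np.ones(1_000_000)))  # True
print(is_constant_chunked(np.arange(10)))       # False
```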
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
In the cookbook, a recipe is given for checking that a Series only contains constant values in a performant way:
https://pandas.pydata.org/docs/user_guide/cookbook.html#constant-series
```python
v = s.to_numpy()
is_constant = v.shape[0] == 0 or (v[0] == v).all()
```
To me, this has poor readability and is difficult to learn as an idiom because it requires the programmer to remember to check the edge case of `.shape[0] == 0`, and to remember to check the cases of missing values / NaN values, which need to be handled differently (as explained in the cookbook).

Feature Description

It would be nice to have a convenience function which provides a performant `is_constant` check on a `Series`. It could have optional arguments to configure how missing values are handled.
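The name, signature, and semantics below are purely illustrative (not an existing pandas API): one way such a convenience function could expose missing-value handling is via a `dropna` flag, building on the cookbook recipe:

```python
import numpy as np
import pandas as pd

def is_constant(s: pd.Series, dropna: bool = True) -> bool:
    # Hypothetical signature: dropna=True ignores missing values;
    # dropna=False treats any NaN as breaking constancy (since NaN != NaN)
    array = (s.dropna() if dropna else s).values
    return bool(array.shape[0] == 0 or (array[0] == array).all())

print(is_constant(pd.Series([1.0, np.nan, 1.0])))                # True
print(is_constant(pd.Series([1.0, np.nan, 1.0]), dropna=False))  # False
```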
Alternative Solutions
The alternative is just to require the user to detect the poorly performant code, possibly automatically with a linter (see below), and come up with a performant solution for their case, possibly using the cookbook. Otherwise, the simple `.nunique(dropna=...) <= 1` solution is convenient enough when performance is not a concern.

Additional Context
I came across this when using a `pandas-vet` rule via `ruff`: PD101. I like linters to detect performance issues like this one, but I prefer that they don't harm readability if possible.