pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add a Series method which checks whether a Series is constant #58806

Open nathanjmcdougall opened 1 month ago

nathanjmcdougall commented 1 month ago

Feature Type

Problem Description

In the cookbook, a recipe is given for checking that a Series only contains constant values in a performant way:

https://pandas.pydata.org/docs/user_guide/cookbook.html#constant-series

v = s.to_numpy()
is_constant = v.shape[0] == 0 or (s[0] == s).all()

To me, this is hard to read and hard to learn as an idiom: the programmer has to remember the empty-Series edge case (.shape[0] == 0) and has to remember that missing values / NaN values need to be handled differently (as explained in the cookbook).
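
The edge cases mentioned above can be seen in a short sketch. The Series below are made up for illustration, and `naive_is_constant` is a hypothetical name; `.iloc[0]` is used instead of `s[0]` so the example does not depend on the index labels.

```python
import numpy as np
import pandas as pd

def naive_is_constant(s: pd.Series) -> bool:
    # Cookbook-style check, with an explicit guard for the empty case.
    return s.shape[0] == 0 or bool((s.iloc[0] == s).all())

constant = pd.Series([1.0, 1.0, 1.0])
all_nan = pd.Series([np.nan, np.nan])
with_nan = pd.Series([1.0, np.nan, 1.0])
empty = pd.Series([], dtype="float64")

print(naive_is_constant(constant))  # True
print(naive_is_constant(all_nan))   # False, because NaN != NaN
print(naive_is_constant(with_nan))  # False, the NaN entry fails the equality
print(naive_is_constant(empty))     # True, only because of the guard
```

Whether the all-NaN and mixed cases should count as constant depends on the caller, which is why the one-liner alone is not enough.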

Feature Description

It would be nice to have a convenience function which provided a performant is_constant check on a Series.

It could have optional arguments to configure how missing values are handled.
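
As a rough sketch of what such a function might look like: the name `is_constant` and the `dropna` parameter are assumptions for illustration (the parameter name is borrowed from `nunique`), and nothing here is part of the pandas API.

```python
import pandas as pd

def is_constant(s: pd.Series, dropna: bool = True) -> bool:
    """Hypothetical helper; not part of the pandas API."""
    values = s.dropna() if dropna else s
    if len(values) == 0:
        # An empty Series (or one with only missing values dropped) is constant.
        return True
    missing = values.isna()
    if missing.any():
        # Only reachable when dropna=False: missing values count as a value,
        # so the Series is constant only if every entry is missing.
        return bool(missing.all())
    return bool((values == values.iloc[0]).all())

s = pd.Series([1.0, float("nan"), 1.0])
print(is_constant(s))                # True, the NaN is dropped first
print(is_constant(s, dropna=False))  # False, NaN differs from 1.0
```

With this design, `dropna=False` treats an all-missing Series as constant, which is one possible convention among several.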

Alternative Solutions

The alternative is to leave it to the user to detect the slow code, possibly automatically with a linter (see below), and to come up with a performant solution for their case, possibly using the cookbook. Otherwise, the simple .nunique(dropna=...) <= 1 solution is convenient enough when performance is not a concern.
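
For reference, a minimal example of the `nunique` approach and how `dropna` changes the answer; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 1.0])

# Missing values are ignored by default (dropna=True).
print(s.nunique() <= 1)              # True
# With dropna=False, NaN counts as its own distinct value.
print(s.nunique(dropna=False) <= 1)  # False

# An empty Series has zero unique values, so "<= 1" also covers that case.
print(pd.Series([], dtype="float64").nunique() <= 1)  # True
```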

Additional Context

I came across this when using a pandas-vet rule via ruff: PD101

I like that the linter detects performance issues like this one, but I would prefer that the suggested fixes not harm readability where possible.

Aloqeely commented 1 month ago

I don't think we should create a function that can be achieved by 1 line of code just because that line of code is not readable. Code readability is subjective, but you can use an if statement to make it more readable (although it's a bit redundant):

if s.shape[0] != 0:
    is_constant = (s[0] == s).all()
else:
    is_constant = True

There was an earlier issue suggesting the same feature (#54033), but it was closed without any discussion, so we can continue the discussion here. I'm OK with adding this after reading @sbrugman's valid points in the original issue.

PushpitSB commented 1 month ago

This will be a great addition

miguelpgarcia commented 3 weeks ago

> I don't think we should create a function that can be achieved by 1 line of code just because that line of code is not readable. Code readability is subjective, but you can use an if statement to make it more readable (although it's a bit redundant):
>
> if s.shape[0] != 0:
>     is_constant = (s[0] == s).all()
> else:
>     is_constant = True
>
> There was an earlier issue suggesting the same feature (#54033), but it was closed without any discussion, so we can continue the discussion here. I'm OK with adding this after reading @sbrugman's valid points in the original issue.

The is_unique property is also concise, consisting of just one line of code. Adding this as an official method, rather than leaving it as a recipe, may improve consistency (having both is_unique and is_constant) and guide users towards a more performant option.
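
To make the comparison concrete, here is a small illustration with made-up data. `Series.is_unique` exists today; the constant check still has to be spelled out by hand.

```python
import pandas as pd

s = pd.Series([3, 3, 3])

# Existing property: True only if no value is repeated.
print(s.is_unique)       # False
# No built-in counterpart for "all values equal", so a recipe is needed.
print(s.nunique() <= 1)  # True
```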

randolf-scholz commented 3 weeks ago

The proposed (s[0] == s).all() is error-prone in edge cases (What if s[0] is NaN? What if s is empty?), and actually slower for small Series. Going this route, one should do a .dropna() and .values/.array first.

array = s.dropna().values
is_constant = array.shape[0] == 0 or (array[0] == array).all()

I posted my findings here: https://github.com/astral-sh/ruff/issues/11910. However, this solution is still O(N) and does not short-circuit. For large Series that are very likely non-constant, naive Python code can be orders of magnitude faster.

import pandas as pd
import numpy as np

def is_constant(array):
    if len(array) <= 1:
        return True
    first = array[0]
    return all(item == first for item in array)

const = pd.Series(np.ones(1_000_000)).values
irreg = pd.Series(np.random.randn(1_000_000)).values

%timeit is_constant(const)         # 72.7 ms ± 1.66 ms 
%timeit (const[0] == const).all()  # 144 µs ± 2.3 µs
%timeit is_constant(irreg)         # 968 ns ± 6.7 ns
%timeit (irreg[0] == irreg).all()  # 129 µs ± 132 ns

With Numba JIT compilation (numba.njit), the same loop becomes drastically faster:

import pandas as pd
import numpy as np
import numba

@numba.njit
def is_constant(array):
    if len(array) <= 1:
        return True
    first = array[0]
    for item in array:
        if item != first:
            return False
    return True

const = pd.Series(np.ones(1_000_000)).values
irreg = pd.Series(np.random.randn(1_000_000)).values

%timeit is_constant(const)         # 457 µs ± 5.42 µs  (instead of 72 ms)
%timeit (const[0] == const).all()  # 136 µs ± 311 ns
%timeit is_constant(irreg)         # 242 ns ± 1.52 ns  (instead of 968 ns)
%timeit (irreg[0] == irreg).all()  # 128 µs ± 2.15 µs
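
One way to get some short-circuiting without Numba might be to compare the array in blocks. This is only a sketch that was not part of the discussion above: `is_constant_chunked` and the default `chunk_size` are hypothetical, no timings were measured for it, and it assumes missing values have already been removed (for example via `s.dropna().values`).

```python
import numpy as np

def is_constant_chunked(array: np.ndarray, chunk_size: int = 4096) -> bool:
    """Hypothetical helper: vectorized comparison with an early exit per chunk."""
    if array.shape[0] <= 1:
        return True
    first = array[0]
    for start in range(0, array.shape[0], chunk_size):
        # Compare one block at a time and stop at the first block with a mismatch.
        if not (array[start:start + chunk_size] == first).all():
            return False
    return True

const = np.ones(1_000_000)
irreg = np.random.default_rng(0).standard_normal(1_000_000)

print(is_constant_chunked(const))  # True
print(is_constant_chunked(irreg))  # False
```

The trade-off is the usual one: a non-constant array can be rejected after the first block, while a constant array still needs a full pass plus some Python-level loop overhead.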