pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Introduce a built-in method for outlier detection #54080

Closed g3rley closed 1 year ago

g3rley commented 1 year ago

Feature Type

Problem Description

Outlier detection is a common step in data analysis: it identifies data points that deviate significantly from the rest of the dataset. pandas currently has no dedicated method for it, so users must rely on external libraries or custom implementations. This issue proposes adding a built-in outlier detection method to pandas as a convenient, integrated solution.

Feature Description


import pandas as pd

class CustomDataFrame(pd.DataFrame):
    def __init__(self, data=None, index=None, columns=None, detect_outliers: bool = False):
        if data is not None and detect_outliers:
            # Normalize dicts, lists, and arrays to a DataFrame before computing statistics
            data = pd.DataFrame(data, index=index, columns=columns)
            # Perform outlier detection on the data
            data = self.detect_outliers(data)

        # Proceed with DataFrame initialization
        super().__init__(data, index=index, columns=columns)

    @staticmethod
    def detect_outliers(data):
        # Calculate the per-column mean and standard deviation (assumes numeric columns)
        mean = data.mean()
        std = data.std()

        # Set the threshold for outliers (e.g., 3 standard deviations from the mean)
        threshold = 3 * std

        # Create a boolean mask for outliers
        outliers_mask = (data - mean).abs() > threshold

        # Replace outliers with NaN in a copy, leaving the caller's data unmodified
        return data.mask(outliers_mask)
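For comparison, the same masking idea could probably be prototyped today without subclassing, using the existing `pd.api.extensions.register_dataframe_accessor` extension hook. This is only a rough sketch: the accessor name `outliers` and the method `mask_zscore` are hypothetical names made up for illustration, not an existing or proposed pandas API, and the method only looks at numeric columns.

```python
import pandas as pd


# "outliers" and "mask_zscore" are hypothetical names chosen for this sketch.
@pd.api.extensions.register_dataframe_accessor("outliers")
class OutlierAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._obj = pandas_obj

    def mask_zscore(self, threshold: float = 3.0) -> pd.DataFrame:
        """Return the numeric columns with values beyond `threshold`
        standard deviations from the column mean replaced by NaN."""
        numeric = self._obj.select_dtypes(include="number")
        z_scores = (numeric - numeric.mean()) / numeric.std()
        # mask() returns a new DataFrame; the original is left unchanged
        return numeric.mask(z_scores.abs() > threshold)


df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]})
cleaned = df.outliers.mask_zscore(threshold=2.0)
print(cleaned["a"].isna().sum())  # 1 (only the value 1000.0 is masked)
```

An accessor keeps the logic in user code and works on any ordinary `DataFrame`, which avoids the constructor changes that the subclass approach needs.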

Alternative Solutions


import pandas as pd

def detect_outliers(data, method='z-score', threshold=3):
    if method == 'z-score':
        # Calculate z-scores for each data point
        z_scores = (data - data.mean()) / data.std()

        # Identify outliers based on threshold
        outliers = data[z_scores.abs() > threshold]

    elif method == 'isolation-forest':
        # Use Isolation Forest algorithm for outlier detection
        from sklearn.ensemble import IsolationForest

        # Create an instance of Isolation Forest
        isolation_forest = IsolationForest(contamination='auto')

        # scikit-learn expects 2-D input, so convert a Series to a one-column DataFrame
        features = data.to_frame() if isinstance(data, pd.Series) else data

        # Fit the model and predict outliers (-1 marks an outlier)
        outliers = data[isolation_forest.fit_predict(features) == -1]

    # ... Additional elif statements for other outlier detection methods ...

    else:
        raise ValueError(f"Unknown outlier detection method: {method!r}")

    return outliers

# Example usage
data = pd.Series([1, 2, 3, 100, 4, 5, 6, 200])
outliers = detect_outliers(data, method='z-score', threshold=2)
print(outliers)  # only 200 is flagged; 100 has a z-score of about 0.82
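One more rule that a `method` argument might cover is Tukey's IQR rule, which can already be written with `Series.quantile`. The sketch below is illustrative only (the variable names are mine and the 1.5 multiplier is the conventional default, not something pandas provides); it runs the z-score rule and the IQR rule on the same example data, and the two rules flag different points.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 100, 4, 5, 6, 200])

# Z-score rule: flag points more than 2 sample standard deviations from the mean
z_scores = (data - data.mean()) / data.std()
z_flagged = data[z_scores.abs() > 2]

# Tukey's IQR rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
iqr_flagged = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_flagged.tolist())    # [200]
print(iqr_flagged.tolist())  # [100, 200]
```

The two extreme values inflate the mean and standard deviation enough that 100 passes the z-score test, while the quartile-based fences are less affected by them.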

Additional Context

No response

MarcoGorelli commented 1 year ago

Thanks @g3rley for your suggestion, but this is out of scope I'm afraid (and "outlier" is very subjective).

closing then, but thanks for the issue