pycaret / pycaret

An open-source, low-code machine learning library in Python
https://www.pycaret.org
MIT License
8.7k stars · 1.75k forks

[BUG]: Too many anomalies found, makes no sense #3266

Open davidfombella opened 1 year ago

davidfombella commented 1 year ago

Issue Description

I am using the slim version of the pycaret package (`pycaret.__version__` is `'2.3.10'`).

I am trying to reproduce the code from this Medium article: https://towardsdatascience.com/time-series-anomaly-detection-with-pycaret-706a6e2b2427

It uses Isolation Forest for anomaly detection:

```python
# train model
iforest = create_model('iforest', fraction = 0.1)
iforest_results = assign_model(iforest)
iforest_results.head()
```

At some point nearly all points are flagged as anomalies, which makes no sense.

(screenshot: plot with nearly all points flagged as anomalies)

Regards

Reproducible Example

#!/usr/bin/env python
# coding: utf-8

# To export plotly images
# pip install kaleido
# pip install -U kaleido

# https://towardsdatascience.com/time-series-anomaly-detection-with-pycaret-706a6e2b2427
# 
# 👉 Installing PyCaret
# 
# Installing PyCaret is very easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries.
# 
# PyCaret’s default installation is a slim version of pycaret, which only installs the hard dependencies listed here.

# In[ ]:

# install slim version (default)
# pip install pycaret
# install the full version
# pip install pycaret[full]

# In[1]:

import pandas as pd
import plotly.express as px

# In[20]:

#import pycaret
#pycaret.__version__
#'2.3.10'

# In[2]:

from pycaret.anomaly import *

# In[3]:

#data = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv')
data = pd.read_csv('nyc_taxi.csv')
data['timestamp'] = pd.to_datetime(data['timestamp'])
data.head()

# In[4]:

# create moving-averages
data['MA48'] = data['value'].rolling(48).mean()
data['MA336'] = data['value'].rolling(336).mean() 
data.tail(6)

# In[6]:

#plot
fig = px.line(data, 
              x="timestamp", 
              y=['value', 'MA48', 'MA336'], 
              title='NYC Taxi Trips', 
              template = 'plotly_dark')
fig.show()

# ## Data Preparation
# 
# Since algorithms cannot directly consume date or timestamp data, we will extract the features from the timestamp and will drop the actual timestamp column before training models.

# In[7]:

# drop moving-average columns
data.drop(['MA48', 'MA336'], axis=1, inplace=True)

# In[8]:

# set timestamp to index
data.set_index('timestamp', drop=True, inplace=True)

# In[9]:

# resample timeseries to hourly 
data = data.resample('H').sum()

# In[10]:

# create features from date
data['day'] = [i.day for i in data.index]
data['day_name'] = [i.day_name() for i in data.index]
data['day_of_year'] = [i.dayofyear for i in data.index]
data['week_of_year'] = [i.weekofyear for i in data.index]
data['hour'] = [i.hour for i in data.index]
# isoweekday() returns 1 (Monday) through 7 (Sunday); map it to a 0/1 flag
data['is_weekday'] = [1 if i.isoweekday() <= 5 else 0 for i in data.index]

data.head()


# ## Experiment Setup
# 
# Common to all modules in PyCaret, the setup function is the first and only mandatory step to start any machine learning experiment. Besides performing some basic processing tasks by default, PyCaret also offers a wide array of pre-processing features. To learn more about the preprocessing functionality, see the documentation:
#  https://pycaret.gitbook.io/docs/

# Whenever you initialize the `setup` function in PyCaret, it profiles the dataset and infers the data types of all input features. In this case, `day_name` and `is_weekday` are inferred as categorical and the remaining features as numeric. You can press enter to continue.
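# The inference is heuristic: object/text columns are typically treated as categorical, and low-cardinality numeric columns (such as the 0/1 is_weekday flag) may also be inferred as categorical. A minimal, hypothetical illustration of the underlying pandas dtypes (this mini-frame is not the NYC taxi data):

```python
import pandas as pd

# hypothetical mini-frame mirroring a few of the feature columns built above
df = pd.DataFrame({"day_name": ["Saturday", "Sunday", "Monday"],
                   "hour": [0, 1, 2],
                   "value": [10844, 8127, 6210]})
print(df.dtypes)
# object columns such as day_name are candidates for categorical treatment;
# int64/float64 columns are generally treated as numeric features
```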

# In[12]:

# init setup
from pycaret.anomaly import *
s = setup(data, session_id = 123)

# ## Model Training
# 
# To check the list of all available algorithms:

# In[13]:

# check list of available models
models()

# In this tutorial, I am using **Isolation Forest**, but you can replace the ID ‘iforest’ in the code below with any other model ID to change the algorithm.
# 
# If you want to learn more about the Isolation Forest algorithm, you can refer to this.
# https://en.wikipedia.org/wiki/Isolation_forest
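# The core idea: random splits isolate outliers in fewer steps than inliers, so a short average isolation path signals an anomaly. A rough standalone sketch of that principle on toy 1-D data (a hypothetical helper, not PyCaret's or scikit-learn's implementation):

```python
import random

def isolation_depth(data, target, depth=0, max_depth=20):
    """Count random splits needed to isolate `target` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:  # remaining points are identical; cannot split further
        return depth
    cut = random.uniform(lo, hi)
    # keep only the partition that contains the target
    side = [x for x in data if (x < cut) == (target < cut)]
    return isolation_depth(side, target, depth + 1, max_depth)

random.seed(42)
points = [10, 11, 12, 11, 10, 12, 11, 95]   # 95 is the obvious outlier

def mean_depth(x, trials=200):
    return sum(isolation_depth(points, x) for _ in range(trials)) / trials

outlier_depth = mean_depth(95)
inlier_depth = mean_depth(11)
print(outlier_depth, inlier_depth)  # the outlier isolates in fewer splits
```

# With the fixed seed, the outlier's average isolation depth comes out well below the inlier's, which is exactly the signal the Isolation Forest score is built on.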

# In[14]:

# train model
iforest = create_model('iforest', fraction = 0.1)
iforest_results = assign_model(iforest)
iforest_results.head()

# Notice that two new columns are appended: `Anomaly`, which contains 1 for outliers and 0 for inliers, and `Anomaly_Score`, a continuous value (the decision function) that the algorithm uses internally to decide which points are anomalies.
# 
# Sample rows from `iforest_results`, filtered to `Anomaly == 1`:
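# A quick sanity check relevant to this report: the share of flagged rows should sit near the `fraction` passed to `create_model`. A toy stand-in frame (not the real `iforest_results`, which requires running the full pipeline):

```python
import pandas as pd

# toy stand-in: with fraction=0.1, roughly 10% of rows should carry Anomaly == 1
results = pd.DataFrame({
    "value": range(100),
    "Anomaly": [1 if i % 10 == 0 else 0 for i in range(100)],
})
flagged = results["Anomaly"].mean()
print(f"flagged fraction: {flagged:.2f}")  # 0.10 here; a value near 1.0 would reproduce the reported bug
```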

# In[15]:

# check anomalies
iforest_results[iforest_results['Anomaly'] == 1].head()

# In[18]:

# check anomalies
iforest_results[iforest_results['Anomaly'] == 1].tail()

# ## Plot anomalies
# We can now plot anomalies on the graph to visualize.

# In[16]:

import plotly.graph_objects as go

# In[17]:

# plot value on y-axis and date on x-axis
fig = px.line(iforest_results, 
              x=iforest_results.index, 
              y="value", 
              title='NYC TAXI TRIPS - UNSUPERVISED ANOMALY DETECTION', 
              template = 'plotly_dark')

# create list of outlier_dates
outlier_dates = iforest_results[iforest_results['Anomaly'] == 1].index

# obtain y value of anomalies to plot
y_values = [iforest_results.loc[i]['value'] for i in outlier_dates]

fig.add_trace(go.Scatter(x=outlier_dates, 
                         y=y_values, 
                         mode = 'markers', 
                         name = 'Anomaly', 
                         marker=dict(color='red',size=10)))

fig.show()

fig.write_image("NYC_trips with Anomalies.png") 

# Notice that the model has picked up several anomalies around Jan 1st, which is around New Year's Eve. It has also detected a couple of anomalies between Jan 18 and Jan 22, when a fast-moving, disruptive North American blizzard moved through the Northeast, dumping 30 cm of snow in areas around New York City.
# 
# If you google the dates around the other red points on the graph, you will probably find leads on why the model picked those points up as anomalous.
# 
# I hope you appreciate the ease of use and simplicity of PyCaret. In just a few lines of code and a few minutes of experimentation, I have trained an unsupervised anomaly detection model and labeled the dataset to detect anomalies in time series data.

# Coming Soon!
# 
# Next week I will be writing a tutorial on training custom models in PyCaret using the PyCaret Regression Module. You can follow me on Medium, LinkedIn, and Twitter to get instant notifications whenever a new tutorial is released.
# 
# There is no limit to what you can achieve using this lightweight workflow automation library in Python. If you find it useful, please do not forget to give us ⭐️ on our GitHub repository.
# 
# To hear more about PyCaret, follow us on LinkedIn and YouTube.
# 
# Join us on our Slack channel. Invite link here.


Expected Behavior

With fraction = 0.1, roughly 10% of the points should be flagged as anomalies.

Actual Results

Far too many points are flagged as anomalies; at some point essentially every point is labeled an anomaly.

Installed Versions

2.3.10
tvdboom commented 1 year ago

Hi @davidfombella, the anomaly detection module hasn't been maintained for a while and will be removed in an upcoming release. For that reason, we won't be fixing bugs in the module.