sintel-dev / Orion

A machine learning library for detecting anomalies in signals.
https://sintel.dev/Orion/
MIT License

Preprocessing non-contiguous segments #171

Open sarahmish opened 3 years ago

sarahmish commented 3 years ago

Currently, most pipelines share the same preprocessing primitives, applied in the following order:

  1. mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate: makes the signal equi-spaced based on the specified interval.

  2. sklearn.impute.SimpleImputer: imputes missing values.

  3. sklearn.preprocessing.MinMaxScaler: normalizes the data to a specified range.

  4. mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences: creates multiple training window examples based on the window_size and step_size.

However, we do not always want to make the whole signal equi-spaced; sometimes we would rather retain the gaps within the signal. For this task, there are two main considerations:

  1. Normalize the data first to maintain the specified range.
  2. Create segments based on the suggested max_gap, then apply primitives 1, 2, and 4 shown above to each segment and concatenate the results.

The sequence of preprocessing primitives would be:

"sklearn.preprocessing.MinMaxScaler",
"orion.primitives.timeseries_preprocessing.segment", # suggested
"mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate",
"sklearn.impute.SimpleImputer",
"mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
"orion.primitives.timeseries_preprocessing.concatenate" # suggested
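
To make the two suggested primitives more concrete, here is a minimal sketch of what `segment` and `concatenate` might do. Neither primitive exists yet, so the function names, signatures, and the representation of a signal as parallel lists of timestamps and values are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# A segment is a pair of parallel lists: (timestamps, values).
Segment = Tuple[List[int], List[float]]


def segment(timestamps: Sequence[int], values: Sequence[float],
            max_gap: int) -> List[Segment]:
    """Hypothetical sketch: split a signal wherever two consecutive
    timestamps are more than max_gap apart."""
    segments: List[Segment] = []
    for ts, value in zip(timestamps, values):
        # Start a new segment on the first point, or when the distance
        # to the previous timestamp exceeds max_gap.
        if not segments or ts - segments[-1][0][-1] > max_gap:
            segments.append(([], []))
        segments[-1][0].append(ts)
        segments[-1][1].append(value)
    return segments


def concatenate(chunks: Sequence[Sequence]) -> list:
    """Hypothetical sketch: join per-segment outputs into one list."""
    joined: list = []
    for chunk in chunks:
        joined.extend(chunk)
    return joined


timestamps = [0, 10, 20, 100, 110, 300]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
parts = segment(timestamps, values, max_gap=10)
```

In this sketch each segment would then go through the aggregate, impute, and windowing steps on its own, and `concatenate` would join the per-segment results.
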

kb1ooo commented 2 years ago

I don't see any activity here, but I'm wondering if this may have been addressed since Feb?

sarahmish commented 2 years ago

Hi @kb1ooo! It's still in the works.

kb1ooo commented 2 years ago

@sarahmish thanks. Is there some work on it checked into a branch?

sarahmish commented 2 years ago

There isn't an active branch for this issue. The primary change for this feature is in the rolling_window_sequences primitive. It currently works by slicing based on indexes. To make this change, we need to introduce slicing by timestamps and use a max_gap parameter to indicate the maximum gap allowed between one element and the next.
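
As a rough illustration of that change, the sketch below slides a window over the signal and keeps only the windows in which every pair of consecutive timestamps is at most max_gap apart. It is a simplified stand-in, not the MLPrimitives implementation: the function name and signature are hypothetical, and it omits the target handling of the real primitive.

```python
from typing import List, Sequence, Tuple


def rolling_windows_with_max_gap(
    timestamps: Sequence[int],
    values: Sequence[float],
    window_size: int,
    step_size: int,
    max_gap: int,
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical sketch: return (windows, start_timestamps), skipping
    any window that spans a gap larger than max_gap."""
    windows: List[List[float]] = []
    starts: List[int] = []
    for start in range(0, len(values) - window_size + 1, step_size):
        end = start + window_size
        window_ts = timestamps[start:end]
        # A window is usable only if no two neighbouring timestamps
        # inside it are further apart than max_gap.
        contiguous = all(
            later - earlier <= max_gap
            for earlier, later in zip(window_ts, window_ts[1:])
        )
        if contiguous:
            windows.append(list(values[start:end]))
            starts.append(timestamps[start])
    return windows, starts


timestamps = [0, 1, 2, 3, 10, 11, 12]
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
windows, starts = rolling_windows_with_max_gap(
    timestamps, values, window_size=3, step_size=1, max_gap=1
)
```

Here the two windows that would straddle the jump from timestamp 3 to timestamp 10 are dropped, so no training example mixes data from both sides of the gap.
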

kb1ooo commented 2 years ago

@sarahmish ok right. Is there a simpler intermediate version where the data is pre-segmented (i.e. don't delegate the segmentation logic to Orion; let it be the responsibility of the caller), and you would pass the data as, say, a list of dataframes instead of one dataframe? Then just iterate through the list, applying the same pipeline, and concatenate the rolling_window_sequences outputs.

sarahmish commented 2 years ago

@kb1ooo that's definitely possible. Mechanically, you can just iterate over each dataframe, calling orion.fit, as a simple workaround. My only concern is that you will be training the ML model on epochs with different batches each time. I don't know how that will affect the learning of the underlying model.
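
For reference, the pre-segmented variant discussed above could look roughly like the sketch below, which builds windows per segment and pools them before any training happens. The helper names are hypothetical, plain lists stand in for dataframes, and whether an Orion pipeline can consume windows prepared this way is an assumption, not something confirmed in this thread.

```python
from typing import List, Sequence


def make_windows(values: Sequence[float], window_size: int,
                 step_size: int = 1) -> List[List[float]]:
    """Index-based rolling windows over a single contiguous segment."""
    return [
        list(values[i:i + window_size])
        for i in range(0, len(values) - window_size + 1, step_size)
    ]


def windows_from_segments(segments: Sequence[Sequence[float]],
                          window_size: int,
                          step_size: int = 1) -> List[List[float]]:
    """Hypothetical sketch: window each caller-provided segment
    separately, then pool all windows into one training set."""
    all_windows: List[List[float]] = []
    for values in segments:
        # Segments shorter than window_size contribute no windows.
        all_windows.extend(make_windows(values, window_size, step_size))
    return all_windows


segments = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0], [5.0, 6.0, 7.0]]
pooled = windows_from_segments(segments, window_size=3)
```

Pooling the windows first would let the model be trained once on a single set of examples, which may avoid the per-segment training concern raised above, though that has not been tested here.
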