Using Custom Multivariate Time-Series Data with Orion

makinno commented 1 year ago

Orion version: 0.5.1
Python version: 3.8
Operating System: Windows 11 Pro

Description

I am interested in using the Orion library for detecting anomalies in multivariate time-series data. Specifically, I would like to implement the available ML pipelines in Orion with my custom dataset, which is in the following CSV format:

timestamp  X-axis Acceleration  Y-axis Acceleration
 0.002986            -0.109640            -0.048954
 0.007177            -0.299223             0.103052
 0.011266             0.193694             0.065050
 0.015449            -0.450890             0.103052
 0.019823            -0.412973             0.103052

I have reviewed the notebook tutorials and documentation, but I couldn't find a clear solution for integrating my custom data into the Orion pipelines.

Specific Questions:

How can I adapt my custom multivariate time-series data to work with Orion pipelines?
Are there any specific data preprocessing steps or formats I should consider to align my data with Orion's expectations?
Could you provide an example or guidance on how to create a pipeline for anomaly detection with my data?

I appreciate any help or guidance you can provide in getting started with Orion for my specific use case.

Thank you for your support!

sarahmish commented 1 year ago

Hi @makinno – thanks for using Orion!

I will address each question separately; please reply if you still have questions.

1. custom multi-variate data

To use custom data, edit the interval hyperparameter of the primitive mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1 which is detailed in this documentation.

To set the interval: based on the first 5 entries shown here, the average gap between consecutive entries is ~0.00420.
(optional) I also recommend converting the timestamp column to integers, which can be done by multiplying the column by 10^6. If you do, scale the interval by the same factor (roughly 4200 instead of 0.0042).
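As a rough sketch of how the interval estimate above can be reproduced, the snippet below averages the gaps between the five sample timestamps, then repeats the calculation after the optional integer conversion. The helper name average_gap is hypothetical and not part of Orion; only the Python standard library is used.

```python
from statistics import mean

# First five timestamps from the sample in the question (assumed to be seconds).
timestamps = [0.002986, 0.007177, 0.011266, 0.015449, 0.019823]


def average_gap(values):
    """Mean difference between consecutive timestamps (hypothetical helper)."""
    ordered = sorted(values)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps)


# Interval in the original units: about 0.0042.
interval_seconds = average_gap(timestamps)

# Optional integer conversion: multiply by 10^6 and round to the nearest integer.
integer_timestamps = [round(t * 10**6) for t in timestamps]

# The interval must be expressed in the same (scaled) units as the timestamps.
interval_integer = average_gap(integer_timestamps)

print(interval_seconds, integer_timestamps, interval_integer)
```

Rounding (rather than truncating) avoids off-by-one values caused by floating-point representation when scaling.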

As for multivariate modeling, all pipelines in Orion accept multivariate input but produce univariate output. Therefore, if you would like to detect anomalies in both X-axis Acceleration and Y-axis Acceleration, you will need to create two separate pipelines and set target_column to 1 instead of 0 in the second one.

hyperparameters = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "interval": 0.0042
    },
    "mlstars.custom.timeseries_preprocessing.rolling_window_sequences#1": {
        "target_column": 0
    }
}
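One possible way to avoid writing the dictionary twice is to generate it per target column. This is only a sketch: build_hyperparameters is a hypothetical helper, not an Orion function, and it assumes that target_column counts the signal columns in order after the timestamp column (0 for X-axis Acceleration, 1 for Y-axis Acceleration).

```python
# Primitive names copied from the hyperparameters shown above.
TIME_SEGMENTS = "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1"
ROLLING_WINDOWS = "mlstars.custom.timeseries_preprocessing.rolling_window_sequences#1"


def build_hyperparameters(target_column, interval=0.0042):
    """Hypothetical helper: one hyperparameter dict per target column."""
    return {
        TIME_SEGMENTS: {"interval": interval},
        ROLLING_WINDOWS: {"target_column": target_column},
    }


# Signal columns in the order they appear after the timestamp column (assumed).
signal_columns = ["X-axis Acceleration", "Y-axis Acceleration"]

# One dict per signal; each would be passed to its own Orion pipeline.
hyperparameters_by_signal = {
    name: build_hyperparameters(index)
    for index, name in enumerate(signal_columns)
}

print(hyperparameters_by_signal["Y-axis Acceleration"])
```

Each resulting dictionary can then be passed as the hyperparameters argument of a separate pipeline.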

2. data format

I addressed this question in the previous point.

3. example code

There are numerous pipelines to choose from; here is example code for AER:

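from orion import Orion

# `data` is a pandas DataFrame holding the timestamp column and the signal columns.
hyperparameters = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "interval": 0.0042
    },
    "orion.primitives.aer.AER#1": {
        "epochs": 5,
        "verbose": True
    }
}

orion = Orion(
    pipeline="aer",
    hyperparameters=hyperparameters
)

# Train the AER pipeline on the custom data.
orion.fit(data)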

makinno commented 1 year ago

Hello @sarahmish

Thank you for your prompt and helpful response to my questions. Your guidance and detailed explanations were instrumental in resolving the challenges I encountered, and I greatly appreciate your support.

Given that the X-axis acceleration and the Y-axis acceleration data are correlated, I'd like to be able to detect anomalies on both axes using a single ML model. My question is, once the two separate pipelines have been created, is there a way to combine the pipelines when training?

Currently, my code is as follows:

# Define hyperparameters for X-axis Acceleration
hyperparameters_X = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "time_column": "Timestamp",
        "interval": 4399
    },
    "mlstars.custom.timeseries_preprocessing.rolling_window_sequences#1": {
        "target_column": 0
    },
    'orion.primitives.aer.AER#1': {
        'epochs': 5,
        'verbose': True
    }
}

# Define hyperparameters for Y-axis Acceleration
hyperparameters_Y = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "time_column": "Timestamp",
        "interval": 4399
    },
    "mlstars.custom.timeseries_preprocessing.rolling_window_sequences#1": {
        "target_column": 1
    },
    'orion.primitives.aer.AER#1': {
        'epochs': 5,
        'verbose': True
    }
}

orion_X = Orion(pipeline='aer', hyperparameters=hyperparameters_X)
orion_Y = Orion(pipeline='aer', hyperparameters=hyperparameters_Y)

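# Fit the models separately; each receives both signal columns so that target_column (0 or 1) can select its target
orion_X.fit(train_data[['Timestamp', 'X-axis Acceleration', 'Y-axis Acceleration']])
orion_Y.fit(train_data[['Timestamp', 'X-axis Acceleration', 'Y-axis Acceleration']])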

I have an additional question regarding the test data. Do the test data require the same preprocessing as the training data? For your information, the test data has exactly the same initial CSV format as the data I used for training.

I would appreciate your insights on this matter. Your expertise and assistance are highly valued.

sarahmish commented 1 year ago

The two pipelines are trained separately; however, the correlation between the x-axis and y-axis data is taken into consideration, since each pipeline takes both time series as input during the training phase.

Yes, the test data should be in the same format as the training data. You will need to apply the same transformations to the timestamp column!
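A minimal sketch of that idea is to put the timestamp transformation in one function and call it on both sets, so they cannot drift apart. The function name to_integer_timestamps is hypothetical, and the raw values below are illustrative, assumed to be in seconds before scaling by 10^6.

```python
def to_integer_timestamps(rows, time_column="Timestamp", scale=10**6):
    """Return copies of the rows with the time column scaled and rounded to int."""
    return [
        {**row, time_column: round(row[time_column] * scale)}
        for row in rows
    ]


# Illustrative rows only; real data would come from the training and test CSV files.
train_rows = [
    {"Timestamp": 0.002986, "X-axis Acceleration": -0.109640, "Y-axis Acceleration": -0.048954},
    {"Timestamp": 0.007177, "X-axis Acceleration": -0.299223, "Y-axis Acceleration": 0.103052},
]
test_rows = [
    {"Timestamp": 0.002987, "X-axis Acceleration": 0.117860, "Y-axis Acceleration": 0.027049},
    {"Timestamp": 0.007191, "X-axis Acceleration": -0.375057, "Y-axis Acceleration": -0.010952},
]

# The same function, with the same scale, is applied to both sets.
train_prepared = to_integer_timestamps(train_rows)
test_prepared = to_integer_timestamps(test_rows)

# Both sets should end up with identical column names.
same_columns = set(train_prepared[0]) == set(test_prepared[0])
print(train_prepared[0]["Timestamp"], test_prepared[0]["Timestamp"], same_columns)
```

With a pandas DataFrame the same approach applies: wrap the column transformation in one function and call it on the training and test frames alike.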

makinno commented 1 year ago

Hi @sarahmish ,

I'm currently testing the Orion pipeline, and I encountered an issue with the results. When running the pipeline, the output displayed nothing. I'm unsure if this indicates that there are no anomalies in the data I provided or if there might be an issue with my training data.

Here's a snippet of the testing data format I used:

Timestamp  X-axis Acceleration  Y-axis Acceleration
     2987             0.117860             0.027049
     7191            -0.375057            -0.010952
    11339            -0.375057             0.065050
    15514            -0.033807            -0.010952
    19987            -0.640474             0.027049

I've also attached an image of the results for reference.

[Image: Screenshot 2023-11-26 230437]

Could someone please advise on whether the absence of results suggests no anomalies in the data or if there might be an issue with my training data? Thank you in advance for your assistance!
