sintel-dev / Orion

Library for detecting anomalies in signals
https://sintel.dev/Orion/
MIT License

Reproduction of the TadGAN results on the NASA telemetry datasets, as given in the corresponding research paper #521

Open Chiradipb02 opened 8 months ago

Chiradipb02 commented 8 months ago

Description

I am trying to reproduce the results of the TadGAN model proposed in the paper 'TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks' and perform benchmarking. What hyperparameter values should I use to reproduce the results efficiently on the SMAP and MSL spacecraft datasets? Or, if trained model weights are available, how can I use them and where can I find them?

What I Did

I am currently using the hyperparameters given in the tadgan_smap.json file and the tadgan pipeline, but training for even 35 epochs is quite time-consuming and expensive on Colab.

Using the tadgan pipeline

from orion.analysis import analyze
from orion import Orion

hyperparameters_smap = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "time_column": "timestamp",
        "interval": 21600,
        "method": "mean"
    },
    "orion.primitives.tadgan.TadGAN#1": {
        "epochs": 35
    },
    "orion.primitives.tadgan.score_anomalies#1": {
        "rec_error_type": "dtw",
        "comb": "mult"
    },
    'sklearn.preprocessing.MinMaxScaler#1': {
        'feature_range': (-1, 1)
    }
}

orion_S1 = Orion(
    pipeline='tadgan',  # <-- using the 'tadgan' pipeline
    hyperparameters=hyperparameters_smap
)

from orion.evaluation.contextual import contextual_accuracy, contextual_f1_score

metrics = [
    'f1',
    'recall',
    'precision',
    'mse'
]

scores = orion_S1.evaluate(s1_data, known_anomalies, fit=True, metrics=metrics)

/usr/local/lib/python3.10/dist-packages/sklearn/impute/_base.py:382: FutureWarning: The 'verbose' parameter was deprecated in version 1.1 and will be removed in 1.3. A warning will always be raised upon the removal of empty columns in the future version.
  raise ValueError(
Epoch: 1/35, Losses: {'cx_loss': -3.8102, 'cz_loss': -6.1769, 'eg_loss': 14.5929}
Epoch: 2/35, Losses: {'cx_loss': -27.2824, 'cz_loss': -17.5141, 'eg_loss': 12.9289}
Epoch: 3/35, Losses: {'cx_loss': -61.2001, 'cz_loss': 4.8052, 'eg_loss': -42.0269}
Epoch: 4/35, Losses: {'cx_loss': -75.2493, 'cz_loss': 1.3807, 'eg_loss': -93.7864}
Epoch: 5/35, Losses: {'cx_loss': -88.7766, 'cz_loss': -4.012, 'eg_loss': -224.5672}
Epoch: 6/35, Losses: {'cx_loss': -172.3133, 'cz_loss': -1.4226, 'eg_loss': -512.4106}
Epoch: 7/35, Losses: {'cx_loss': -251.6021, 'cz_loss': 0.2382, 'eg_loss': -1304.6923}

Two of the losses are diverging.

Using the tadgan.json pipeline

from orion.analysis import analyze
from orion import Orion

hyperparameters_smap = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "time_column": "timestamp",
        "interval": 21600,
        "method": "mean"
    },
    "orion.primitives.tadgan.TadGAN#1": {
        "epochs": 35
    },
    "orion.primitives.tadgan.score_anomalies#1": {
        "rec_error_type": "dtw",
        "comb": "mult"
    },
    'sklearn.preprocessing.MinMaxScaler#1': {
        'feature_range': (-1, 1)
    }
}

orion_S1 = Orion(
    pipeline='tadgan.json',  # <-- using the 'tadgan.json' pipeline
    hyperparameters=hyperparameters_smap
)

from orion.evaluation.contextual import contextual_accuracy, contextual_f1_score

metrics = [
    'f1',
    'recall',
    'precision'
]

scores = orion_S1.evaluate(s1_data, known_anomalies, fit=True, metrics=metrics)

/usr/local/lib/python3.10/dist-packages/sklearn/impute/_base.py:382: FutureWarning: The 'verbose' parameter was deprecated in version 1.1 and will be removed in 1.3. A warning will always be raised upon the removal of empty columns in the future version.
  raise ValueError(
Epoch: 1/35, Losses: {'cx_loss': -4.8175, 'cz_loss': -4.2402, 'eg_loss': 9.4242}
Epoch: 2/35, Losses: {'cx_loss': -83.4143, 'cz_loss': 0.4275, 'eg_loss': -11.9605}
Epoch: 3/35, Losses: {'cx_loss': -118.1121, 'cz_loss': 1.6406, 'eg_loss': -82.1093}
Epoch: 4/35, Losses: {'cx_loss': -96.1586, 'cz_loss': 1.2744, 'eg_loss': -240.1017}
Epoch: 5/35, Losses: {'cx_loss': -171.8538, 'cz_loss': 2.1408, 'eg_loss': -488.9143}
Epoch: 6/35, Losses: {'cx_loss': -41.5881, 'cz_loss': -4.3703, 'eg_loss': -728.5376}
Epoch: 7/35, Losses: {'cx_loss': -50.8743, 'cz_loss': 2.2937, 'eg_loss': -936.7202}
Epoch: 8/35, Losses: {'cx_loss': -100.3114, 'cz_loss': -2.4848, 'eg_loss': -517.6544}
Epoch: 9/35, Losses: {'cx_loss': -99.0258, 'cz_loss': 2.6696, 'eg_loss': -1218.6364}
Epoch: 10/35, Losses: {'cx_loss': -21.0454, 'cz_loss': 1.6388, 'eg_loss': -101.3644}
Epoch: 11/35, Losses: {'cx_loss': -46.9198, 'cz_loss': -1.7119, 'eg_loss': -350.9029}
Epoch: 12/35, Losses: {'cx_loss': -6.086, 'cz_loss': 2.2132, 'eg_loss': -1045.5497}
Epoch: 13/35, Losses: {'cx_loss': -391.7562, 'cz_loss': -1.0262, 'eg_loss': -677.0828}
Epoch: 14/35, Losses: {'cx_loss': 194.6997, 'cz_loss': 2.7028, 'eg_loss': -686.2322}
Epoch: 15/35, Losses: {'cx_loss': -1681.5021, 'cz_loss': 2.1015, 'eg_loss': -1501.0836}
Epoch: 16/35, Losses: {'cx_loss': -2299.2453, 'cz_loss': -0.7873, 'eg_loss': -1303.4165}
Epoch: 17/35, Losses: {'cx_loss': -1361.4151, 'cz_loss': 1.9527, 'eg_loss': -1605.7816}
Epoch: 18/35, Losses: {'cx_loss': -177.5991, 'cz_loss': 1.1469, 'eg_loss': -1359.632}

The runtime gets disconnected partway through.

Other Approach

There is a txt file in the NASA dataset zip link given in the paper, and that txt file contains some model parameters as well. Also, in the models folder there are .h5 files for each dataset file. I tried to load one of them into the TadGAN model, after preprocessing the data as given in the Tulog.ipynb:

tgan_trained = tf.keras.models.load_model('/content/S-1.h5')

Training data input shape: (10124, 25, 1)
Training data index shape: (10124,)
Training y shape: (10124, 1)
Training y index shape: (10124,)

There was a dimension error, as the required input shape appeared to be (None, None, 25), so I had to reshape the data:

# reshape from (n, 25, 1) to (n, 1, 25) to match the model's expected (None, None, 25) input
x_new = X.reshape((X.shape[0], X.shape[-1], X.shape[1]))

# reconstruct
X_hat = tgan_trained.predict(x_new)

# visualize X_hat
y_hat = unroll_ts(X_hat)
print(y_hat)

317/317 [==============================] - 1s 2ms/step
[-0.07230038 -0.06876311 -0.05986287 ... -0.06961043 -0.06137734 -0.06368994]

X_hat shape: (10124, 10) y_hat shape: (10124,)

[Image: model_weight_test, the reconstruction from the trained model]

What can I do?

notebook link: https://colab.research.google.com/drive/1zahCbCImRuL2_Hc-ms1WSZl7oUyP32Q3?usp=sharing

Chiradipb02 commented 8 months ago

After completing the training for 35 epochs:

Epoch: 1/35, Losses: {'cx_loss': -4.8175, 'cz_loss': -4.2402, 'eg_loss': 9.4242}
Epoch: 2/35, Losses: {'cx_loss': -83.4143, 'cz_loss': 0.4275, 'eg_loss': -11.9605}
Epoch: 3/35, Losses: {'cx_loss': -118.1121, 'cz_loss': 1.6406, 'eg_loss': -82.1093}
Epoch: 4/35, Losses: {'cx_loss': -96.1586, 'cz_loss': 1.2744, 'eg_loss': -240.1017}
Epoch: 5/35, Losses: {'cx_loss': -171.8538, 'cz_loss': 2.1408, 'eg_loss': -488.9143}
Epoch: 6/35, Losses: {'cx_loss': -41.5881, 'cz_loss': -4.3703, 'eg_loss': -728.5376}
Epoch: 7/35, Losses: {'cx_loss': -50.8743, 'cz_loss': 2.2937, 'eg_loss': -936.7202}
Epoch: 8/35, Losses: {'cx_loss': -100.3114, 'cz_loss': -2.4848, 'eg_loss': -517.6544}
Epoch: 9/35, Losses: {'cx_loss': -99.0258, 'cz_loss': 2.6696, 'eg_loss': -1218.6364}
Epoch: 10/35, Losses: {'cx_loss': -21.0454, 'cz_loss': 1.6388, 'eg_loss': -101.3644}
Epoch: 11/35, Losses: {'cx_loss': -46.9198, 'cz_loss': -1.7119, 'eg_loss': -350.9029}
Epoch: 12/35, Losses: {'cx_loss': -6.086, 'cz_loss': 2.2132, 'eg_loss': -1045.5497}
Epoch: 13/35, Losses: {'cx_loss': -391.7562, 'cz_loss': -1.0262, 'eg_loss': -677.0828}
Epoch: 14/35, Losses: {'cx_loss': 194.6997, 'cz_loss': 2.7028, 'eg_loss': -686.2322}
Epoch: 15/35, Losses: {'cx_loss': -1681.5021, 'cz_loss': 2.1015, 'eg_loss': -1501.0836}
Epoch: 16/35, Losses: {'cx_loss': -2299.2453, 'cz_loss': -0.7873, 'eg_loss': -1303.4165}
Epoch: 17/35, Losses: {'cx_loss': -1361.4151, 'cz_loss': 1.9527, 'eg_loss': -1605.7816}
Epoch: 18/35, Losses: {'cx_loss': -177.5991, 'cz_loss': 1.1469, 'eg_loss': -1359.632}
Epoch: 19/35, Losses: {'cx_loss': -3798.2821, 'cz_loss': 2.6662, 'eg_loss': -2312.6722}
Epoch: 20/35, Losses: {'cx_loss': -10793.4622, 'cz_loss': 0.8595, 'eg_loss': -4401.6898}
Epoch: 21/35, Losses: {'cx_loss': 4510.4191, 'cz_loss': 0.6463, 'eg_loss': -3757.3169}
Epoch: 22/35, Losses: {'cx_loss': 54413.2969, 'cz_loss': -2.2831, 'eg_loss': -4157.8031}
Epoch: 23/35, Losses: {'cx_loss': 3813.4309, 'cz_loss': -0.0306, 'eg_loss': -3582.8478}
Epoch: 24/35, Losses: {'cx_loss': 1734.2098, 'cz_loss': 2.5015, 'eg_loss': -3394.293}
Epoch: 25/35, Losses: {'cx_loss': 1301.2073, 'cz_loss': -1.1375, 'eg_loss': -5869.3654}
Epoch: 26/35, Losses: {'cx_loss': 3266.5693, 'cz_loss': 2.1572, 'eg_loss': -13029.8635}
Epoch: 27/35, Losses: {'cx_loss': 2313.6272, 'cz_loss': -0.0213, 'eg_loss': -18815.4439}
Epoch: 28/35, Losses: {'cx_loss': 3218.0648, 'cz_loss': 3.0745, 'eg_loss': -25817.8698}
Epoch: 29/35, Losses: {'cx_loss': 3368.6121, 'cz_loss': 3.1033, 'eg_loss': -38295.3714}
Epoch: 30/35, Losses: {'cx_loss': 535.1069, 'cz_loss': -1.5381, 'eg_loss': -43057.3359}
Epoch: 31/35, Losses: {'cx_loss': 485.7542, 'cz_loss': 2.392, 'eg_loss': -44885.8762}
Epoch: 32/35, Losses: {'cx_loss': 137.9176, 'cz_loss': 1.2915, 'eg_loss': -46512.0878}
Epoch: 33/35, Losses: {'cx_loss': 27.0721, 'cz_loss': 1.6871, 'eg_loss': -46914.7306}
Epoch: 34/35, Losses: {'cx_loss': 71.8974, 'cz_loss': 3.2924, 'eg_loss': -47328.6803}
Epoch: 35/35, Losses: {'cx_loss': 13.1489, 'cz_loss': -0.9914, 'eg_loss': -47412.2166}
315/315 [==============================] - 35s 109ms/step
315/315 [==============================] - 45s 137ms/step
315/315 [==============================] - 8s 25ms/step

sarahmish commented 7 months ago

Hi @Chiradipb02 – thanks for opening an issue and using Orion!

To run the benchmark on the NASA datasets, you can use our benchmarking script, which will automatically load the necessary hyperparameter settings, i.e. tadgan_smap.json and tadgan_msl.json.

from orion.benchmark import benchmark, BENCHMARK_DATA

datasets = {
    "MSL": BENCHMARK_DATA["MSL"],
    "SMAP": BENCHMARK_DATA["SMAP"]
}
pipelines = {"tadgan": "tadgan"}

scores = benchmark(pipelines=pipelines, datasets=datasets)

You will need some compute for this to complete in a decent time.

You can also find the latest results of the benchmark (which we run every release) in the details Google Sheets document, and the summarized results can also be browsed in the summary Google Sheets document.

Chiradipb02 commented 7 months ago

Thank you @sarahmish for your response. I tried to run the code,

from orion.benchmark import benchmark, BENCHMARK_DATA

datasets = {
    "MSL": BENCHMARK_DATA["MSL"],
    "SMAP": BENCHMARK_DATA["SMAP"]
}
pipelines = {"tadgan": "tadgan"}

scores = benchmark(pipelines=pipelines, datasets=datasets)

but an error occurs for each of the datasets.

ERROR:mlblocks.mlpipeline:Exception caught fitting MLBlock sklearn.preprocessing.MinMaxScaler#1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mlblocks/mlpipeline.py", line 644, in _fit_block
    block.fit(**fit_args)
  File "/usr/local/lib/python3.10/dist-packages/mlblocks/mlblock.py", line 311, in fit
    getattr(self.instance, self.fit_method)(**fit_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py", line 427, in fit
    `n_samples` or because X is read from a continuous stream.
  File "/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_data.py", line 450, in partial_fit
    if sparse.issparse(X):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
    self._check_n_features(X, reset=reset)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
sklearn.utils._param_validation.InvalidParameterError: The 'feature_range' parameter of MinMaxScaler must be an instance of 'tuple'. Got [-1, 1] instead.
ERROR:orion.benchmark:Exception scoring pipeline <mlblocks.mlpipeline.MLPipeline object at 0x7ab63e38a8f0> on signal M-6 (test split: True), error The 'feature_range' parameter of MinMaxScaler must be an instance of 'tuple'. Got [-1, 1] instead..

for which the output is:

   pipeline  rank dataset signal  iteration  accuracy  f1  recall  precision  \
0    tadgan     1     MSL    M-6          0         0   0       0          0   
1    tadgan     2     MSL    M-1          0         0   0       0          0   
2    tadgan     3    SMAP    G-3          0         0   0       0          0   
3    tadgan     4    SMAP    P-4          0         0   0       0          0   
4    tadgan     5    SMAP    F-1          0         0   0       0          0   
..      ...   ...     ...    ...        ...       ...  ..     ...        ...   
75   tadgan    76     MSL    M-7          0         0   0       0          0   
76   tadgan    77     MSL   D-16          0         0   0       0          0   
77   tadgan    78     MSL   D-15          0         0   0       0          0   
78   tadgan    79     MSL   P-11          0         0   0       0          0   
79   tadgan    80    SMAP    F-3          0         0   0       0          0   

   status    elapsed  split      run_id  
0   ERROR  11.174287   True  d4758923-4  
1   ERROR   1.174106   True  d4758923-4  
2   ERROR   1.141722   True  d4758923-4  
3   ERROR   1.600665   True  d4758923-4  
4   ERROR   1.554191   True  d4758923-4  
..    ...        ...    ...         ...  
75  ERROR   0.710938   True  d4758923-4  
76  ERROR   0.675812   True  d4758923-4  
77  ERROR   0.946997   True  d4758923-4  
78  ERROR   1.777963   True  d4758923-4  
79  ERROR   1.651411   True  d4758923-4  

Is it occurring because some values in the datasets are equal to -1 and 1, or due to the absence of

'sklearn.preprocessing.MinMaxScaler#1': {
        'feature_range': (-1, 1)
    }

in tadgan_msl.json and tadgan_smap.json?

sarahmish commented 7 months ago

The issue is solvable when you downgrade sklearn to 'scikit-learn<1.2'. From version 1.2 onward, sklearn forces feature_range to be a tuple rather than a list, whilst the .json files only support lists.

Please make sure to install a compatible version of sklearn with pip install 'scikit-learn>=0.22.1,<1.2', which should fix the issue above!
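
As a rough sketch of the two workarounds: either pin scikit-learn as above, or, when constructing the pipeline from Python rather than through the benchmark script, override feature_range with a tuple (the JSON limitation does not apply there), as your first snippet already does.

# Option 1: pin a compatible scikit-learn release
#   pip install 'scikit-learn>=0.22.1,<1.2'

# Option 2 (sketch): override the hyperparameter from Python with a tuple
from orion import Orion

hyperparameters = {
    "sklearn.preprocessing.MinMaxScaler#1": {
        "feature_range": (-1, 1)  # a tuple, not a list
    },
    "orion.primitives.tadgan.TadGAN#1": {
        "epochs": 35
    }
}

orion = Orion(pipeline='tadgan', hyperparameters=hyperparameters)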

Chiradipb02 commented 7 months ago

Thank you @sarahmish for your help, and sorry for the late reply. The code is running properly and giving the appropriate results, but taking a really long time.

One more question: when I try to use the model on another dataset in a Kaggle notebook, what hyperparameters should I choose so that the number of false positives is reduced and range-like anomalies can be detected in addition to point anomalies?

On a meter-reading time series dataset, I have used the tadgan pipeline with the tadgan_smap.json parameters (as the signals appeared to be somewhat similar):

hyperparameters_334_61 = {
    "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
        "time_column": "timestamp",
        "interval": 900,  # interval between two successive timestamps in the dataframe is 900
        "method": "mean"
    },
    "sklearn.preprocessing.MinMaxScaler#1": {
        "feature_range": (-1, 1)
    },
    "orion.primitives.tadgan.TadGAN#1": {
        "epochs": 10
    },
    # newly added hyperparameter
    "orion.primitives.tadgan.score_anomalies#1": {
        "rec_error_type": "dtw",
        "comb": "mult"
    }
}

orion_334_61 = Orion(
    pipeline='tadgan.json',
    hyperparameters=hyperparameters_334_61
)

For 10 epochs on a particular portion of the dataframe, of shape (18286, 2) in Orion format: [image: 334_61_part_10_epoch_interval_900_colab]

For 25 epochs on the same dataframe: [image: 334_61_part_25_epoch_interval_900]

For 5 epochs on the whole dataset of shape (75224, 2): [image: 334_61_whole_5_epoch_interval_900]. Only for this run, while training, the eg_loss was consistently becoming more negative:

Epoch: 1/5, Losses: {'cx_loss': -1.4802, 'cz_loss': -0.4702, 'eg_loss': -20.8699}
Epoch: 2/5, Losses: {'cx_loss': -0.7637, 'cz_loss': 2.2239, 'eg_loss': -74.8726}
Epoch: 3/5, Losses: {'cx_loss': -0.6252, 'cz_loss': 2.403, 'eg_loss': -134.6424}
Epoch: 4/5, Losses: {'cx_loss': -0.593, 'cz_loss': 2.4117, 'eg_loss': -158.1663}
Epoch: 5/5, Losses: {'cx_loss': -1.0072, 'cz_loss': 2.3576, 'eg_loss': -263.0882}
2493/2493 [==============================] - 168s 67ms/step
2493/2493 [==============================] - 188s 75ms/step
2493/2493 [==============================] - 28s 11ms/step

For the other two runs, the losses did not cross -70.

But the anomalous part is mainly the lower, flat part of the signal. I have tried the fixed threshold parameter (True and False) with a similar number of epochs, but no improvement was observed. What parameter values can be used here in general?

sarahmish commented 7 months ago

The critic loss is unbounded, so it makes sense to see variance between one time series and another. If you'd like, you can set detailed=True when training the model so that you can also observe more intuitive losses, such as the mean squared error.
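
A minimal sketch of where that flag could be set, assuming detailed is exposed as a hyperparameter of the TadGAN primitive in the same way as epochs (if it is instead an argument of fit, pass it there):

hyperparameters = {
    "orion.primitives.tadgan.TadGAN#1": {
        "epochs": 35,
        "detailed": True  # assumption: also reports the individual loss components
    }
}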

To reduce or extend the range of the detected anomalies, there is a hyperparameter called anomaly_padding that defines how many data points to include before and after the point that was considered anomalous. To remove any padding, set the value to zero:

hyperparameters = {
    'orion.primitives.timeseries_anomalies.find_anomalies#1': {
        'anomaly_padding': 0 # set to 50 by default
    }
}

For more information, visit the primitive page for find_anomalies.
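
As an illustrative sketch (not the only way to do it), this can be combined with the TadGAN hyperparameters used earlier in the thread and passed to Orion as before:

from orion import Orion

hyperparameters = {
    "orion.primitives.tadgan.TadGAN#1": {
        "epochs": 10
    },
    "orion.primitives.timeseries_anomalies.find_anomalies#1": {
        "anomaly_padding": 0  # default is 50; 0 removes the padding entirely
    }
}

orion = Orion(
    pipeline='tadgan',
    hyperparameters=hyperparameters
)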

If all your anomalies look like the flat part of the signal, I think there are simpler algorithms that you can try that are also faster than TadGAN.
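
For instance, a minimal sketch with a lighter pipeline (the pipeline names 'lstm_dynamic_threshold' and 'aer' are assumptions based on the pipelines shipped with Orion; train_data and test_data are placeholders):

from orion import Orion

# a lighter and faster alternative to TadGAN
orion = Orion(pipeline='lstm_dynamic_threshold')  # or pipeline='aer'

orion.fit(train_data)
anomalies = orion.detect(test_data)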

Chiradipb02 commented 6 months ago

Thank you @sarahmish for your reply. I am now using the AER model for anomaly detection. So far it is much faster than TadGAN and gives better results. But while trying the approach given in the tulog for TadGAN, to see how the primitives work, I found that defining the AER model requires the parameters layers_encoder and layers_decoder,

and some other hyperparameters. What layer architectures can I pass as parameters to build the model?

To get the intermediate outputs, is there any option like setting visualization=True in the detect() method, as there was for TadGAN in the tulog?

Chiradipb02 commented 6 months ago

Can you please tell me what the functionalities of the hyperparameters 'lower_threshold' and 'min_percent' are in the find_anomalies primitive?