sintel-dev / Orion

A machine learning library for detecting anomalies in signals.
https://sintel.dev/Orion/
MIT License

Data Leakage in training #540

Open kargarisaac opened 1 month ago

kargarisaac commented 1 month ago

Hey,

I'm trying to understand the TadGAN training procedure better. If I understood correctly, you don't train the model on normal/good data only: based on the examples, the model also sees some anomalous segments during training, without any labels. Leakage of abnormal data into the training dataset is a big problem for autoencoder-based anomaly detection systems, because the autoencoder learns to reconstruct even the abnormal data, which hurts detection performance.

Can we say that TadGAN is somewhat robust against leaking abnormal data into the training dataset? And what do you think is the reason?

Thank you

sarahmish commented 1 month ago

Hi @kargarisaac - thanks for raising an interesting question.

The assumption that the data is free from anomalies is a strong one to make, since in most cases users do not have labeled data a priori. TadGAN is completely unsupervised, as it does not require this distinction during training. Specifically, the model should not be able to reconstruct anomalies as well as normal sequences, because (1) we assume that anomalies lose their information during encoding; and (2) anomalies are scarce compared to the normal instances.
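
The intuition behind (1) and (2) can be sketched with a toy example (not Orion's actual API or TadGAN itself): a linear autoencoder built from a truncated SVD, fitted on windows of a signal that contains a few unlabeled anomalous spikes. Because the spikes are scarce, they barely shape the learned bottleneck, so their reconstruction error stays high even though they were present during "training". All names below (`window`, `anomaly_starts`, etc.) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A mostly-normal periodic signal with light noise.
t = np.arange(2000)
signal = np.sin(2 * np.pi * t / 37) + 0.05 * rng.standard_normal(t.size)

# Inject a few unlabeled anomalous spikes -- the "leakage" in question.
anomaly_starts = [400, 1100, 1700]
for i in anomaly_starts:
    signal[i:i + 5] += 4.0

# Non-overlapping windows; the whole (contaminated) set is used for fitting.
window = 50
X = signal.reshape(-1, window)

# Linear autoencoder via truncated SVD: a stand-in for an encoder/decoder
# with a narrow bottleneck (k << window forces lossy compression).
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
basis = Vt[:k]                              # "encoder" directions
recon = (Xc @ basis.T) @ basis + mu         # "decoder" output

errors = np.mean((X - recon) ** 2, axis=1)  # reconstruction error per window

# Windows containing spikes should score far above the rest despite the
# leakage: two components suffice for the dominant sine pattern, but the
# rare spikes are too scarce to claim a direction in the bottleneck.
anomalous = sorted(i // window for i in anomaly_starts)
flagged = sorted(np.argsort(errors)[-3:].tolist())
print(anomalous, flagged)
```

The key design point mirrors argument (2): the bottleneck is fitted to directions of highest variance across the whole dataset, and three spiky windows out of forty contribute too little variance to be encoded, so they come back with large residuals.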