ruiking04 / COCA

Deep Contrastive One-Class Time Series Anomaly Detection

Confusion about the threshold part #25

Open yingbaihu opened 9 months ago

yingbaihu commented 9 months ago

Hi, I've been reading your work recently. It's interesting and has inspired me a lot, thank you very much!

I have some questions; could you please help me understand?

I'm focusing on the UCR dataset because my situation is the same: there are no anomalies in the training data. I'm also quite new to this kind of dataset: you treat each element of a time-series sample as a single entity, while I usually treat one whole time series as a single entity, so I cannot follow some parts of your implementation.

1) About the threshold part: each time-series sample has one score vector, and you choose the highest score as the threshold. However, there are many time-series samples, which results in the same number of highest scores and therefore thresholds. In this case, how is the final threshold defined? Or how can the threshold for the test dataset be obtained without knowing the test data?

My intuition is that it relates to these lines:

test_affiliation, test_score, predict = ad_predict(test_target, test_score_origin, config.threshold_determine, config.detect_nu)
score_reasonable = tsad_reasonable(test_target, predict, config.time_step)

However, the input is test_score_origin, which is obtained from the test dataset. In evaluation mode you don't split the data into batches, which implies the total number of test samples should be smaller than the batch size (512), but I still cannot understand how the threshold is determined across several test samples. Or, in the UCR dataset, is there just one time series?

This part is a little abstract for me; could you please provide some explanation?

2) When you use MeanVarNormalize for the standardization step, you also involve the test data: mvn.train(train_time_series_ts + test_time_series_ts). Won't this operation cause data leakage?

ruiking04 commented 9 months ago

Thank you for your interest in our work.

  1. In time series anomaly detection, UCR is a good dataset, but its drawback is that it is hard to carve out a validation set, because there is only one anomalous segment in the test set. The UCR authors also acknowledge this problem. Therefore, for UCR we regard the sample with the largest anomaly score as the anomalous sample. The F1 obtained this way is the best F1 over all thresholds. If the anomaly scores produced by a model are not very distinguishable, many samples share the maximum score; in that case precision is naturally low and F1 is very poor. The threshold in classification tasks really is difficult to determine, so the commonly used metrics are best F1 or AUC. For UCR, you only need to take the maximum value; for datasets such as KPI that contain many anomalous segments, you can search for a threshold over the 0-100% quantiles of the scores, or convert the scores to Z-scores and search within (-3, 3) (a sketch of both strategies is given at the end of this comment). I didn't understand "In the evaluation mode, you didn't split some batches, it means the whole test dataset sample number should be smaller than batch size (512)". Why would it be less than 512?

  2. We observed that the means and variances of some training sets and test sets are very different; strictly speaking, their distributions differ. This is a concept drift problem. Therefore, both the training set and the test set were considered during normalization, and there is indeed a data leakage issue. Our follow-up work found that on the UCR and KPI datasets, using only the training set's statistics also achieves good performance (a second sketch below illustrates train-only standardization).
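
A minimal sketch of the two thresholding strategies described in point 1, with illustrative helper names (these are not the repository's ad_predict; labels and scores hold one entry per sample/window):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_over_quantiles(labels, scores, quantiles=np.linspace(0.0, 1.0, 101)):
    # Sweep quantile thresholds over the anomaly scores and keep the best F1.
    best_f1, best_thr = 0.0, None
    for q in quantiles:
        thr = np.quantile(scores, q)
        pred = (scores >= thr).astype(int)
        f1 = f1_score(labels, pred, zero_division=0)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_f1, best_thr

def flag_max_score(scores):
    # UCR-style: only the window(s) whose score equals the maximum are flagged.
    scores = np.asarray(scores)
    return (scores >= scores.max()).astype(int)
```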
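
And a minimal sketch of standardization that fits statistics on the training split only, avoiding the leakage discussed in point 2 (the class name is a placeholder, not the repository's MeanVarNormalize):

```python
import numpy as np

class TrainOnlyNormalizer:
    # Standardize with mean/std computed on the training split only.
    def fit(self, train_series):
        self.mean = float(np.mean(train_series))
        self.std = float(np.std(train_series)) + 1e-8  # guard against zero variance
        return self

    def transform(self, series):
        return (np.asarray(series) - self.mean) / self.std
```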

yingbaihu commented 9 months ago

Thanks for the reply; I still have some questions.

  1. "because there is only one abnormal fragment in the test set", so in UCR dataset, this one abnormal fragment is a continuous series, and in your case, each data point in this series is considered as a sample? if yes, when you reply me, can you convert sample to sample (data point)? because I usually call one time-series as one sample, so I have some confusion about this thing.
  2. Here is my understanding of the UCR dataset; can you correct whatever I have wrong? The train and validation sets contain no anomalous samples, and each sample has length 64 (64 points). In the test set, if the entire time series of each sub-dataset is flattened (assume length 5000), there is only one anomalous segment (many continuous data points, assume length 300).
  3. "If the anomaly scores obtained by some models are not very distinguishable, the anomaly scores of many samples are the maximum value." can you explain what means anomaly scores are not very distinguishable? the thing I don't understand is why the anomaly scores of many samples are the maximum value? why the same maximum value will show a lot of time and continuously? because based on the Figure 3. (b), I think the anomaly segment is a continous sub-series (length around 200-400) in one long time-series (the length over 4000)
  4. About the batch size question, I misunderstood that part; please just ignore it.
  5. Wish you all the best in your follow-up work!
yingbaihu commented 9 months ago

And one more question: why do you introduce the soft-boundary invariance? Is it just because the training dataset contains anomalies, or is there another reason? If the training dataset has no anomalies, is the soft-boundary invariance still suitable?

ruiking04 commented 9 months ago
  1. Our model generates anomaly scores per time window, that is, one anomaly score per sample/time window. So we need to convert the dataset's original point-wise labels into window-based labels (a small sketch of this conversion follows after this list). Only if the time step is set to 1 can the scores be computed point by point, and that computation is too slow. Reconstruction-based methods (e.g., LSTM-ED) can naturally generate anomaly scores point by point.

  2. Because the KPI training set contains anomalies. And although the training sets of some datasets have no labeled anomalies, we found experimentally that the soft boundary is still more effective there; we suspect those sets contain some noisy data.
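
A minimal sketch of the point-to-window label conversion mentioned in point 1; the aggregation rule (a window is anomalous if it contains any anomalous point) and the parameter names are assumptions for illustration, not necessarily the exact logic in the repository:

```python
import numpy as np

def point_labels_to_window_labels(point_labels, window_size, time_step):
    # Slide a window over the point-wise labels with stride `time_step`;
    # mark a window anomalous if it contains at least one anomalous point.
    point_labels = np.asarray(point_labels)
    window_labels = [
        int(point_labels[start:start + window_size].any())
        for start in range(0, len(point_labels) - window_size + 1, time_step)
    ]
    return np.array(window_labels)
```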

yingbaihu commented 8 months ago

Hi, sorry to bother you again; I still have one question about computing center_c. When you compute center_c, and also during the training stage, you combine the original data with the two augmented views. Can I ask why? And when computing center_c you divide by 2, 'c /= (2 * n_samples)', even though you tripled the data size. Could you explain this part to me?

ruiking04 commented 8 months ago

We input the original and the augmented data into the model to get 'outputs' and 'dec', and then 'c /= (2 * n_samples)'.


When training the model, both the original data and the augmented data are used. So, naturally, the augmented data is also required when computing the center 'center_c'.

yingbaihu commented 8 months ago

Emmm, I mean: why divide by 2 * n_samples (c /= 2 * n_samples) and not 3 * n_samples?

ruiking04 commented 8 months ago
all_data = torch.cat((data, aug1, aug2), dim=0)
outputs, dec = model(all_data)
n_samples += outputs.shape[0]
all_feature = torch.cat((outputs, dec), dim=0)
c += torch.sum(all_feature, dim=0)

'all_data' already contains 'data', 'aug1', and 'aug2'. Both 'outputs' and 'dec' are used to calculate the center 'c', and 'n_samples' is just the number of rows in 'outputs'. Since 'all_feature' stacks 'outputs' and 'dec', it contains 2 * n_samples rows, so 'c' needs to be divided by 2 * n_samples.
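
To make the counting explicit, here is a small self-contained sketch; the batch size, feature dimension, and random tensors are placeholders for the real encoder outputs:

```python
import torch

B, D = 4, 8                                      # placeholder batch size and feature dim
outputs = torch.randn(3 * B, D)                  # features of (data, aug1, aug2): 3B rows
dec = torch.randn(3 * B, D)                      # second representation of the same 3B inputs

n_samples = outputs.shape[0]                     # 3B
all_feature = torch.cat((outputs, dec), dim=0)   # 2 * 3B = 6B rows
c = torch.sum(all_feature, dim=0)

c /= (2 * n_samples)                             # divide by the 6B rows actually summed
assert torch.allclose(c, all_feature.mean(dim=0))
```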