sintel-dev / Orion

A machine learning library for detecting anomalies in signals.
https://sintel.dev/Orion/
MIT License

Different F1 score for same signal for different time_segments_aggregate interval #511

Open Sid-030591 opened 5 months ago

Sid-030591 commented 5 months ago

Description

I am using the AER pipeline to detect anomalies on a synthetic dataset that I created. The dataset follows MA(1) characteristics, with 7 anomalies added at random instants. The timestamp is sampled at 1 hour (3600 seconds). When I run this with time_segments_aggregate at a 3600 interval, only 1 out of 7 anomalies is detected and the run takes around 15 minutes. On the contrary, when I run the same dataset with time_segments_aggregate at a 21600 interval, all 7 anomalies are detected in around 3 minutes. Could you please explain how the interval value actually impacts the F1 score? I can understand its impact on the time taken.

What I Did

import os

import pandas as pd
from tqdm import tqdm

from orion import Orion

# folder_path and csv_files (the list of CSV file names) are defined earlier.
for file in tqdm(csv_files):
    file_path = os.path.join(folder_path, file)
    our_data = pd.read_csv(file_path)

    # Keep only the two columns the pipeline expects.
    our_data = our_data[['timestamp', 'value']]

    hyperparameters = {
        "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
            "time_column": "timestamp",
            "interval": 21600,
            "method": "mean"
        },
        'orion.primitives.aer.AER#1': {
            'epochs': 35,
            'verbose': False
        }
    }

    orion = Orion(
        pipeline='aer',
        hyperparameters=hyperparameters
    )

    # fit_detect is indented into the loop so that every file is processed.
    anomalies = orion.fit_detect(our_data)
sarahmish commented 5 months ago

Hi @Sid-030591, thank you for using Orion!

Please refer to the documentation of time_segments_aggregate to see how the aggregation is done.

With interval=21600 you will aggregate 6 hours into a single value, making your time series shorter, and thus the model runs faster.
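As a rough illustration of the effect, here is a minimal sketch in plain pandas (not the actual mlstars primitive; the data is made up):

import numpy as np
import pandas as pd

# A made-up 1-hour series: 24 hourly points spanning one day.
timestamps = np.arange(0, 24 * 3600, 3600)
df = pd.DataFrame({'timestamp': timestamps, 'value': np.random.randn(24)})

# Mean-aggregating with interval=21600 averages every 6 hourly points,
# so the 24-point series collapses to 4 points.
bins = (df['timestamp'] // 21600) * 21600
aggregated = df.groupby(bins)['value'].mean().reset_index()
print(len(df), '->', len(aggregated))  # 24 -> 4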

As for the detection performance, can you provide a snippet of what the input and output look like? How many anomalous intervals did the pipeline detect?

Sid-030591 commented 5 months ago

Behavior_description.docx

Hello @sarahmish, thank you for your response. I have described my observations in the attached document. They turned out to be different from what I had initially thought. Nevertheless, this seems interesting to me. Please let me know your understanding of this.

sarahmish commented 5 months ago

Thanks for the description @Sid-030591!

Your reported results make sense: when the threshold is fixed, we always capture the same extreme values (4 standard deviations away from the mean) and therefore obtain the same result. However, when the threshold is dynamic (fixed_threshold=False), the results will change, as there is an element of randomness in this approach.
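To make the fixed case concrete, here is a minimal sketch of a 4-standard-deviation rule (mirroring the description above, not the exact Orion implementation):

import numpy as np

def fixed_threshold_anomalies(errors, k=4):
    # Flag points whose error lies more than k standard deviations from
    # the mean. Given the same errors, this rule is fully deterministic.
    mean, std = errors.mean(), errors.std()
    return np.flatnonzero(np.abs(errors - mean) > k * std)

errors = np.random.randn(1000)
errors[[100, 500]] += 10
print(fixed_threshold_anomalies(errors))  # typically -> [100 500]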

I hope that this answers your question!

Sid-030591 commented 5 months ago

Thank you @sarahmish for the answer. I have a few follow-up questions:

1) I had a quick glance at the find_threshold function where this is handled. Could you give a quick idea of the logic there? I can see you are calculating some sort of cost function and optimizing it to get the best z. I am basically trying to understand the source of randomness in this logic. Also, is it because of this randomness that you chose the fixed-k variant as your default method, or is there another reason?

2) For the purpose of reproducibility, is there any way to use a seed to get consistent results?

3) For the purpose of benchmarking, I would like to know how you do the training and testing. For example, for the NAB dataset (say Tweets) or the adex dataset, the total number of datapoints is not large. Do you train on the whole time series and then test on the same series, or is there a general train/test split that you maintain? I could not find this part in the code.

4) I would also like to know how we can run the pipeline on time series that have multiple columns (features). I saw a previous issue where something similar was answered. Say I have a dataset with 50 columns. Fitting the pipeline on each column individually is an overhead; of course, I can loop over the columns, but I wanted to understand whether the current implementation throws an error for multiple columns or would still run fine, and whether there would be a difference when the columns are independent versus correlated.

Points 3 and 4 are not actual follow-ups on the initial topic, but rather than opening a new issue, I thought I would ask here. I hope that is OK. I have been doing some work in the anomaly detection area, hence all these questions.

Sid-030591 commented 5 months ago

@sarahmish I think I found the answer to the 3rd and 4th questions. For the 3rd: 0.2 is the validation split that you use. For the 4th: we can train the pipeline on multiple columns, and anomalies will be found on a univariate basis (a single target column has to be provided). Please confirm if this understanding is correct. Also, I would appreciate it if you could answer the 1st and 2nd questions.
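For reference, looping over columns (question 4) could look like the following minimal sketch (the file name and column layout are hypothetical):

import pandas as pd
from orion import Orion

# Hypothetical dataset: a 'timestamp' column plus many feature columns.
df = pd.read_csv('multivariate.csv')

results = {}
for column in df.columns.drop('timestamp'):
    # Treat each feature as its own univariate series.
    series = df[['timestamp', column]].rename(columns={column: 'value'})
    orion = Orion(pipeline='aer')
    results[column] = orion.fit_detect(series)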

sarahmish commented 4 months ago

Hi @Sid-030591, I was referring to the randomness of the model and, consequently, of the error values.

find_anomalies with a dynamic threshold is more sensitive to changes from one run to another, whereas when you make the threshold fixed, it becomes more consistent.

I would first recommend referring to issue #375 for some detail on find_anomalies. I also made a notebook so you can see how the error values change from one run to another; as a result, the detection is substantially different when fixed_threshold=False, while being more consistent when fixed_threshold=True. I also want to note that with every run you'll get something different.
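Pinning the threshold can be done through the pipeline hyperparameters, along the lines of this sketch (the primitive path is assumed from the standard Orion pipelines and may differ across versions):

from orion import Orion

# Sketch: select the fixed (more consistent) thresholding variant.
# The primitive path below is an assumption and may vary by version.
hyperparameters = {
    'orion.primitives.timeseries_anomalies.find_anomalies#1': {
        'fixed_threshold': True
    }
}
orion = Orion(pipeline='aer', hyperparameters=hyperparameters)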

Let me know if you have other questions!

Sid-030591 commented 4 months ago

Hello @sarahmish, thank you for your response. I understand the point with respect to randomness in the model and in the post-processing step. I would like to raise one more point. Suppose you are doing some performance benchmarking or comparison, say the AER model with reg_ratio at 0.5 (the default) versus 0.3 or 0.4; we will get different F1 values. How should we decide what part of this difference comes from the inherent randomness and what part from the actual change (the reg_ratio value), especially when the values are not too different? One way could be to run many simulations and take an average to get a better estimate. What's your understanding of this?

sarahmish commented 4 months ago

Depending on the variable you are changing, you can attribute the change accordingly. For example, reg_ratio determines the weight of the forward/backward regression versus the reconstruction. Higher values of reg_ratio mean that you are emphasizing the regression model, and vice versa.

In practice, however, running the same model will yield close but not identical results. To reduce the variability, I recommend running the model for n iterations to see some consistency in the results.
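A minimal sketch of that recommendation (compute_f1 is a hypothetical stand-in for whatever scoring you use against your ground truth):

import numpy as np
from orion import Orion

def compute_f1(detected, expected):
    # Hypothetical stand-in: plug in your own scoring here, e.g. a
    # contextual F1 over anomaly intervals.
    raise NotImplementedError

def mean_f1(data, ground_truth, reg_ratio, n=10):
    # Run the pipeline n times and average the F1 score so that model
    # randomness is smoothed out before comparing settings.
    scores = []
    for _ in range(n):
        orion = Orion(
            pipeline='aer',
            hyperparameters={'orion.primitives.aer.AER#1': {'reg_ratio': reg_ratio}},
        )
        anomalies = orion.fit_detect(data)
        scores.append(compute_f1(anomalies, ground_truth))
    return np.mean(scores), np.std(scores)

# A difference in mean F1 much larger than the run-to-run standard
# deviation is more likely due to reg_ratio than to randomness.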