sintel-dev / Orion

Library for detecting anomalies in signals
https://sintel.dev/Orion/
MIT License
1.05k stars 162 forks

Discrepancy in Reproduced F1 Scores Compared to Published Results #539

Closed jed-ho closed 1 month ago

jed-ho commented 6 months ago

Description

I am attempting to reproduce the results of the research paper AER: Auto-Encoder with Regression for Time Series Anomaly Detection. I ran benchmark.py and obtained per-signal results, but they show a significant discrepancy from the F1 scores reported in the paper. Could you please help investigate this discrepancy? Any guidance on whether I might be missing a step or misinterpreting the results would be greatly appreciated.

What I Did

  1. Run benchmark.py and obtain the results for each signal.
  2. Compare these results with those in Orion/benchmark/results/0.6.0.csv; the values in my results do match those in the file.
  3. Calculate the average F1 scores across signals from my results (a rough sketch of this step follows the list).
  4. Compare these average F1 scores with leaderboard.xlsx; there are minor differences between my results and the spreadsheet.
  5. Compare both sets of results with the F1 scores published in the paper; they exhibit a significant discrepancy.
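Roughly, the averaging in step 3 looks like this (the column names are illustrative and may differ from the actual benchmark.py output):

```python
import pandas as pd

# Per-signal scores produced by benchmark.py (the released equivalent is
# Orion/benchmark/results/0.6.0.csv). Column names here are assumptions.
results = pd.read_csv('results.csv')

# Average the per-signal F1 scores for every pipeline/dataset pair, which is
# roughly the quantity reported in leaderboard.xlsx and in the paper.
summary = (
    results
    .groupby(['pipeline', 'dataset'])['f1']
    .mean()
    .unstack('dataset')
)
print(summary.round(3))
```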
sarahmish commented 5 months ago

Hi @jed-ho - thank you for using Orion!

After running benchmark.py, you can use get_f1_scores in results.py to get the overview F1 scores and write_results to produce leaderboard.xlsx.
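For example, a rough call pattern (the exact signatures are in results.py; the import path and arguments below are only an assumption):

```python
import pandas as pd
from results import get_f1_scores, write_results  # Orion/benchmark/results.py

# Per-signal scores written by benchmark.py (path is an example).
results = pd.read_csv('results.csv')

# Assumed usage: summarize the per-signal scores into overview F1 values,
# then write the leaderboard spreadsheet -- check results.py for the real signatures.
overview = get_f1_scores(results)
write_results(results, 'leaderboard.xlsx')
print(overview)
```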

In Orion, we publish the benchmark with every release to help navigate the changes that happen due to external factors such as dependency changes and package updates. Your results should be consistent with the latest published benchmark results.
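As a quick sanity check, something like this can flag any per-signal differences against the released scores (the column names are assumptions; adjust them to the actual CSV layout):

```python
import pandas as pd

# Local run vs. the published results for the release matching your install.
mine = pd.read_csv('results.csv')
published = pd.read_csv('Orion/benchmark/results/0.6.0.csv')

# Align on pipeline + signal and flag F1 values that differ noticeably.
merged = mine.merge(published, on=['pipeline', 'signal'],
                    suffixes=('_mine', '_published'))
mismatch = merged[(merged['f1_mine'] - merged['f1_published']).abs() > 1e-3]
print(mismatch[['pipeline', 'signal', 'f1_mine', 'f1_published']])
```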

Hope this answers your questions!