wandnz / streamevmon

Framework and pipeline for time series anomaly detection
GNU General Public License v3.0

Improve NAB scores #31

Closed: wandgitlabbot closed this issue 3 years ago

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-08-21

The existing detectors (baseline, changepoint, distdiff, mode, spike) have been tested against the NAB dataset, but the results were not very impressive. Three of the five detectors produced no events at all, and the two that did scored with low accuracy. I believe these scores could be improved by tuning the detectors' configurations to the dataset. NAB testing mandates that each detector use a single config for the entire dataset, and that detectors may not look ahead at upcoming data to pre-tune themselves. Luckily, that's how we're already set up :)
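For illustration, a single whole-dataset config might look like the sketch below. The parameter names and values are hypothetical placeholders, not streamevmon's actual configuration keys; the point is only that one set of values is applied unchanged to every NAB data file.

```python
# Hypothetical single-config setup, illustrative only: these are NOT
# streamevmon's real configuration keys or defaults. NAB requires that
# one set of values like this is applied to every data file, with no
# per-file tuning and no look-ahead.
NAB_DETECTOR_CONFIG = {
    "baseline":    {"maxHistory": 50, "threshold": 25},
    "changepoint": {"maxHistory": 60, "triggerCount": 10},
    "distdiff":    {"recentsCount": 20, "zThreshold": 5.0},
    "mode":        {"maxHistory": 30, "minFrequency": 6},
    "spike":       {"lag": 50, "threshold": 30.0},
}
```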

The results are as follows:

| Detector | Standard Profile | Reward Low FP | Reward Low FN |
| --- | --- | --- | --- |
| Baseline | 0.00 | 0.00 | 2.66 |
| Changepoint | 0.00 | 0.00 | 0.00 |
| DistDiff | 10.22 | 3.88 | 17.22 |
| Mode | 0.00 | 0.00 | 0.00 |
| Spike | 0.00 | 0.00 | 0.00 |

The official NAB scoreboard as of the time of writing is reproduced below:

| Detector | Standard Profile | Reward Low FP | Reward Low FN |
| --- | --- | --- | --- |
| Perfect | 100.0 | 100.0 | 100.0 |
| Numenta HTM* | 70.5-69.7 | 62.6-61.7 | 75.2-74.2 |
| CAD OSE | 69.9 | 67.0 | 73.2 |
| earthgecko Skyline | 58.2 | 46.2 | 63.9 |
| KNN CAD | 58.0 | 43.4 | 64.8 |
| Relative Entropy | 54.6 | 47.6 | 58.8 |
| Random Cut Forest **** | 51.7 | 38.4 | 59.7 |
| Twitter ADVec v1.0.0 | 47.1 | 33.6 | 53.5 |
| Windowed Gaussian | 39.6 | 20.9 | 47.4 |
| Etsy Skyline | 35.7 | 27.1 | 44.5 |
| Bayesian Changepoint** | 17.7 | 3.2 | 32.2 |
| EXPoSE | 16.4 | 3.2 | 26.9 |
| Random*** | 11.0 | 1.2 | 19.5 |
| Null | 0.0 | 0.0 | 0.0 |
wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-08-23

I'm considering using some form of automated parameter tuning to see if I can reduce the workload of doing this manually. It would take a fair bit of work to get functioning, but is likely to be useful on other workloads once complete.
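As a rough sketch of what that automated tuning could look like, here's a simple random search. Everything in it is hypothetical: `run_nab_pipeline` stands in for the run/post-process/score chain described below, and the search space uses made-up parameter names.

```python
import random

# Hypothetical search space: these parameter names and ranges are
# placeholders, not streamevmon's real config options.
SEARCH_SPACE = {
    "distdiff.zThreshold": (1.0, 10.0),
    "distdiff.recentsCount": (10, 200),
}

def sample_config(space):
    """Draw one candidate config, uniformly from each parameter's range."""
    return {
        name: random.uniform(lo, hi) if isinstance(lo, float) else random.randint(lo, hi)
        for name, (lo, hi) in space.items()
    }

def run_nab_pipeline(config):
    """Placeholder for the existing chain: run the detectors with this
    config, post-process the results, and invoke the NAB scorer,
    returning the standard-profile score."""
    raise NotImplementedError

def random_search(space, trials=50):
    """Keep the best-scoring config seen across `trials` random samples."""
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(space)
        score = run_nab_pipeline(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```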

Currently, we have an entrypoint that runs all the detectors against the NAB dataset. We then do some post-processing of the results in Python before running them through the NAB scorer, which is also written in Python (although we invoke the scorer via a Bash script). Automating this would involve several steps.

I'm going to spin these steps off into separate issues, and maybe even try out GitLab's Milestones feature.
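For reference, gluing the existing stages together from one driver script could be as simple as the sketch below. The commands and script names are placeholders standing in for the pieces described above, not the repository's actual file names.

```python
import subprocess

def run_detectors():
    """Run the entrypoint that feeds the whole NAB dataset through
    every detector. Placeholder command."""
    subprocess.run(["./run-nab-detectors.sh"], check=True)

def postprocess_results():
    """Massage detector output into the layout the NAB scorer expects.
    Placeholder script name."""
    subprocess.run(["python3", "postprocess-nab-results.py"], check=True)

def score_results():
    """Invoke the (Python) NAB scorer via its Bash wrapper.
    Placeholder script name."""
    subprocess.run(["bash", "run-nab-scorer.sh"], check=True)

if __name__ == "__main__":
    run_detectors()
    postprocess_results()
    score_results()
```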

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2021-01-21

I've finished parsing the logs for the final optimisation run, and wrote it up in the wiki at Parameter Tuning Results. I'll add the script I used to run the tests to the repo, then close this issue.