wandnz / streamevmon

Framework and pipeline for time series anomaly detection

GNU General Public License v3.0

Parameter tuning algorithm selection and implementation #34

Closed wandgitlabbot closed 3 years ago

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-08-24

There are a number of approaches we can take for parameter tuning. The Wikipedia page gives a good overview, but we might be able to find some libraries that do the hard bits for us.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-09

The smac branch implements SMAC2 as a tuning algorithm. It works, but it still needs some further configuration:

[x] We should set one of --cputime-limit, --runcount-limit, --iteration-limit, or --wallclock-limit.
[ ] We should consider using results from random/ and old smac/ runs to warmstart SMAC.
[x] We could also use --rungroup to put each execution in a unique folder so they're more organised.
[ ] It could also be worth investigating Shared Model Mode to improve performance.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-09

Runtime limit flags are supported via passthrough. We're piggybacking on SMAC's bundled JCommander version, which is pretty old and not very good. Its usage formatting is really weird...
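
As an illustration of what "passthrough" means here, the following is a minimal sketch, not the project's actual argument handling: the tuner keeps its own arguments and forwards the four SMAC limit flags named above, with their values, untouched. The function name and the `--detectors` flag in the usage note are hypothetical, and only the space-separated `--flag value` form is handled.

```python
# Hypothetical sketch: separate the tuner's own arguments from SMAC
# runtime-limit flags, which are forwarded to SMAC unchanged.
SMAC_LIMIT_FLAGS = {
    "--cputime-limit",
    "--runcount-limit",
    "--iteration-limit",
    "--wallclock-limit",
}


def split_passthrough(argv):
    """Return (own_args, smac_args); each limit flag keeps its following value."""
    own_args, smac_args = [], []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in SMAC_LIMIT_FLAGS:
            if i + 1 >= len(argv):
                raise ValueError(f"{arg} requires a value")
            # Forward the flag and its value as a pair.
            smac_args.extend([arg, argv[i + 1]])
            i += 2
        else:
            own_args.append(arg)
            i += 1
    return own_args, smac_args
```

For example, `split_passthrough(["--detectors", "baseline", "--wallclock-limit", "55000"])` keeps `["--detectors", "baseline"]` for the tuner and hands `["--wallclock-limit", "55000"]` to SMAC.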

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-09

I'm not super happy with using the ParameterTuner object as global variable storage, but the darn ServiceLoader approach makes it impossible to just pass configuration to the TAE (target algorithm evaluator), which would then build everything... I'll probably need to find a way to write these config items to the filesystem, but how is the TAE supposed to find the unique config location for each run?

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-09

Warm starting will require us to discover old SMAC run data files and then merge them using ca.ubc.cs.beta.aeatk.example.statemerge.StateMergeExecutor. We would need to be careful to only discover runs that used the same detectors and score targets. We could probably also use this to imitate Shared Model Mode if we wanted to.

However, we don't really need this functionality yet. It would require a shared filesystem, for which we could use the WAND or R block computers, and it would be extra work to verify that it's functioning properly. Since we have a bunch of configurations we want to try out before settling on one, it isn't needed just yet.
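
The discovery step described above has to select only compatible runs before anything is merged. Here is a minimal sketch of that filter, assuming each earlier run's output folder has already been described by a hypothetical `PreviousRun` record. How those records are built, and how the selected folders are then handed to StateMergeExecutor, is not shown and would need checking against AEATK.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PreviousRun:
    # Hypothetical record describing an earlier SMAC run's output folder.
    output_dir: Path
    detectors: frozenset
    score_targets: frozenset


def compatible_runs(runs, detectors, score_targets):
    """Keep only runs tuned with exactly the same detectors and score targets."""
    wanted_detectors = frozenset(detectors)
    wanted_targets = frozenset(score_targets)
    return [
        run
        for run in runs
        if run.detectors == wanted_detectors and run.score_targets == wanted_targets
    ]
```

Exact set equality is deliberately strict: a run that tuned baseline together with other detectors explores a different parameter space, so it shouldn't seed a baseline-only run.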

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-10

> how is the TAE supposed to find the unique config location per run?

Environment variables would do it. They wouldn't need to be replicated between separate hosts running in Shared Model Mode or using warm starts.
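
A minimal sketch of the environment-variable approach, under stated assumptions: the tuner writes each run's config into a fresh folder and exports that folder's path in the environment of the process that will host the TAE, and the TAE reads the path back. The variable name `PARAMETER_TUNER_CONFIG_DIR` and the `config.json` file name are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

# Hypothetical variable name; any name both sides agree on would work.
CONFIG_DIR_VAR = "PARAMETER_TUNER_CONFIG_DIR"


def write_run_config(config):
    """Tuner side: write this run's config to a unique folder.

    Returns an environment mapping to launch the TAE's host process with.
    """
    run_dir = Path(tempfile.mkdtemp(prefix="tuner-run-"))
    (run_dir / "config.json").write_text(json.dumps(config))
    env = dict(os.environ)
    env[CONFIG_DIR_VAR] = str(run_dir)
    return env


def read_run_config():
    """TAE side: locate the per-run config through the inherited environment."""
    run_dir = Path(os.environ[CONFIG_DIR_VAR])
    return json.loads((run_dir / "config.json").read_text())
```

Because each call to `write_run_config` creates a new folder, concurrent runs on one host can't overwrite each other's config, and nothing beyond the variable itself has to be shared.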

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-10

Environment variables work well.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-10

A few more bugfixes to make it work on my uni workstation. Time to list all the tests I want to do.

[ ] All detectors individually with only standard score target (5 tests)
- [ ] Baseline
- [ ] Changepoint
- [ ] DistDiff
- [ ] Mode
- [ ] Spike
[ ] All detectors individually with mean of all score targets (5 tests)
- [ ] Baseline
- [ ] Changepoint
- [ ] DistDiff
- [ ] Mode
- [ ] Spike
[ ] All detectors together with only standard score target (1 test)
[ ] All detectors together with mean of all score targets (1 test)
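
The plan above is a small grid: 5 detectors times 2 score modes run individually, plus the full detector set under each score mode, for 5 + 5 + 1 + 1 = 12 tests. A short sketch that enumerates it in the same order as the checklist (the labels are illustrative only):

```python
from itertools import product

DETECTORS = ["Baseline", "Changepoint", "DistDiff", "Mode", "Spike"]
SCORE_MODES = ["standard only", "mean of all score targets"]


def planned_tests():
    """Return (detector tuple, score mode) pairs in the checklist's order."""
    # product(SCORE_MODES, DETECTORS) varies the detector fastest, so all
    # standard-only tests come before the mean-of-targets tests.
    individual = [((d,), mode) for mode, d in product(SCORE_MODES, DETECTORS)]
    together = [(tuple(DETECTORS), mode) for mode in SCORE_MODES]
    return individual + together
```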

Each of these should be run for long enough to come up with a reasonable solution. The test of baseline with the standard target reached a 10-point improvement in around 250 runs. I don't remember how long it took, but it might be worth investigating what a good stopping point is.

While they're running, I can document this module as well.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-10

I should also implement support for the PCS file's forbidden parameter clauses, and restrict the bounds of certain other parameters, such as changepoint's time between events.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-10

Ran a changepoint test with only the standard score target overnight. It got through 316 runs before stalling, and it never finished. (I originally blamed this on the wallclock time limit not being honoured, but it turns out the limit wasn't applied at all due to a typo in the run script.) The scorer has hung and doesn't seem to be using any CPU. There is still free space on the disk, but the RAM is getting pretty full.

[Image: htop clip]

Above is a clip of the relevant htop fields. I've had this problem a couple of times with this project: long runtimes seem to invariably lead to high RAM usage and stalling. When I was taking archives of esmond data, I was able to finish the run by increasing -Xmx, but I don't remember whether I was getting stalls or OutOfMemory crashes.

I think I forgot to write an update on this note. I "solved" the stalling problem by setting a timeout on the scorer process. The scorer often completes its work but never exits, so in that case we can just pick up the results and ignore the fact that it never finished properly. This issue is getting to be a bit of a mess.
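
A minimal sketch of the timeout workaround described above, assuming the scorer is an external command that writes its results to a known file. The function name and results-file convention are hypothetical, and the real scorer may run differently.

```python
import subprocess
from pathlib import Path


def run_scorer(cmd, results_file, timeout_s):
    """Run the scorer, keeping its results even if it hangs after writing them.

    Returns the results file's text, or None if the scorer produced nothing.
    """
    results = Path(results_file)
    # Remove stale output so old results can't be mistaken for this run's.
    results.unlink(missing_ok=True)
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires. The scorer
        # may already have finished its work without exiting, so check for output.
        pass
    return results.read_text() if results.exists() else None
```

One gap in this sketch: if the scorer is killed part-way through writing, a truncated file would be accepted. Having the scorer write to a temporary file and rename it once complete would close that gap.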

11 incumbents were found among the 315 completed runs. The last incumbent was the 290th run, which is pretty close to the end of the run. This suggests there's still an opportunity to find a better configuration if the procedure is left running longer. We need to fix the stalling issue first, which is going to be awful to diagnose since it takes hours to show itself.
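
To make the incumbent count concrete, here is a simplified sketch: an incumbent is taken to be any run that beats every earlier run, assuming lower cost is better. SMAC's real incumbent selection also involves intensification across problem instances, so this only approximates the idea.

```python
def incumbent_indices(costs):
    """Return 1-based indices of runs that improved on every earlier run.

    Simplification: assumes one cost per run and that lower is better.
    """
    best = float("inf")
    incumbents = []
    for i, cost in enumerate(costs, start=1):
        if cost < best:
            best = cost
            incumbents.append(i)
    return incumbents
```

For example, `incumbent_indices([5.0, 3.0, 4.0, 2.5])` returns `[1, 2, 4]`. A late final incumbent index relative to the total number of runs is the signal discussed above that the search hadn't converged yet.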

The last incumbent's parameters and results were as follows:

ignoreOutlierNormalCount = 584.6803086375437
maxHistory = 62.4016019195083
minimumEventInterval = 21.004046744153676
severityThreshold = 74.18793507653373
triggerCount = 50.10933738319983
inactivityPurgeTime = 2147483647 (fixed)
reward_low_FN_rate -> -41.27976077106019
reward_low_FP_rate -> 2.3884360590556413
standard -> 4.8409288841122455

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-14

I forgot that I'd disabled NAB output file cleanup before running overnight, so the disk filled up again. This was a baseline test using the average of all three score targets, with the following best incumbent:

detector.baseline.maxHistory = 11
detector.baseline.percentile = 0.5433900309369988
detector.baseline.threshold = 39
reward_low_FN_rate = -20.96714806045579
reward_low_FP_rate = 5.538053878273805
standard = 15.239748491268339

This was config 208 out of a total of 632 runs.
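
The "average of all three score targets" objective can be sketched as a plain mean of the per-target scores. Negating it into a cost is an assumption about how a minimising tuner would consume it, not something confirmed in this thread.

```python
def mean_score(scores):
    """Average the score targets into a single objective (higher is better)."""
    return sum(scores.values()) / len(scores)


def to_cost(scores):
    # Assumption: the tuner minimises, so the mean score is negated into a cost.
    return -mean_score(scores)


baseline_incumbent = {
    "reward_low_FN_rate": -20.96714806045579,
    "reward_low_FP_rate": 5.538053878273805,
    "standard": 15.239748491268339,
}
```

For the baseline incumbent above, the mean works out to roughly -0.063. The strongly negative reward_low_FN_rate score almost exactly cancels the other two, which is one reason to also look at the targets individually.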

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-15

Josh pointed me towards the idea of using an object-oriented representation of forbidden parameters, so that the interface would look something like params.addLessThanConstraint("paramA", "paramB").

This will probably end up replacing the code-based Parameters validation that most detectors currently declare. We also want to extend the restriction-declaring object with a method to convert it into a SMAC PCS-file entry. This extension (whether it's a subclass or an extension method) will go in the parameterTuner module.
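
A minimal sketch of the object-oriented restriction idea, written in Python with snake_case in place of the proposed `params.addLessThanConstraint("paramA", "paramB")`. Each constraint can both validate a parameter set and render itself as a forbidden clause that forbids the complement. The class names are hypothetical, and the exact comparison syntax SMAC accepts in forbidden clauses is an assumption to check against the SMAC manual.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LessThan:
    """Restriction: `smaller` must be strictly less than `larger`."""
    smaller: str
    larger: str

    def is_satisfied(self, params):
        return params[self.smaller] < params[self.larger]

    def to_pcs_forbidden(self):
        # Forbid the complement of the restriction. The comparison syntax is
        # an assumption about SMAC's PCS format; verify before relying on it.
        return f"{{{self.smaller} >= {self.larger}}}"


@dataclass
class ParameterRestrictions:
    constraints: list = field(default_factory=list)

    def add_less_than_constraint(self, smaller, larger):
        self.constraints.append(LessThan(smaller, larger))
        return self  # allow chaining several constraints

    def validate(self, params):
        """Replacement for hand-written validation: all constraints must hold."""
        return all(c.is_satisfied(params) for c in self.constraints)

    def to_pcs_lines(self):
        """One forbidden-clause line per constraint, for the PCS file."""
        return [c.to_pcs_forbidden() for c in self.constraints]
```

Keeping `validate` and `to_pcs_lines` driven by the same constraint list means the detector-side validation and the PCS file can't drift apart.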

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-15

Another overnight run completed. This time it was the changepoint detector with only the standard score target. The best incumbent was run 721, out of 1299 runs completed within a wallclock time limit of 55,000 s:

detector.changepoint.ignoreOutlierNormalCount = 499
detector.changepoint.maxHistory = 74
detector.changepoint.minimumEventInterval = 540
detector.changepoint.severityThreshold = 41
detector.changepoint.triggerCount = 50
reward_low_FN_rate = -42.26151111819322
reward_low_FP_rate = 2.004691307691443
standard = 4.290213019737836

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-17

I've upgraded to SMAC 2.10.03, since it supports comparisons in forbidden parameter clauses. This should cut down on useless tests by some amount, and the update apparently comes with other efficiency improvements. The downside is that the parameterTuner module now has to be licensed under AGPLv3 or later. I've added that detail to the sbt module definition, but I probably also need to add a LICENSE file, possibly a NOTICE file, and the copyright notice to the file headers within the module. I could use either the IntelliJ plugin or the sbt-header plugin to automate this.

wandgitlabbot commented 3 years ago

In GitLab, by Daniel Oosterwijk on 2020-09-18

The feature is basically complete at this point, and this issue is an absolute mess. I'm going to close it and move all the results over to a new page on the as-yet-unused wiki for this project. That should hopefully be tidier than trying to keep track of an issue.