Synthetic datasets for different classes of anomalies

breznak commented 9 years ago

Define, create and include synthetic datasets for different kinds of anomalies. This is important for regression testing, as simple data can stress (at different difficulties) certain properties of HTM. It will also help to define concrete advantages and weak spots of HTM.

- [ ] collect and describe all "theoretical classes" of anomalies
  - 1D
    - [ ] point anomaly
    - [ ] amplitude shift
    - [ ] phase shift
    - [ ] frequency shift
    - [ ] noisy
    - [ ] combination of the above
    - [ ] generating distribution change
  - nD
    - [ ] de/correlated variables (multi-modal input)
  - by input
    - [ ] data with "holes"
    - [ ] "tricky" data, designed to look similar (overlapping sequences, ...)
    - [ ] auto-tuning on "far data", e.g. every 1000th value is A, then value number 121000 is B instead of A.
- [ ] generate data
  - [ ] synthetic data for each of the classes
  - [ ] a well-known published dataset for the given class, for comparison with other algorithms.

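To make the 1D classes above concrete, here is a minimal sketch of how such series could be generated. It is not NAB or HTM code: the function names, the sine-wave base signal, and all default parameters (period, magnitudes, noise level) are assumptions chosen only for illustration.

```python
import math
import random


def base_signal(n, period=50.0):
    """Clean sine wave with n samples (the 'normal' behaviour)."""
    return [math.sin(2 * math.pi * t / period) for t in range(n)]


def point_anomaly(signal, index, magnitude=5.0):
    """Copy of signal with a single spike added at index."""
    out = list(signal)
    out[index] += magnitude
    return out


def amplitude_shift(n, start, period=50.0, factor=2.0):
    """Sine wave whose amplitude is multiplied by factor from start on."""
    return [(factor if t >= start else 1.0) * math.sin(2 * math.pi * t / period)
            for t in range(n)]


def phase_shift(n, start, period=50.0, shift=math.pi / 2):
    """Sine wave whose phase jumps by shift radians at start."""
    return [math.sin(2 * math.pi * t / period + (shift if t >= start else 0.0))
            for t in range(n)]


def frequency_shift(n, start, period=50.0, new_period=25.0):
    """Sine wave whose period changes at start.

    The phase is accumulated step by step so the waveform stays
    continuous at the change point (no accidental point anomaly).
    """
    out, phase = [], 0.0
    for t in range(n):
        out.append(math.sin(phase))
        phase += 2 * math.pi / (new_period if t >= start else period)
    return out


def add_noise(signal, sigma=0.1, seed=0):
    """Copy of signal with Gaussian noise; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# "Combination of the above": a noisy frequency shift with one spike.
combined = add_noise(point_anomaly(frequency_shift(200, start=100), index=150))
```

Keeping each class in its own small function means the difficulty can be tuned per class (e.g. a smaller `factor` or a larger `sigma`), which is what a regression suite would need.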
Theoretical challenges
- [ ] Different kinds of anomalies
  We want to detect all of them as anomalies, but we may also want to differentiate among them. An example is the MIT-BIH ECG data, where there are _V_entricular anomalies (easy) and about 4 more types. This somewhat combines anomaly detection with classification of sequences.
- [ ] Scope!
  Take temperature, for example. Measured every morning at 7am, I get a relatively stable, slowly changing pattern; measured every hour, I get a stable pattern with significant changes; measured every 7h, it looks like random data.
  So the question is: how can HTM "decide" the optimal aggregation, or focus scale? For example, with a GPS position reported every second, how do you scale?
- [ ] Model auto-adaptation
  Should all of these be part of one HTM/anomaly model? Or run as an ensemble of specific models?
- [ ] Anomaly prediction
  Yes, it's an oxymoron, but everybody wants it! :icecream: I think this is a core problem; my ideas include running a combination of different HTM models (with different scales) ...
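The "Scope!" item above can be illustrated with a small sketch. The temperature model here (a 24h cycle plus a slow yearly drift) and every constant in it are hypothetical, invented only to show how the sampling interval changes what the series looks like.

```python
import math


def temperature(hour):
    """Hypothetical temperature in degrees C at a given hour since start."""
    daily = 5.0 * math.sin(2 * math.pi * (hour - 9) / 24.0)       # day/night cycle
    seasonal = 10.0 * math.sin(2 * math.pi * hour / (24.0 * 365))  # slow drift
    return 15.0 + daily + seasonal


def sample(step_hours, n, offset=7):
    """Take n readings, one every step_hours, starting at 7am."""
    return [temperature(offset + i * step_hours) for i in range(n)]


def mean_abs_step(series):
    """Average absolute change between consecutive readings."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)


every_morning = mean_abs_step(sample(24, 200))  # only the slow drift is visible
every_hour = mean_abs_step(sample(1, 200))      # the daily cycle is resolved
every_7h = mean_abs_step(sample(7, 200))        # the cycle is undersampled
```

In this toy model the step-to-step change grows from `every_morning` to `every_hour` to `every_7h`: the same underlying process looks smooth, structured, or nearly random depending only on the sampling interval.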

breznak commented 9 years ago

@subutai This has been on my mind for a while, I'd love to hear your thoughts on it! :question:

subutai commented 9 years ago

NAB already includes a few artificial datasets, some of which fall into your classes above. I think it is fine to create some more elsewhere (i.e. another repo) that are NAB compatible, but I don't really want to add them into the formal benchmark. I want to focus NAB mostly on real world data and would ideally like to even get rid of the existing artificial datasets. There are lots of other anomaly benchmarks with artificial data.
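As a rough sketch of what "NAB compatible" could mean for a generated series: this assumes the data files are plain CSV with a `timestamp,value` header and `YYYY-MM-DD HH:MM:SS` timestamps, which should be checked against the NAB repository before relying on it. The helper name `to_nab_csv`, the start date, and the 5-minute step are made up for illustration.

```python
import csv
import io
from datetime import datetime, timedelta


def to_nab_csv(values, start=datetime(2020, 1, 1), step=timedelta(minutes=5)):
    """Render a list of numbers as CSV text with timestamp,value columns.

    The column names and timestamp format are assumptions about the
    NAB data layout, not taken from this thread.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "value"])
    for i, v in enumerate(values):
        ts = start + i * step
        writer.writerow([ts.strftime("%Y-%m-%d %H:%M:%S"), v])
    return buf.getvalue()


text = to_nab_csv([1.0, 2.5])
```

Anomaly labels would have to be recorded separately; how they are stored is not covered by this sketch.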

breznak commented 5 years ago

> I want to focus NAB mostly on real world data and would ideally like to even get rid of the existing artificial datasets.

Revisiting this. Your decision sounds fair; I'll set up a NAB.synthetic.

- NAB: show how algorithms really behave (in real-life examples)
- NAB.synthetic: explain why/where the algorithms fail (to detect)

subutai commented 5 years ago

👍

numenta / NAB

Synthetic datasets for different classes of anomalies #217