numenta / NAB

The Numenta Anomaly Benchmark
GNU Affero General Public License v3.0

Bias in "streaming" datasets by API offering minValue,maxValue from base.py #350

Open breznak opened 5 years ago

breznak commented 5 years ago

The abstract class AnomalyDetector (from base.py) exposes the min/max bounds of the dataset in its API: https://github.com/numenta/NAB/blob/master/nab/detectors/base.py#L47

This is a bias, as it makes it easier for encoders to choose optimal settings. In real life, on streaming datasets, such values are not known, and encoders have to deal with that fact (by using an encoder that does not require fixed bounds, such as RDSE, or by setting the bounds large enough, even too large).

I think this is a bug in NAB API design and the information should be removed.
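To make the bias concrete, here is a minimal sketch (toy data, not the actual NAB code) of how handing a detector bounds computed over the whole file leaks future information:

```python
# Toy stream; 95.0 is a "future" anomaly the detector should not know about.
values = [10.0, 12.5, 11.0, 95.0, 13.0]

# NAB-style initialization: bounds are computed over the ENTIRE dataset
# before the detector sees a single point, so they reflect future values.
input_min, input_max = min(values), max(values)

# A truly streaming detector at t=2 has only seen the first three points
# and must estimate the range from those alone.
seen = values[:3]
streaming_min, streaming_max = min(seen), max(seen)

print(input_max)      # 95.0 -- already includes the future spike
print(streaming_max)  # 12.5 -- an online estimate cannot know it yet
```

An encoder configured with the dataset-wide maximum can pick an optimal resolution in advance, which is exactly the advantage described above.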

breznak commented 5 years ago

CC @ctrl-z-9000-times what do you think? We'll fix this for community/NAB

ctrl-z-9000-times commented 5 years ago

I don't like this change.

breznak commented 5 years ago

It is not unreasonable to know the physical limits of the sensor.

this is true; on the other hand, "optimizing" the bounds exactly for a dataset is a bias.

A compromise to what you're saying: keep min/max known, but compute them from ALL datasets combined.

ctrl-z-9000-times commented 5 years ago

"optimizing" bounds exactly for a dataset is a bias.

The way I see it, each dataset was probably recorded using a physical device (the sensor) which has well-known limitations on the range of values it can record. Although the NAB datasets did not save the min/max values of the sensor hardware, they can safely be inferred.

smirmik commented 5 years ago

The way I see it, each dataset was probably recorded using a physical device (the sensor) which has well-known limitations on the range of values it can record. Although the NAB datasets did not save the min/max values of the sensor hardware, they can safely be inferred.

In real projects, many sensors that provide data for anomaly detection report values that are computed from a variety of indicators. For such derived signals it is impossible to know the range of possible values in advance, since there is no physical equipment imposing one.

Sorry for intervening :)

smirmik commented 5 years ago

Minimum and maximum.

In software systems for detecting anomalies, the architecture defines how data is obtained. In some systems, the minimum and maximum are known up front. In others, the first N samples can be used, so that the system decides which range of values is valid and performs internal optimization of the detector settings. In yet others, even that is unacceptable.

NAB is a universal benchmark for testing any detector. The fact that it provides a minimum and maximum does not oblige a detector to use them. On the other hand, the availability of the initial minimum and maximum has led to NAB containing only detectors that use these values, since it makes it easier to optimize detector results. And this contradicts the concept of an "ideal detector".

I think the more correct approach is to remove the minimum and maximum and let detectors solve this problem for themselves. But that would require redesigning all the detectors currently in NAB. An alternative is to keep two results tables: one for detectors that receive the minimum and maximum from outside, and another for those that do not use these values. This is a question of concept; there is no single right answer from the point of view of logic.

Sorry for my English.
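The "first N samples" option mentioned above can be sketched as follows (a hypothetical helper, not part of NAB; the 25% headroom is an arbitrary choice):

```python
def estimate_bounds(warmup, headroom=0.25):
    """Estimate (min, max) from a warm-up window, padded by a fraction
    of the observed range to leave room for values not yet seen."""
    lo, hi = min(warmup), max(warmup)
    pad = (hi - lo) * headroom
    return lo - pad, hi + pad

stream = [20.0, 22.0, 19.5, 21.0, 23.0, 40.0, 18.0]
# Decide the bounds from the first 5 samples only.
lo, hi = estimate_bounds(stream[:5])
print(lo, hi)  # 18.625 23.875
```

Note that the later value 40.0 still falls outside the estimated range, which illustrates why some systems find even this approach unacceptable: the detector must still decide what to do with out-of-range values.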

ankitnayan commented 4 years ago

Any ideas on how to solve this? Providing min/max does not make sense for streaming data, even for stock market time-series data. I could update the model every day with a new max and min, but that's against the philosophy of online ML. Has anybody got an idea of how to solve this?

subutai commented 4 years ago

It's a valid point, and one we discussed quite a bit. We chose the current method due to the very high dynamic, but known, range of some of the streams. We found in practical applications this assumption (knowing the min/max) was valid in most cases. Dynamically figuring out min/max is a hard task, and beyond the scope of the NAB dataset. Maybe it's something that could be addressed in a future version.

For something like stock prices, I would suggest picking a large max upfront, say 2x or 4x the current max value. That should work fine. Keep in mind that raw stock market price data is inherently very unpredictable, so it's not a good dataset for any anomaly detection algorithm that I know of. I wouldn't expect good results no matter what.
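The "large max upfront" suggestion can be combined with a running estimate so the bounds only widen when the stream actually exceeds the initial guess. A minimal sketch (hypothetical helper; the 4x headroom factor is taken from the suggestion above and assumes a positive-valued series):

```python
class RunningBounds:
    """Start from a generous guess around the first observation and
    widen the bounds whenever the stream exceeds them."""

    def __init__(self, first_value, headroom=4.0):
        # For a positive series (e.g. a price), guess max as 4x the
        # first value and min as a quarter of it.
        self.min = min(first_value, first_value / headroom)
        self.max = max(first_value, first_value * headroom)

    def update(self, value):
        # Widen only on overflow; never shrink, so downstream encoder
        # settings stay stable once chosen.
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        return self.min, self.max

b = RunningBounds(100.0)
b.update(150.0)
print(b.max)  # 400.0 -- the headroom still covers the new value
b.update(500.0)
print(b.max)  # 500.0 -- the bound widened when the guess was exceeded
```

The catch, as noted above, is that widening bounds mid-stream may force an encoder to be re-parameterized, which is part of why dynamically figuring out min/max is a hard task.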