Some of the values from the beacons are quite erroneous, reading well outside the expected range. A quick glance shows that these events tend to be short-lived, indicating that the sensor isn't consistently reading high but rather experienced some issue (power, fouling, etc.) that caused a spike.
## The Problem
There are currently two things we need to look into:

1. Are these points read by the same beacon, or are they dispersed amongst all the beacons?
2. How do we remove these points in the most robust way without removing too much data?
## Where to look
I have started to explore the distributions of data points from each sensor in the beacon exploration notebook. The histograms provide some initial insight into the problem.
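To get at the first question (one misbehaving beacon vs. outliers spread across all of them), the same histograms can be drawn per beacon. This is only a minimal sketch: the file path and the `beacon`/`co2` column names are placeholders, not the actual names used in the repo.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical processed file and column names - placeholders only
df = pd.read_csv("data/processed/beacon_data.csv")
pollutant = "co2"  # whichever sensor channel is being inspected

beacons = sorted(df["beacon"].unique())
fig, axes = plt.subplots(nrows=len(beacons), sharex=True,
                         figsize=(6, 2 * len(beacons)), squeeze=False)
axes = axes.ravel()
for ax, beacon in zip(axes, beacons):
    values = df.loc[df["beacon"] == beacon, pollutant].dropna()
    ax.hist(values, bins=50)
    ax.set_ylabel(f"Beacon {beacon}")
axes[-1].set_xlabel(pollutant)
plt.tight_layout()
plt.show()
```

If the extreme values show up in only one or two of the panels, the problem is a specific beacon; if every panel has a long tail, the filtering has to happen across the board.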
## The Current Solution

The data are pre-processed in the [make_dataset](https://github.com/intelligent-environments-lab/utx000/blob/master/src/data/make_dataset.py) file. The current processing checks the z-score of each individual value and removes any value whose absolute z-score is greater than 2.5.

This process works for the most part, but some values that shouldn't be retained still are. These leftover outliers become apparent when trying to calculate metrics like the percent change over a certain timeframe.
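For reference, the screening step amounts to something like the following (a simplified sketch, not the exact code in make_dataset.py):

```python
import pandas as pd

def remove_outliers_zscore(series: pd.Series, threshold: float = 2.5) -> pd.Series:
    """Drop values whose absolute z-score, relative to the whole series, exceeds `threshold`."""
    z = (series - series.mean()) / series.std()
    return series[z.abs() <= threshold]
```

One likely reason some spikes survive is that the extreme values themselves inflate the series-wide standard deviation, so the 2.5 threshold ends up more permissive than intended.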
## The Solution
We need some other way to smooth out the data. The beacon values are already kept as five-minute averages, but perhaps we can apply some sort of filter on top of that. I am thinking of (see the sketch after this list):

- rolling average
- rolling median
- weighted average based on the standard deviation of the dataset (surely something like this already exists?)
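All three can be compared side by side with pandas rolling windows. This is a rough sketch only, assuming the series has a datetime index of five-minute averages; the 30-minute window, the minimum-period setting, and the interpretation of the std-based weighting are assumptions, not decisions the project has made.

```python
import pandas as pd

def smooth(series: pd.Series, window: str = "30min") -> pd.DataFrame:
    """Compare candidate smoothers on a datetime-indexed series of 5-minute averages."""
    roll = series.rolling(window, min_periods=3)
    out = pd.DataFrame({
        "raw": series,
        "rolling_mean": roll.mean(),
        "rolling_median": roll.median(),
    })

    # One reading of "weighted average based on the standard deviation": down-weight
    # points that sit far from the local median, measured in local standard deviations.
    deviation = (series - out["rolling_median"]).abs() / roll.std()
    weights = 1.0 / (1.0 + deviation.fillna(0.0))
    weighted_sum = (series * weights).rolling(window, min_periods=3).sum()
    weight_sum = weights.rolling(window, min_periods=3).sum()
    out["std_weighted"] = weighted_sum / weight_sum
    return out
```

Plotting these columns against a known spike should show which option removes the short-lived events without flattening real changes; the rolling median in particular tends to ignore isolated spikes entirely, which matches the short-lived nature of the bad readings.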