This PR is a complete rewrite of the Flat Line Test.
This rewrite includes:
- A bug fix, to handle a flat line starting at the beginning of the timeseries
  - If the data flat-lined from the very beginning of the timeseries, it was getting marked as FAIL before hitting the minimum time threshold. This was because, when we checked whether all the values in a chunk were within the threshold, we were also counting the empty parts of the chunk.
  - See `test_flat_line_starting_from_beginning` in `test_qartod.py`
- Massive speed improvements
  - The previous implementation manually chunked the timeseries and then ran a `for` loop over each chunk. This was incredibly slow: for example, it took ~5s to process a timeseries with 90k observations on my laptop.
  - This implementation uses a numpy rolling window to avoid the `for` loop. For the same dataset on my laptop, this took 0.1 seconds.
  - See `QartodFlatLinePerformanceTest` in `qartod_test.py`
- A different algorithm to find flat lines. While this algorithm differs slightly from the one described in the QARTOD manuals, it behaves in the same way for the majority of cases, and does a better job of handling certain edge cases. See the Algorithm Changes section below.
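To illustrate where the speed-up comes from, here is a rough sketch of swapping a Python-level loop for a strided view over the array. This is not the code from this PR: it assumes a count-based window and NumPy >= 1.20's `sliding_window_view`, and both function names are hypothetical. It only shows that the two approaches produce the same per-window range.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20


def window_range_loop(values, count):
    """Range (max - min) of each trailing window, using an explicit loop."""
    out = np.full(len(values), np.nan)
    for i in range(count - 1, len(values)):
        chunk = values[i - count + 1 : i + 1]
        out[i] = chunk.max() - chunk.min()
    return out


def window_range_vectorized(values, count):
    """Same result, using a strided view instead of a Python loop."""
    out = np.full(len(values), np.nan)
    if len(values) >= count:
        # Shape (n - count + 1, count); no data is copied.
        windows = sliding_window_view(values, count)
        out[count - 1 :] = np.ptp(windows, axis=1)
    return out


# Hypothetical sample data, only to show that both versions agree.
sample = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0])
loop_result = window_range_loop(sample, 3)
vectorized_result = window_range_vectorized(sample, 3)
```

The first `count - 1` entries stay `NaN` because no full window ends there yet.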
Algorithm Changes
The current implementation compares the current point n to a number of previous observations, and flags it if those observations are within a certain tolerance. This follows what the QARTOD manual says (pg 18):
> This test compares the present observation (n) to a number
> (REP_CNT_FAIL or REP_CNT_SUSPECT) of previous observations.
> Observation n is flagged if it has the same value as previous observations
> within a tolerance value, EPS.
However, this can lead to unintuitive results in some cases.
Take an example where the current implementation goes from FAIL to PASS to FAIL again, with no SUSPECT transition in between. That doesn't seem right.
This is happening because in the current implementation, we create an envelope around the current point, based on the threshold size. If all the points in the chunk are in that envelope, then it fails the test. If any of the points in the chunk are outside the envelope, it passes.
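As a rough sketch of that envelope check (not the project's actual code): the version below assumes count-based windows rather than time spans, a strict `<` comparison, and the usual QARTOD flag values (1 = pass, 3 = suspect, 4 = fail). The function name and the sample series are made up for illustration. The series is flat, rises over a hump, and then settles at a slightly different level.

```python
import numpy as np

# Illustrative flag values, following the QARTOD convention.
PASS, SUSPECT, FAIL = 1, 3, 4


def envelope_flags(values, tolerance, suspect_count, fail_count):
    """Flag point n when every one of the preceding `count` points lies
    within `tolerance` of the value at n (an envelope around point n)."""
    values = np.asarray(values, dtype=float)
    flags = np.full(values.shape, PASS, dtype=int)
    # Apply SUSPECT first so the longer FAIL window can overwrite it.
    for count, flag in ((suspect_count, SUSPECT), (fail_count, FAIL)):
        for n in range(count, len(values)):
            previous = values[n - count : n]
            if np.all(np.abs(previous - values[n]) < tolerance):
                flags[n] = flag
    return flags


# Hypothetical series: flat, a hump, then a slightly different level.
series = [10, 10, 10, 10, 10, 10, 11.5, 10.8, 10.8, 10.8]
flags = envelope_flags(series, tolerance=1.0, suspect_count=3, fail_count=5)
print(flags.tolist())  # [1, 1, 1, 3, 3, 4, 1, 4, 4, 4]
```

The top of the hump (11.5) is far enough from the flat section to pass, but the settled value (10.8) is within tolerance of both the flat section and the hump, so the flag jumps straight back to FAIL.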
There's another way to do this: use a rolling window with its endpoint at point `n`, and calculate the range of values in the window (`abs(max-min)`, also known as the "peak-to-peak" range). If that range is within `tolerance`, then the point is flagged.
In this example, after it goes over the "hump", the range of values in the window covering the 24 hours before point `n` still exceeds the threshold, so the point passes the test.
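A minimal sketch of the rolling-range check, for illustration only and not the code in this PR. It assumes count-based windows that include point `n`, a strict `<` comparison, NumPy >= 1.20 for `sliding_window_view`, and the usual QARTOD flag values; the function name and the sample series are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

# Illustrative flag values, following the QARTOD convention.
PASS, SUSPECT, FAIL = 1, 3, 4


def range_flags(values, tolerance, suspect_count, fail_count):
    """Flag point n when the range (max - min) of the window ending at n,
    covering n and the `count` points before it, is below `tolerance`."""
    values = np.asarray(values, dtype=float)
    flags = np.full(values.shape, PASS, dtype=int)
    # Apply SUSPECT first so the longer FAIL window can overwrite it.
    for count, flag in ((suspect_count, SUSPECT), (fail_count, FAIL)):
        size = count + 1  # the window includes point n itself
        if len(values) < size:
            continue
        windows = sliding_window_view(values, size)
        is_flat = np.ptp(windows, axis=1) < tolerance
        # windows[i] ends at index i + count, so align with flags[count:].
        flags[count:][is_flat] = flag
    return flags


# Hypothetical series: flat, a hump, then a slightly different level.
series = [10, 10, 10, 10, 10, 10, 11.5, 10.8, 10.8, 10.8]
flags = range_flags(series, tolerance=1.0, suspect_count=3, fail_count=5)
print(flags.tolist())  # [1, 1, 1, 3, 3, 4, 1, 1, 1, 3]
```

While the hump is still inside the window the range exceeds the tolerance, so the points pass. The flag only returns to SUSPECT once the window no longer contains the old flat level.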
For more "traditional" flat line scenarios, the two implementations behave exactly the same.
Note: this PR supersedes the following PRs: