pvlib / pvanalytics

Quality control, filtering, feature labeling, and other tools for working with data from photovoltaic energy systems.
https://pvanalytics.readthedocs.io
MIT License

Values outside QC test domain are incorrectly flagged as in error #191

Closed AdamRJensen closed 3 weeks ago

AdamRJensen commented 1 year ago

I've found that the QC checks in pvanalytics flag values outside the testing domain as erroneous, which I strongly believe goes against the purpose of having a testing domain.

As an example, the BSRN comparison and diffuse ratio tests are stated to be performed only for GHI > 50 W/m^2. The reason for this threshold is that the uncertainty at lower irradiance is too high to perform the test reliably. This is NOT saying that values lower than 50 W/m^2 should be flagged.

If values lower than 50 W/m^2 were meant to be flagged, it certainly would not make sense for the BSRN lower limits on the individual components to be -4 W/m^2. For example, the data point below, which matches the closure equation perfectly (i.e., GHI = DHI + DNI*cos(Z)), is flagged as erroneous simply because GHI is below 50 W/m^2.

    import pandas as pd
    from pvanalytics.quality.irradiance import check_irradiance_consistency_qcrad

    # Closure holds exactly: GHI = DHI + DNI*cos(80 deg) = 40 + 0 = 40
    qcrad_consistency_flag = check_irradiance_consistency_qcrad(
        ghi=pd.Series([40.0]),
        solar_zenith=pd.Series([80.0]),
        dhi=pd.Series([40.0]),
        dni=pd.Series([0.0]))

I have a series of QC checks that I would like to add, but this issue is a blocker for now.

This issue is related to https://github.com/pvlib/pvanalytics/issues/188#issuecomment-1481728592 and https://github.com/pvlib/pvanalytics/pull/167#issuecomment-1373484368

cwhanse commented 1 year ago

@AdamRJensen if the BSRN tests aren't applied for GHI < 50 W/m2, what do the tests return for such values? I assume they don't roll up to a True/False flag, or do they?

I don't think False (in pvanalytics) should be interpreted as erroneous, it means not True, where True means the value passes the test. We could say more in the docstrings than "True where value passes limits test."

AdamRJensen commented 1 year ago

> @AdamRJensen if the BSRN tests aren't applied for GHI < 50 W/m2, what do the tests return for such values? I assume they don't roll up to a True/False flag, or do they?
>
> I don't think False (in pvanalytics) should be interpreted as erroneous, it means not True, where True means the value passes the test. We could say more in the docstrings than "True where value passes limits test."

^I think this is a problematic way of implementing QC tests and where the real issue lies!

I think my concern is best explained with an example. Consider two hypothetical tests from two different authors, each aiming to identify a different issue. The first test is suitable for SZA < 75 and the second for SZA >= 75. Since their domains do not overlap, one of the two flags will always be False (following the pvanalytics convention), so there are only three possible outcomes: (True, False), (False, True), or (False, False). Once I've applied all my QC tests, I want to combine them and reach a final decision for each data point. As I only want to use data points that didn't fail any test, I combine the flags with the "AND" operator, but for the example above the result will always be False, no matter how good the data point is.
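
A minimal sketch of the problem (hypothetical test functions standing in for real QC tests, not the pvanalytics API):

    import pandas as pd

    sza = pd.Series([30.0, 80.0])           # one point in each test's domain
    data_is_good = pd.Series([True, True])  # both points are actually fine

    # Current convention: False both for "failed" and for "outside my domain"
    flag_a = data_is_good & (sza < 75)   # test A only applies for SZA < 75
    flag_b = data_is_good & (sza >= 75)  # test B only applies for SZA >= 75

    combined = flag_a & flag_b  # keep only points that pass every test
    print(combined.tolist())    # [False, False] -- every point is rejected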

I think the flawed part of the pvanalytics QC implementation is that it follows strategy (1), determining whether a data point is valid (defaulting to False), instead of strategy (2), determining whether a data point is suspicious (defaulting to True). QC tests are developed to find suspicious data points, not to validate data, so strategy (2), flagging of suspect data, should be applied.

@cwhanse I'm having a hard time conveying this information and would be happy to have a quick chat about it.

kandersolar commented 1 year ago

Put another way, the problem is that we are trying to fit three states (pass, fail, N/A) into a two-state variable.

> I don't think False (in pvanalytics) should be interpreted as erroneous, it means not True, where True means the value passes the test. We could say more in the docstrings than "True where value passes limits test."

I suspect most users will not grasp this distinction. I think most will assume "False = Bad", and I'm skeptical that including careful wording in the docstrings would be enough to disabuse them of this notion.

My first thought was to return np.nan for inputs outside the test domain, but pandas doesn't accept mixed boolean/nan masks with .loc (which I assume is the primary use case for these outputs): ValueError: Cannot mask with non-boolean array containing NA / NaN values.
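
For reference, a quick reproduction of that pandas behavior (a mixed boolean/NaN mask ends up with object dtype, which .loc rejects):

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, 2.0, 3.0])
    mask = pd.Series([True, np.nan, False])  # object dtype, not boolean

    s.loc[mask]  # ValueError: Cannot mask with non-boolean array
                 # containing NA / NaN values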

In the absence of first-class ternary functionality in pandas, I think I'd support switching to a convention of "False means failed; True means passed or N/A". But maybe we should poll users on how they want to be able to use these functions.

cwhanse commented 1 year ago

This would be a major design change, and I see the arguments in its favor. @kperrynrel can you discuss this with your colleagues?

The proposal is to return True for values outside the filter boundary, i.e., values for which the test doesn't apply. Currently, filter functions return False for such values.

AdamRJensen commented 1 year ago

@kperrynrel @cwhanse Here's an idea for perhaps a middle ground and, if anything, a solution that can let the addition of new tests proceed without causing any breaking changes.

I propose to introduce an argument that lets the user decide what fill value to use outside the test domain. The existing tests return False for values outside the test domain, so this can be the default fill value, thus retaining the current functionality. It would, at the same time, provide flexibility and let users choose another fill value, e.g., True (as per my proposal) or nan.
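
A minimal sketch of the idea (the function name, parameter name, domain threshold, and ratio limit below are illustrative assumptions, not the pvanalytics API):

    import numpy as np
    import pandas as pd

    def check_dhi_ratio_demo(dhi, ghi, fill_value=False):
        """Toy diffuse-ratio test, only defined for GHI > 50 W/m^2.

        Points outside the test domain are assigned ``fill_value``:
        False reproduces the current behavior, while True or np.nan are
        the proposed alternatives. Illustrative only, not the real API.
        """
        in_domain = ghi > 50
        passed = (dhi / ghi) <= 1.05
        # np.where accepts False, True, or np.nan as the fill; note that
        # np.nan turns the result into a float series instead of boolean
        return pd.Series(np.where(in_domain, passed, fill_value),
                         index=ghi.index)

    ghi = pd.Series([40.0, 800.0])
    dhi = pd.Series([40.0, 100.0])
    print(check_dhi_ratio_demo(dhi, ghi).tolist())                   # [False, True]
    print(check_dhi_ratio_demo(dhi, ghi, fill_value=True).tolist())  # [True, True]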

The concept is based on a discussion with @kandersolar, though he may not endorse it.

cwhanse commented 1 year ago

Good idea - fill_value, out_of_bounds, something like that would work.