hubverse-org / flusight_hub_archive

Hubversion of FluSight 1 (2015-2019)
MIT License

How to handle invalid values #11

Closed lmullany closed 1 month ago

lmullany commented 2 months ago

There are numerous submissions that would not have passed basic validation checks had such checks been in place at the time of submission.

For example, we have model submissions where the CDF max exceeds 1, or the PMF does not sum to 1. Checking these requires some tolerance for precision differences (e.g., 1 ≈ 1.000001), but there are examples where the max or sum deviates from 1 by a substantial margin.

A summary of how many submissions fail these checks would be helpful.
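
For concreteness, here is a minimal sketch of the kind of checks meant here, assuming a hubverse-style table with `output_type` and `value` columns and the usual task-ID columns (all column and object names are assumptions for illustration, not this repo's confirmed schema):

```r
library(dplyr)

tol <- 1e-6  # illustrative tolerance for floating-point noise

# PMF groups whose probabilities do not sum to 1 within the tolerance
pmf_failures <- model_output |>
  filter(output_type == "pmf") |>
  group_by(model_id, forecast_date, location, target) |>
  summarise(pmf_sum = sum(value), .groups = "drop") |>
  filter(abs(pmf_sum - 1) > tol)

# CDF groups whose maximum value exceeds 1 beyond the tolerance
cdf_failures <- model_output |>
  filter(output_type == "cdf") |>
  group_by(model_id, forecast_date, location, target) |>
  summarise(cdf_max = max(value), .groups = "drop") |>
  filter(cdf_max > 1 + tol)
```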

nickreich commented 2 months ago

@annakrystalli do you know how the current validations would handle a situation where there are small precision differences?

lmullany commented 2 months ago

[image: screenshot summarizing deviations of PMF sums from 1]

So, we have about 3% of ~80,000 probability mass functions that sum to a value deviating from 1 by 0.001 or more. I'll pull together a description of those.
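
A rough sketch of how such a summary could be tabulated, again assuming hubverse-style columns (illustrative only):

```r
library(dplyr)

# Fraction of PMF groups whose sum deviates from 1 by at least 0.001
model_output |>
  filter(output_type == "pmf") |>
  group_by(model_id, forecast_date, location, target) |>
  summarise(pmf_sum = sum(value), .groups = "drop") |>
  summarise(
    n_pmfs   = n(),
    n_off    = sum(abs(pmf_sum - 1) >= 0.001),
    prop_off = n_off / n_pmfs
  )
```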

annakrystalli commented 2 months ago

Validations use `dplyr::near()` to determine equality. The tolerance is machine specific, determined by `tol = .Machine$double.eps^0.5`, so any file where the sums deviate from 1 by more than that tolerance would fail.
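
For reference, a small illustration of that equality check (the default `tol = .Machine$double.eps^0.5` is roughly 1.5e-8 on typical machines):

```r
library(dplyr)

tol <- .Machine$double.eps^0.5  # ~1.5e-8 on most machines

near(1 + 1e-9, 1, tol = tol)    # TRUE: within tolerance
near(1.000001, 1, tol = tol)    # FALSE: this PMF sum would fail
near(sum(c(0.2, 0.3, 0.5)), 1, tol = tol)  # TRUE despite floating-point error
```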

lmullany commented 2 months ago

Okay, so when we use `near()` the exact cutoff depends on the machine, but the failure rate is likely to be somewhere around ~10% of PMFs.

elray1 commented 1 month ago

Original thought: at minimum, it seems like we should document the specific validation failures that would impact analyses people want to do, and suggest mitigations. For example, if probabilities don't sum to 1, people using the data run the risk of doing invalid analyses. We'll want to either correct that or document it so that people using the data know they need to correct it themselves.

Revised idea after discussion: we could put the original data in a folder like `data-raw` in the GitHub repo, with processed, valid data living in `model-output`.

elray1 commented 1 month ago

Here's the guidance that was provided for the 2015/2016 challenge: "The probabilities for each prediction for each milestone should be positive and sum to 1. If the sum is greater than 0.9 and less than 1.1, the probabilities will be normalized to 1.0. If any probability is negative or the sum is outside of that range, the forecast will be discarded."

So maybe we could replicate that process in our cleanup here?
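
A minimal sketch of what replicating that rule could look like in the cleanup step, again using assumed hubverse-style column names rather than this repo's actual code:

```r
library(dplyr)

# Per the 2015/2016 guidance:
#  - discard a PMF if any probability is negative or its sum falls outside (0.9, 1.1)
#  - otherwise rescale the probabilities so they sum exactly to 1
cleaned <- model_output |>
  filter(output_type == "pmf") |>
  group_by(model_id, forecast_date, location, target) |>
  mutate(
    pmf_sum = sum(value),
    keep    = all(value >= 0) & pmf_sum > 0.9 & pmf_sum < 1.1
  ) |>
  ungroup() |>
  filter(keep) |>
  mutate(value = value / pmf_sum) |>
  select(-pmf_sum, -keep)
```

Whether discarded forecasts should then live in something like `data-raw` (per the earlier suggestion) rather than being dropped entirely would still need a decision.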