jarad / FluSight

An R package containing functions used in the CDC Flu Forecasting competition
GNU General Public License v3.0
12 stars 7 forks source link

Verify_entry error hierarchy #5

Open craigjmcgowan opened 7 years ago

craigjmcgowan commented 7 years ago

Do we want to think about having the verify_entry program check errors hierarchically and stop when it finds an error? I created a test to check for the correct structure in all the columns except "value" by substituting invalid text for one column. While the function found the error, it also returned a lot more errors from later checks since the entry couldn't be sorted on that column and later checks depend on the entry being properly sorted. This results in a lot of error text that's uninformative, since the actual error is just in the improper values in one column.

I think we could address this by making the errors hierarchical in this order: column names, non-value columns, negative probabilities, probabilities not summing to 1. Any error would stop the check since an error at that level affects the validity of later errors. I can make these edits if you agree and add some more tests to the appropriate file.

jarad commented 7 years ago

I was wondering about this as well and I think it makes a lot of sense and yes, my current version does not do this. The only aspect I do not like is that a hierarchical structure can be frustrating as a user because you never know when you will finally get to a correct entry. But, as currently implemented, there are lots of errors that are uninformative.

Initially I was thinking the check would essentially be one large function (which would call small functions), but perhaps if we added a test to each individual small function that tests for exactly what that functions expects, it will make the code more readable and modifiable. For example, the function that checks column names would naturally fit in the arrange_entry function since that function does the arranging on that column.

craigjmcgowan commented 7 years ago

That sounds like a good idea to me. I can make some headway on that this afternoon and can update the code before the end of the day.

craigjmcgowan commented 7 years ago

I rearranged the errors so that mismatched column names are checked first in arrange_entry, then column structure is checked in verify_structure, and then probabilities are checked in verify_probabilities. I also added some code to check for missing probabilities as well as a new function verify_point to check the point predictions. That returns an error if the prediction is negative but just a warning if missing, as the point predictions aren't required.

I also added checks for the verify_probabilities and verify_point functions. I had the checks iterate through all possible location/target combinations, but if you think that's overkill they can be simplified.

jarad commented 7 years ago

Ideally I think all of these checks compare to inst/extdata/valid-test.csv. That way, in the future, when valid-test.csv changes, there will be relatively little modification to the code.