support truth "diff" uploads

matthewcornell commented 3 years ago

Change truth uploads (and maybe make it an option for non-truth forecast uploads?) to support uploading only "diffs," i.e., only the retracted/updated/new prediction elements. This involves relaxing this forecast version rule:

New forecast versions cannot imply any retracted prediction elements in existing versions, i.e., you cannot load data that's a subset of the previous forecast's data.

This issue is based on @nickreich's comment:

re: truth upload jobs being large. I just wanted to throw out one small idea here. These jobs are big because the truth data is massive. And due to current "duplicate forecast" rules that require any updated forecast to also upload all original data, we need to upload all duplicate data, which makes these jobs MASSIVE. This makes sense for individual forecasts, but I wonder if maybe it is in the way for truth forecast uploads. E.g., if we relaxed the duplicate forecast upload requirement for truth data (maybe this is too big an ask of the forecast updating logic?!), then we could have a lighter-weight (at least on the server) workflow that could go like this:

1. input to a local function a large new truth file that is ready to be uploaded
2. download current truth
3. compare current truth with the large new truth file, extract only the "diff" (i.e. retracted, updated and new observations)
4. upload only the retracted/updated/new forecasts, with implicitly all the rest of the observations staying the same.
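The compare step in the workflow above could be sketched roughly as follows. This is only an illustration, not Zoltar code: the function name `truth_diff`, the dict-based representation of truth rows, and the example keys are all assumptions made for the sketch, and a retraction is represented as `None`.

```python
from typing import Dict, Hashable, Optional

# Hypothetical representation: each truth observation is keyed by
# (timezero, unit, target) and maps to its observed value.
Truth = Dict[Hashable, float]


def truth_diff(current: Truth, new: Truth) -> Dict[Hashable, Optional[float]]:
    """Return only the rows that differ between `current` and `new`.

    A value of None marks an explicit retraction, i.e. a key that is
    present in `current` but absent from `new`.
    """
    diff: Dict[Hashable, Optional[float]] = {}
    for key, value in new.items():
        if key not in current or current[key] != value:
            diff[key] = value  # new or updated observation
    for key in current:
        if key not in new:
            diff[key] = None  # retracted observation
    return diff


# Made-up example rows (keys and values are illustrative only).
k1 = ("2021-01-02", "unit-a", "target-a")
k2 = ("2021-01-09", "unit-a", "target-a")
k3 = ("2021-01-16", "unit-a", "target-a")
k4 = ("2021-01-23", "unit-a", "target-a")

current = {k1: 100.0, k2: 120.0, k3: 90.0}
new = {k1: 100.0, k2: 125.0, k4: 80.0}

diff = truth_diff(current, new)
print(diff)
```

Only `k2` (updated), `k4` (new), and `k3` (retracted) end up in the upload; the unchanged `k1` row is left out.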

matthewcornell commented 3 years ago

Tagging @nickreich and @elray1 with these questions:

1. Is this literally as simple as not calling the internal function utils.forecast._is_pred_eles_subset_prev_versions() that does the check? Is there some other constraint/check we should do instead?
2. Why shouldn't we also disable this rule for non-truth forecasts as well?

elray1 commented 3 years ago

My takes at answers:

1. Probably
2. I think we didn't want forecasters to "implicitly" retract past forecasts -- we want any retractions to be explicitly indicated with NULL values. (And in practice, retractions used to be handled by just omitting rows from CSV files, which is different from the new intent here of "omitted rows don't change". So maybe this rule is just to help prevent confusion.)
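To make the two meanings of an omitted row concrete, here is a minimal sketch contrasting them. Both function names and the dict representation are hypothetical, with `None` standing in for an explicit NULL retraction; none of this is taken from the Zoltar code base.

```python
from typing import Dict, Hashable, Optional

Rows = Dict[Hashable, Optional[float]]


def apply_as_diff(previous: Rows, upload: Rows) -> Rows:
    """'Diff' semantics: omitted rows keep their previous value, and a
    row is retracted only when it is uploaded explicitly as None."""
    merged = dict(previous)
    for key, value in upload.items():
        if value is None:
            merged.pop(key, None)  # explicit retraction
        else:
            merged[key] = value  # new or updated row
    return merged


def apply_as_full_file(previous: Rows, upload: Rows) -> Rows:
    """Older CSV-style semantics: the upload is the complete new state,
    so any row omitted from it is implicitly retracted."""
    return {key: value for key, value in upload.items() if value is not None}


previous = {"row-1": 1.0, "row-2": 2.0, "row-3": 3.0}
upload = {"row-2": 2.5, "row-3": None}  # row-1 is omitted

as_diff = apply_as_diff(previous, upload)
as_full = apply_as_full_file(previous, upload)
print(as_diff)
print(as_full)
```

The same upload keeps `row-1` under the diff reading and drops it under the full-file reading, which is the confusion the rule may have been guarding against.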

matthewcornell commented 3 years ago

Thanks, Evan. I guess what I'm trying to get at is what it is about truth files (i.e., oracle forecasts) that's intrinsically/semantically different from non-truth forecasts such that relaxing the restrictions makes sense for the former but not the latter. For example, would truth never have retractions?

nickreich commented 3 years ago

I think mostly it's just a practical thing. We don't expect real forecasts to be repeatedly updated with large numbers of duplicated rows. But maybe I'm missing a bigger picture thing. I guess from some perspective you can't retract the truth.

elray1 commented 3 years ago

i could imagine it might be reasonable to lift this restriction for Zoltar, and make the strict rule a validation that's done on the hub repository? for the csv files in the hub repo, i think we definitely don't want this kind of behavior where a partial file means "leave previous values not represented in this file as they were". but maybe the validation logic for hub files and Zoltar uploads doesn't need to be the same.

matthewcornell commented 3 years ago

@elray1 are you saying that no duplicate detection should be done on the server side? Saving wasted space was a big motivator, right? And trusting the client - hmm.

elray1 commented 3 years ago

No, I think we should do the duplicate detection and removal on the server side.

I was saying that it might make sense to not have a different behavior for models and truth data in this respect, so that for both, we would allow "partial updates" that only contain the diff. I don't have a strong opinion about this, but basically wouldn't necessarily rule it out.
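One way to picture server-side duplicate removal working together with partial updates is the sketch below. It is an assumption-laden illustration rather than how Zoltar does it: `drop_duplicate_rows` and the dict representation are invented for the example, and `None` again stands for an explicit retraction.

```python
from typing import Dict, Hashable, Optional

Rows = Dict[Hashable, Optional[float]]


def drop_duplicate_rows(stored: Rows, upload: Rows) -> Rows:
    """Keep only the uploaded rows that would change what is stored.

    The client is not trusted to send a clean diff, so the server
    filters the upload itself before saving a new version.
    """
    kept: Rows = {}
    for key, value in upload.items():
        if key in stored and stored[key] == value:
            continue  # identical to the stored row: a duplicate
        if key not in stored and value is None:
            continue  # retraction of a row that was never stored
        kept[key] = value
    return kept


stored = {"row-a": 1.0, "row-b": 2.0}
upload = {"row-a": 1.0, "row-b": 3.0, "row-c": None, "row-d": 4.0}

to_save = drop_duplicate_rows(stored, upload)
print(to_save)
```

With this kind of filter, a client could send either a full file or just the diff and the server would store the same rows in both cases.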

reichlab / forecast-repository

support truth "diff" uploads #319