create new prediction_types for mean and median

nickreich commented 1 year ago

To try to bring Zoltar closer to our current thinking about how to represent point forecasts, we should add mean, median, and perhaps mode to the list of valid prediction types. However, this raises a question of what target_types are valid for each of these new prediction_types.

This is the current table from Zoltar:

continuous, discrete and date target types: all three are valid options.
nominal: only valid for mode (mean and median are numeric operations that don't apply to nominal outcomes)
binary: only valid for mode and median
- the data_type for a binary prediction is boolean, therefore a mode and median mean the same thing in this context: the value (0 or 1) that appears more often in a vector of 0s and 1s
- a mean would imply a continuous value between 0 and 1. if a probability were desired as the outcome, then a continuous target type would be used with limits of 0 and 1.

tagging @elray1 and @annakrystalli for additional input.

nickreich commented 1 year ago

Noting that the idea here would be to

add mean, median and perhaps mode prediction types to zoltar, leaving point prediction type as legit for legacy projects.
make it so that new projects cannot use point target types.
as a later task, consider migrating some point data to the other new prediction types.

Noting that this is related to https://github.com/Infectious-Disease-Modeling-Hubs/hubData/issues/10

matthewcornell commented 7 months ago

Capturing this as a table: This will be the new "Valid prediction types by target type" table:

matthewcornell commented 7 months ago

Around this line in the function _load_truth_data() you will see that we use the 'point' prediction type for the prediction dictionary data that generated from incoming truth rows. (This dictionary is then loaded into the database via load_predictions_from_json_io_dict().) Since we are retiring point types, how do we handle/represent truth with our three point replacements (mean, median and perhaps mode)?

nickreich commented 7 months ago

@matthewcornell The table looks correct to me.

Your question about representation of truth data is a good one. Given that truth data are a separate kind of data stored in Zoltar, I'm not sure it really matters what type we store them as. Honestly, we could keep them stored as "point", and have them be the only thing that could use that prediction type? Or we could just pick one of "mean", "median" or "mode" and use that from here on out? (would that leave backwards compatibility issues for truth data stored in a certain way in the system now?) I'm imagining that we might need to pick one prediction type and store all truth data in the system as that type? Maybe for that reason we should choose "mode" as it is valid for all target types, as is "point"?

matthewcornell commented 7 months ago

@nickreich WRT the underlying truth representation ("point" vs. ...) I don't think there will be an external impact (queries come back as CSV with a catch-all 'value' column for "point" and - with this change - the other three), so I'm ok staying with "point" or switching to "mode". I suggest we use the latter so that we are moving away from "point". Yes, that will mean old truth data is "point" and new is "mode", but we can change the former to latter if/when we do your item #3 above (consider migrating some point data to the other new prediction types).

matthewcornell commented 7 months ago

We will need to review constraints specified in https://docs.zoltardata.com/validation/ - which apply to the three new types? E.g., are they the same as point?

matthewcornell commented 7 months ago

We will also need to review conversions from/to the three new types here: https://docs.zoltardata.com/forecastqueryformat/#automatic-prediction-type-conversion . Again, same as point?

nickreich commented 7 months ago

The Point prediction elements constraints could be the same for mean, median and mode.

In the Tests for Prediction Elements by Target Type section, I think all of the uses of "point" prediction type could just be replaced by the "mean, median, and mode" types.

We want the same functionality, validation and constraints for the three new types as we had for point.

The conversions also seem more or less straight forward. My suggestion might be to aspirationally support the following conversions:

mean <- named, sample
median <- named, sample
mode <- named

I intentionally left out the conversion of sample -> mode as there could be complications here for continuous outcomes. So I vote for just leaving it off for now.

And I see that we have some existing conversions supported

conversions: point <- sample (uses either statistics.mean() or statistics.median() depending on the convert.point option)

Which I think could just be translated as the functions to convert from sample to mean and median types.

matthewcornell commented 7 months ago

I'm making progress, but I think it will be clean to do this via three implementation phases:

phase one: add the three new types. keep the point type (this issue)
phase two: depreciate the point type. disable it for new forecasts. change _load_truth_data() from "point" to "mode". replace it in tests with mode. do not convert existing forecasts (issue #377)
phase three: retire the point type. remove it from the database and code. convert existing forecasts from point to mode (will not be doing this phase)

Thoughts?

nickreich commented 7 months ago

This sounds like a good approach to me overall. I think phase one and two are nicely distinct and clearly we want to do them, in that order.

I am not yet convinced that we want to do phase three, as it would require "relabeling" prior data in places where we cannot be certain what the original intent or data type should be. So maybe we could get away with just keeping it as a vestigial artifact and never actually retiring/removing "points" at all?

matthewcornell commented 7 months ago

@nickreich , in this section of the docs, I think # 1 should say "convert FROM" instead of "convert TO", agreed?:

Auto-conversion is activated via two query parameters:

Specify the forecast types (# 5 above) that you want the program to convert TO (the default, if unspecified is all types).

Include query options for the conversions you want. Including an option does two things: a) it tells Zoltar that it should convert TO that type (i.e., that conversion is activated), and b) possible details about how to do the conversion.

nickreich commented 7 months ago

@matthewcornell I'm confused about what the "automatic conversion" means. Are some predictions converted automatically? do we specify whether or not they are converted somewhere? Could you point to an example in Zoltar where this is happening (or configured) currently?

matthewcornell commented 7 months ago

I agree that "auto" is misleading; I will change it to just be "conversion".
Conversion happens at the user's explicit request via query options. For example, {... "options": {"convert.mean": true}} ("convert.mean" is a new option).

nickreich commented 7 months ago

Regarding your question about FROM vs. TO, I just don't have enough of a sense from the documentation what is actually happening with the conversion to answer your question. Maybe we should just try to re-write that part of the documentation to more clearly reflect the action that is being taken?

matthewcornell commented 7 months ago

@nickreich Zoltpy provides the (somewhat experimental) feature of bulk import/export specified in this issue: [investigate support of bulk zoltar queries #321] - docs here. IIRC Martha tried using it and it wasn't viable (maybe due to trouble accessing Python via R?), and so my question is whether it's worth my updating that part of the code to use the current issue's new prediction types. I suggest we remove the code from both zoltpy and the server unless we think we'll use it again. Thoughts?

nickreich commented 7 months ago

I could maybe see us using this again at some point in the future, but I agree that it feels vestigial at this point and could be removed for now.

reichlab / forecast-repository

create new prediction_types for mean and median #367