hubverse-org / schemas

JSON schemas for modeling hubs
Creative Commons Zero v1.0 Universal

Introduce `"mode"` output_type #40

Open annakrystalli opened 1 year ago

annakrystalli commented 1 year ago

Opening this issue to move discussions on this topic to the repo.

From slack:

@nickreich :

[5 days ago] How would people feel about adding an output_type of "mode" to the other existing types? This came up today in a conversation with @annakrystalli as it seems like a possibly natural form of a point estimate for a categorical target. E.g. a "mean" or "median" wouldn’t make sense. I will note that the mode could be extracted from the representation of a probability mass function for a categorical outcome, but that would require a probabilistic forecast. If you like the idea, please just add a :white_check_mark: . If you have questions or comments or objections, please add a note here. Thanks!

One comment on this after discussing briefly with Evan is that the tabular data representation would maybe be kind of ugly, e.g. since we can only have numeric objects in the “value” column, maybe it would look something like this?

output_type  type_id                   value
"mode"       ["cat1", "cat2", "cat3"]  [0,1,1]

where the type-id is an array of the possible values of the categorical variable and the array in value would be indicating which value(s) are the mode? Or maybe this would need to be spread over two rows, to keep value purely numeric?

annakrystalli commented 1 year ago

Response by @elray1

Maybe another option could be to allow submitters to only include the rows that are modes. I mentioned this to Nick earlier, but I remember us discussing something similar at some point in the past on a call where we were talking about dates. I can't remember the context clearly enough to think of what to look up, but there may be discussion somewhere in a GitHub issue on the schemas or hubDocs repo?

annakrystalli commented 1 year ago

Response by @nickreich

Those suggestions make sense to me, so maybe something like

output_type  type_id  value
"mode"       "cat2"   1

or in a multimodal case

output_type  type_id  value
"mode"       "cat2"   1
"mode"       "cat3"   1

annakrystalli commented 1 year ago

Comment by @nickreich

Feels like if we really wanted to support this, we'd then have to add some special handling for these cases. Maybe we file this as a feature request for the future for now? Include mode as a data type but basically don't handle it for these cases yet?

annakrystalli commented 1 year ago

In general I support the introduction of mode as a valid statistical point parameter to submit.

I do feel, however, that the changes required to accommodate categorical variables, whether forcing value to be a character column or mapping integers to categories (as suggested in #39), might be more effort than they are worth.

I just wanted to point out that it's really easy to get the mode(s) from a PMF accurately through a simple hub_connection query. See pseudo-example below:

set.seed(1)
# pseudo-hub connection to data:
# 100 multinomial draws over 10 categories, converted to proportions
hub_connection <- tibble::tibble(
    output_type = "pmf",
    type_id = as.character(1:10),
    value = as.vector(rmultinom(1, 100, runif(10)) / 100)
)

hub_connection
#> # A tibble: 10 × 3
#>    output_type type_id value
#>    <chr>       <chr>   <dbl>
#>  1 pmf         1        0.03
#>  2 pmf         2        0.05
#>  3 pmf         3        0.12
#>  4 pmf         4        0.16
#>  5 pmf         5        0.05
#>  6 pmf         6        0.16
#>  7 pmf         7        0.2 
#>  8 pmf         8        0.17
#>  9 pmf         9        0.06
#> 10 pmf         10       0

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# keep the row(s) with the highest probability, i.e. the mode(s)
hub_connection %>%
    filter(output_type == "pmf",
           value == max(value))
#> # A tibble: 1 × 3
#>   output_type type_id value
#>   <chr>       <chr>   <dbl>
#> 1 pmf         7         0.2

Created on 2023-04-10 with reprex v2.0.2
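
The same filter can be extended to return the "one row per mode" layout suggested earlier in the thread, including ties. The following is only a minimal sketch: the `pmf_output` table, the `location` task ID column and the category names are made up for illustration and are not part of any hub schema.

```r
library(dplyr)

# Hypothetical pmf output for two modelling tasks.
# `location` is an assumed task ID column, used only for illustration.
pmf_output <- tibble::tibble(
    location    = rep(c("A", "B"), each = 3),
    output_type = "pmf",
    type_id     = rep(c("cat1", "cat2", "cat3"), times = 2),
    value       = c(0.2, 0.5, 0.3,   # location A: single mode (cat2)
                    0.2, 0.4, 0.4)   # location B: tie between cat2 and cat3
)

# Within each task, keep every category that attains the maximum probability,
# then relabel the rows as output_type "mode" with value 1.
modes <- pmf_output %>%
    filter(output_type == "pmf") %>%
    group_by(location) %>%
    filter(value == max(value)) %>%
    ungroup() %>%
    mutate(output_type = "mode", value = 1)

modes
```

Location A yields one row (cat2) and location B yields two rows (cat2 and cat3). The exact equality test is adequate here because the tied values are identical literals; probabilities that are computed rather than typed in might need a tolerance instead.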

On the other hand, getting an accurate mean, mode, or median of a continuous/discrete (count) distribution from quantiles or a CDF is not necessarily straightforward and depends on, e.g., the quantiles reported (please correct me if I'm wrong!). So it might make sense to be able to report the mode for such distributions, but it does not seem worth the effort for nominal/ordinal/binary variables, given the ease of obtaining it accurately from the PMF and the cost of accommodating its encoding.
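
To illustrate the dependence on the reported quantiles, here is a rough sketch that approximates a mean from two different quantile sets. It assumes an Exp(1) predictive distribution (true mean 1) purely as a stand-in. `approx_mean_from_quantiles()` is a hypothetical helper, not a function from any hubverse package, and the trapezoid rule over the reported probability levels is just one of several possible approximations.

```r
# Hypothetical helper: approximate the mean as the integral of a
# piecewise-linear quantile function over the reported probability levels
# (trapezoid rule). Mass outside [min(probs), max(probs)] is ignored.
approx_mean_from_quantiles <- function(probs, values) {
    sum(diff(probs) * (head(values, -1) + tail(values, -1)) / 2)
}

true_mean <- 1  # mean of an Exp(rate = 1) distribution

# Two different sets of reported quantile levels
coarse_probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
fine_probs   <- seq(0.01, 0.99, by = 0.01)

coarse_est <- approx_mean_from_quantiles(coarse_probs, qexp(coarse_probs))
fine_est   <- approx_mean_from_quantiles(fine_probs, qexp(fine_probs))

c(true = true_mean, coarse = coarse_est, fine = fine_est)
```

Under these assumptions the two estimates differ from each other and from the true mean, because the result depends on how far into the tails the reported quantiles reach and on how the quantile function is interpolated between them.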

nickreich commented 1 year ago

I'm basically on board with the idea that it's "not worth the effort" at this time. If we were more focused on non-probabilistic forecasts, then I might lobby harder for it, but given that so much of what we do has a probabilistic slant and that as @annakrystalli points out you can obtain a mode (which usually we might only want for a categorical outcome) from the natural probabilistic encoding for categorical variables, then I feel that this is less important for now.