janezd / baycomp

`HierarchicalTest`: Better(?) priors for the data set means #12

Closed: dpaetzel closed this issue 2 years ago

dpaetzel commented 2 years ago

Here and here (Stan implicitly samples this from a uniform distribution, which is the intended behaviour as described in the publications on the hierarchical model), the prior on delta is set to Uniform(-max(abs(x)), max(abs(x))). In the publications on the hierarchical model, this prior is Uniform(-1, 1), presumably because the metric under consideration is accuracy, whose maximum is 1 (although that does not explain the lower bound of -1).

My question is: wouldn't Uniform(min(x), max(x)) be better suited in general than the current choice of Uniform(-max(abs(x)), max(abs(x)))? Many metrics are asymmetric around 0.

Because of that, using Uniform(-max(abs(x)), max(abs(x))) as the default prior on delta seems unnatural to me, but I may be mistaken.

A possible way to handle this flexibly is to introduce another parameter to HierarchicalTest.sample, named e.g. data_set_mean_prior, that covers the most commonly needed cases but also lets users specify the lower and upper bounds themselves:

import numpy as np

# diff holds the per-data-set score differences between the two algorithms.
# The original choice.
if data_set_mean_prior == "symmetric-max":
    delta_upper = np.max(np.abs(diff))
    delta_lower = -delta_upper
# If only positive values are sensible, this may be a better choice.
elif data_set_mean_prior == "min-max":
    delta_lower = np.min(diff)
    delta_upper = np.max(diff)
# If the lower bound is known to be zero.
elif data_set_mean_prior == "zero-max":
    delta_lower = 0
    delta_upper = np.max(diff)
# User-specified bounds as a (lower, upper) tuple.
elif isinstance(data_set_mean_prior, tuple):
    delta_lower, delta_upper = data_set_mean_prior
else:
    raise ValueError("data_set_mean_prior has to be one of "
                     "\"symmetric-max\", \"min-max\", \"zero-max\" "
                     "or of type tuple")

What do you think? Should I create a PR for this?

gcorani commented 2 years ago

Hi David, thanks for your comment. In the paper I was considering the difference of accuracy between classifiers, which is bounded in [-1, 1]. Alternative bounds should be considered if the metric has a different scale, but [-max(abs(x)), max(abs(x))] seems sensible to me as long as we deal with paired differences between algorithms.
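
As a quick sanity check of that point (the accuracy arrays below are made up, not from the paper): for per-data-set accuracies in [0, 1], the paired differences necessarily lie in [-1, 1] and can take both signs, so a symmetric interval around 0 covers them.

import numpy as np

# Made-up per-data-set accuracies of two classifiers (both in [0, 1]).
acc_a = np.array([0.91, 0.78, 0.85, 0.60])
acc_b = np.array([0.88, 0.82, 0.80, 0.70])

diff = acc_a - acc_b            # paired differences, necessarily in [-1, 1]
bound = np.max(np.abs(diff))    # about 0.1 for these values

# Either classifier can win on a given data set, so the differences take both
# signs and the symmetric interval [-bound, bound] contains all of them.
assert np.all((diff >= -bound) & (diff <= bound))
assert -1 <= diff.min() <= diff.max() <= 1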

Does this address your question? (PS: I'll also check your other comments later.)

best, Giorgio

dpaetzel commented 2 years ago

Hi Giorgio, thanks a lot for your response. :slightly_smiling_face:

You're right, my mistake. [-1, 1] is the correct interval for differences of accuracy; I simply overlooked that the scale changes from [0, 1] to [-1, 1] as soon as you consider differences. The same goes for the general case, where [min(x), max(x)] changes to [-max(abs(x)), max(abs(x))].
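
For the general case, a one-off numerical check (with a made-up difference array) confirms that the symmetric default never excludes observed differences:

import numpy as np

# Any array of paired differences (made-up values).
x = np.array([-0.3, 0.1, 0.7, -0.05])

# [min(x), max(x)] is always contained in [-max(abs(x)), max(abs(x))].
assert -np.max(np.abs(x)) <= np.min(x)
assert np.max(x) <= np.max(np.abs(x))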

I'll close this issue.