reconsider the name `output_type_id`

elray1 commented 1 year ago

Dylan noted that this is a potentially confusing name for this column: someone familiar with relational databases who is coming to the hubverse for the first time would interpret it as the unique identifier for the output type (e.g. "quantile" = 1, "sample" = 2, etc.). He suggests perhaps something like "output_value_metadata".

annakrystalli commented 1 year ago

So while I'm not super excited by the idea of implementing such a change, I'm sympathetic to the motivation.

Personally I find output_value_metadata a bit too long and a little vague.

From the start I thought something like output type attribute (output_type_attr) might be a good option.

It's interesting also that the suggestion involves value and not output_type. Indeed I noticed that, in Italian, output_type_id had been translated as value_id (id_valore), which also made me consider whether output type ID is indeed a property of the output type or whether it should be considered a property/attribute of the value it relates to.

In this case we could have value_attr.

elray1 commented 1 year ago

I'm also sympathetic to the criticism of the output_type_id name, and I'm open to changing it.

I don't love the proposed names involving value for two reasons:

1. I find it helpful to get some orientation by thinking of the names for these quantities that we would use if we weren't trying to be so abstract: output_type = "sample": sample index; output_type = "quantile": quantile probability level; output_type = "pmf" or "cdf": bin label or target variable value. These things are framed as specifications of a detail about, or refinements to, the output_type.
2. Although I agree that these are ultimately specifying an attribute of the value within the row, I think they are doing so in essentially the same way that values of the task id variables are. In some sense, the value is the model's "solution" to the prediction task specified by all of the other columns. Names like value_attr, value_id, and output_value_metadata all feel very generic and like they could equally well describe any of the columns. I think we want something here that gets more specifically at how we're using this column in particular.
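
To make the first reason above concrete, here is a minimal sketch of how the meaning of output_type_id shifts with output_type. This is illustrative only and is not hubverse code: the row values, the `MEANING` table, and the `describe` helper are all made up for this example.

```python
# Hypothetical model-output rows; all values are invented for illustration.
rows = [
    {"output_type": "quantile", "output_type_id": 0.25, "value": 10.0},
    {"output_type": "quantile", "output_type_id": 0.75, "value": 18.0},
    {"output_type": "sample", "output_type_id": 1, "value": 12.0},
    {"output_type": "pmf", "output_type_id": "low", "value": 0.6},
    {"output_type": "cdf", "output_type_id": 15, "value": 0.7},
]

# The less abstract name for output_type_id under each output_type,
# taken from the list in the comment above.
MEANING = {
    "quantile": "quantile probability level",
    "sample": "sample index",
    "pmf": "bin label or target variable value",
    "cdf": "bin label or target variable value",
}


def describe(row):
    """Return a plain-language reading of one row's output_type_id."""
    return f"{MEANING[row['output_type']]} = {row['output_type_id']}"


descriptions = [describe(r) for r in rows]
```

In none of these rows does output_type_id act as a key identifying the output type itself, which is the relational-database reading described at the top of this thread.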

annakrystalli commented 1 year ago

Makes sense. How do you feel about output_type_attr?

elray1 commented 1 year ago

output_type_attr is ok by me

LucieContamin commented 1 year ago

Same as Anna, I'm not super excited by the idea of implementing this change, though I kind of understand the motivation. Personally, I have no issue with output_type_id, and I am ok with output_type_attr.

I don't think there is a "perfect" column name, and whatever name we choose might still be confusing for some users. A perhaps unrealistic example, just to illustrate that last point: attr in R can be understood as a way of tagging additional information, like metadata, which is not really the case here.

elray1 commented 1 year ago

continuing to brainstorm somewhat unsatisfying options: output_type_level?

elray1 commented 1 year ago

We decided to keep output_type_id, acknowledging that it is not perfect.

hubverse-org / schemas

reconsider the name `output_type_id` #60