Background
For a hub to be successfully accessed as an arrow dataset, column data types should not change from round to round. Task IDs that are covered by our schema generally shouldn't change data type in later rounds, as that's largely fixed by the schema. However, custom task IDs, which are beyond our control, and the output_type_id column have the potential to change, and this could cause problems downstream. This is mainly a problem for parquet files (but has a small chance of causing problems in csvs too).
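To make the problem concrete, here is a rough base R sketch of how a column's detected type can differ between rounds. It is only an illustration: the values are made up, infer_type() is a hypothetical helper, and type.convert() stands in for automatic type detection rather than reproducing what arrow or hubData actually do.

```r
# Illustrative output_type_id values as they might appear in two rounds.
round_1_ids <- c("0.25", "0.5", "0.75") # quantile levels only
round_2_ids <- c("0.5", "low", "high")  # a later round adds categories

# Hypothetical helper: infer a type from one round's values in isolation.
infer_type <- function(x) class(type.convert(x, as.is = TRUE))

type_round_1 <- infer_type(round_1_ids)
type_round_2 <- infer_type(round_2_ids)

# The inferred types disagree, so the rounds cannot share a detected type.
types_agree <- identical(type_round_1, type_round_2)

# Treating the column as character from the start works for every round.
all_ids <- c(as.character(round_1_ids), as.character(round_2_ids))
```

In this sketch the first round would be detected as numeric and the second as character, which is the kind of mismatch that fixing the column to character from the outset avoids.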
This is why, early on, after lots of discussion, we allowed a hub to override the automatic detection of the output_type_id data type in the hubData::create_hub_schema() function, which is used to determine the overall hub schema from the tasks.json config file. To future proof the output_type_id column, admins could set the value of output_type_id_datatype to the safest, most future proof data type, i.e. character.
Admittedly, the option to override the output_type_id column data type has not been propagated as an argument to the hubValidations validate_*() fns, and it should be.
Soon there will also be a tasks.json property where this setting can be configured, communicated and used to set the data type of the output_type_id column across all rounds (see https://github.com/hubverse-org/schemas/issues/87 and https://github.com/hubverse-org/hubData/issues/44 for details).
Introduce output_type_id_datatype argument across relevant validate_*() fns and set "from_config" as default setting
Once https://github.com/hubverse-org/schemas/issues/87 and https://github.com/hubverse-org/hubData/issues/44 are implemented, introduce an output_type_id_datatype argument to validate_pr(), validate_submission() and validate_model_data() and set its default value to "from_config".
This will ensure hubData::create_hub_schema() uses the output_type_id_datatype setting in tasks.json to determine the output_type_id column data type. If the setting is not present, it falls back to "auto". It also allows hub administrators to override this setting during validation if required.
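The intended precedence could be sketched roughly as below. This is a minimal base R illustration, not the hubData or hubValidations implementation: resolve_output_type_id_datatype() is a hypothetical helper, and the assumption that the setting sits at the top level of the parsed tasks.json is mine, pending the linked schema issues.

```r
# Hypothetical helper (not part of hubData or hubValidations): decide which
# data type to use for the output_type_id column.
resolve_output_type_id_datatype <- function(output_type_id_datatype = "from_config",
                                            config_tasks = list()) {
  # An explicit value passed at validation time overrides the config.
  if (!identical(output_type_id_datatype, "from_config")) {
    return(output_type_id_datatype)
  }
  # Otherwise use the tasks.json setting, if one has been configured.
  # Assumption: the property lives at the top level of the parsed config.
  configured <- config_tasks[["output_type_id_datatype"]]
  if (is.null(configured)) "auto" else configured
}

# Lists standing in for a parsed tasks.json (contents are illustrative).
config_with_setting <- list(output_type_id_datatype = "character")
config_without_setting <- list()

res_from_config <- resolve_output_type_id_datatype(config_tasks = config_with_setting)
res_fallback <- resolve_output_type_id_datatype(config_tasks = config_without_setting)
res_override <- resolve_output_type_id_datatype("double", config_with_setting)
```

With "from_config" as the default, most callers would never need to pass the argument, while an administrator could still force a specific type for a one-off validation run.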