hubverse-org / hubDocs

https://hubverse.io

Document the requirement for a stable hub schema across rounds #143

Closed annakrystalli closed 1 month ago

annakrystalli commented 3 months ago

Background

For a hub to be successfully accessed as an arrow dataset, column data types should not change from round to round. We want all files across all rounds to have the same schema.

Most task IDs covered by our schema should not change data type in later rounds, because the schema largely fixes their types. Custom task IDs, which are beyond our control, and the output_type_id column can change type, however, and that could cause problems downstream. This is mainly a problem for parquet files, but there is a small chance it causes problems in CSVs too.
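To make the failure mode concrete, here is a minimal Python sketch of how a column's detected type can drift between rounds. It is only an illustration: `infer_type` and the round values are hypothetical and do not reproduce how hubData or arrow actually detect types.

```python
def infer_type(values):
    """Return the narrowest type name that fits every value.

    Hypothetical, simplified detection: 'integer', then 'double', then 'character'.
    """
    def fits(cast):
        for v in values:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "integer"
    if fits(float):
        return "double"
    return "character"


# Hypothetical output_type_id values as they might appear in submitted files.
round_1 = ["0.25", "0.5", "0.75"]                 # quantile levels only
round_2 = ["0.25", "0.5", "0.75", "low", "high"]  # a later round adds category labels

types_by_round = {
    "round_1": infer_type(round_1),
    "round_2": infer_type(round_2),
}

# The schema is stable only if every round resolves to the same type.
schema_is_stable = len(set(types_by_round.values())) == 1
print(types_by_round, schema_is_stable)
```

In this sketch the first round resolves to a numeric type and the second to character, which is the kind of round-to-round mismatch the issue describes.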

This is why, early on, when we had lots of discussions about this, we allowed a hub to override the automatic detection of the output_type_id data type in the hubData::create_hub_schema() function, which is used to determine the overall hub schema from the tasks.json config file. To future-proof the output_type_id column, admins could set the value of output_type_id_datatype to the safest, most future-proof data type, i.e. character.
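The effect of such an override can be sketched in a few lines of Python. This is an assumption-laden illustration, not the hubData implementation: `detect_type`, `resolve_type`, the `"auto"` default, and the round values are all hypothetical stand-ins for the idea behind output_type_id_datatype.

```python
def _parses(value, cast):
    """Return True if `cast(value)` succeeds."""
    try:
        cast(value)
        return True
    except ValueError:
        return False


def detect_type(values):
    """Hypothetical automatic detection: 'double' if every value is numeric."""
    if all(_parses(v, float) for v in values):
        return "double"
    return "character"


def resolve_type(values, override="auto"):
    """Use automatic detection unless the admin pins a type (hypothetical)."""
    return detect_type(values) if override == "auto" else override


# Hypothetical output_type_id values per round.
rounds = {
    "round_1": ["0.25", "0.5", "0.75"],
    "round_2": ["0.25", "0.5", "0.75", "low", "high"],
}

auto_types = {name: resolve_type(vals) for name, vals in rounds.items()}
pinned_types = {
    name: resolve_type(vals, override="character") for name, vals in rounds.items()
}

print("auto:  ", auto_types)    # type differs between rounds
print("pinned:", pinned_types)  # same type in every round
```

Pinning to character works because every value, numeric or not, can be represented as text, so later rounds cannot force a change of type.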

Improve documentation

Admittedly, this is not discussed in detail in hubDocs and should be, as should the option to override the output_type_id column type. We need to improve the documentation on this aspect, get admins to think about the issue early on, and warn them to avoid situations that might introduce changes in model output column data types.

Once https://github.com/hubverse-org/schemas/issues/87 is also complete, we should document how admins can use that setting to future-proof the output_type_id column data type.