Background
Our current validation functionality works well enough on smaller files and on hubs with simpler config files, but it can be much slower on larger files or more complex configs. The community has also noted and reported this (e.g. #86).
The most time-consuming functions are those that check that the combinations of values are valid and that all required value combinations have been submitted.
There are a number of reasons/bottlenecks:
- The expanded grid of all possible values in a complex hub can be very large.
- The grid also expands value combinations that are effectively invalid because the value of one task id depends on the values of others (e.g. target_end_date, which has only a single valid value dependent on origin_date and horizon; see #38). Expanding the values of such task ids unnecessarily increases the size of the expanded value grid, while the actual validation is performed via the optional validation check hubValidations::opt_check_tbl_horizon_timediff().
- The use of conc_rows to split the submitted table and check for required values in check_tbl_values_required.
These are likely the most effective areas to direct effort to improve performance.
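To give a feel for the scale involved, here is a minimal sketch of how much a dependent task id can inflate the grid. All task id values are made up for illustration, and it assumes target_end_date equals origin_date plus 7 * horizon days; real hubs define these in their config.

```r
# Hypothetical task id values (not taken from any real hub config).
origin_dates <- seq(as.Date("2024-01-06"), by = "week", length.out = 52)
horizons <- 0:3
locations <- sprintf("%02d", 1:50)

# Assumption: target_end_date = origin_date + 7 * horizon days, so the config
# has to list every date that any origin_date/horizon pair can produce.
offsets <- as.vector(outer(as.numeric(origin_dates), 7 * horizons, `+`))
target_end_dates <- as.Date(sort(unique(offsets)), origin = "1970-01-01")

task_ids <- list(
  origin_date = origin_dates,
  horizon = horizons,
  location = locations,
  target_end_date = target_end_dates
)

# Rows in the full grid versus the grid without the dependent task id.
n_full <- prod(lengths(task_ids))
n_reduced <- prod(lengths(task_ids[names(task_ids) != "target_end_date"]))

n_full              # 572000
n_reduced           # 10400
n_full / n_reduced  # 55
```

In this toy setup only 10,400 of the 572,000 expanded rows can ever be valid, because each origin_date/horizon pair admits exactly one target_end_date.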
Specific Actions
[ ] Perform memory-intensive validations in a piecemeal way: Once https://github.com/hubverse-org/hubData/issues/39 is implemented, we should refactor any checks that use expand_model_out_val_grid() to perform the checks one output type at a time. This way we avoid holding the full expanded grid in memory at any one time.
[ ] Memoise expand_model_out_val_grid(): As this function is called a number of times but always returns the same result for the same config, it's a good candidate for memoisation (#85).
[ ] Optimise conc_rows: I've already tried and failed to improve the performance of this function, but given it's the main bottleneck in check_tbl_values_required, it feels important to revisit and try again.
[ ] Introduce a mechanism for excluding task ids from the expanded grid of valid values: This relates to task ids like target_end_date, for which expanding the values is meaningless yet can be very memory-consuming. For such task ids, validation would involve:
  - Validating that the unique values in the task id column are valid with respect to the config (instead of checking them as part of combinations).
  - Using custom/optional functions to validate expected properties/relationships of such variables.
  In expanded grids, such task ids would likely be encoded as NAs.
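The piecemeal approach from the first action could look roughly like the sketch below. The config structure, the helper expand_one_output_type() and the function check_values_by_output_type() are hypothetical stand-ins, not the real expand_model_out_val_grid() interface; the point is only that each output type's grid is built, used and discarded in turn.

```r
# Toy stand-in for a hub config (hypothetical structure and values).
config <- list(
  task_ids = list(location = c("US", "CA"), horizon = 1:2),
  output_types = list(
    mean = NA_character_,
    quantile = c("0.25", "0.5", "0.75")
  )
)

# Hypothetical helper: expand the valid-value grid for ONE output type only.
expand_one_output_type <- function(config, output_type) {
  expand.grid(
    c(
      config$task_ids,
      list(
        output_type = output_type,
        output_type_id = config$output_types[[output_type]]
      )
    ),
    stringsAsFactors = FALSE
  )
}

# Collapse the selected columns of each row into a single comparable string.
row_key <- function(df, cols) {
  do.call(paste, c(unname(as.list(df[cols])), sep = "|"))
}

# Check submitted rows against the grid one output type at a time, so only
# one chunk of the grid exists in memory during each iteration.
check_values_by_output_type <- function(submission, config) {
  cols <- c(names(config$task_ids), "output_type", "output_type_id")
  valid <- logical(nrow(submission)) # rows of unknown output type stay FALSE
  for (output_type in names(config$output_types)) {
    rows <- which(submission$output_type == output_type)
    if (length(rows) == 0L) next
    grid <- expand_one_output_type(config, output_type)
    valid[rows] <-
      row_key(submission[rows, , drop = FALSE], cols) %in% row_key(grid, cols)
  }
  valid
}

# Hypothetical submission: two valid rows and one invalid quantile level.
submission <- data.frame(
  location = c("US", "CA", "US"),
  horizon = c(1L, 2L, 1L),
  output_type = c("mean", "quantile", "quantile"),
  output_type_id = c(NA, "0.5", "0.9"),
  stringsAsFactors = FALSE
)

valid <- check_values_by_output_type(submission, config)
valid # TRUE TRUE FALSE
```

Peak memory is then bounded by the largest single output type grid rather than by the sum of all of them.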
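For the memoisation action, the idea can be illustrated with a small hand-rolled cache in base R. Here expand_grid_slow() and memoise_one_arg() are hypothetical names used only for this sketch; in practice a package such as memoise could play the same role.

```r
# Hypothetical stand-in for an expensive but deterministic grid expansion.
n_calls <- 0L
expand_grid_slow <- function(task_ids) {
  n_calls <<- n_calls + 1L # count how often the real work is done
  expand.grid(task_ids, stringsAsFactors = FALSE)
}

# Minimal memoiser for a one-argument function: results are cached in a
# closure, keyed on the deparsed argument.
memoise_one_arg <- function(f) {
  cache <- list()
  function(x) {
    key <- paste(deparse(x), collapse = "")
    if (is.null(cache[[key]])) {
      cache[[key]] <<- f(x)
    }
    cache[[key]]
  }
}

expand_grid_cached <- memoise_one_arg(expand_grid_slow)

task_ids <- list(location = c("US", "CA"), horizon = 1:3)
grid_a <- expand_grid_cached(task_ids) # computed
grid_b <- expand_grid_cached(task_ids) # served from the cache

n_calls                   # 1
identical(grid_a, grid_b) # TRUE
```

Note the trade-off with the piecemeal action: a cache keeps the expanded grid alive in memory, so memoisation saves time at the cost of the memory that chunked checking tries to free.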
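The exclusion mechanism from the last action might be sketched as follows. The config values are invented, the `derived` vector is a hypothetical way of flagging excluded task ids, and the relationship check assumes horizon is measured in weeks; in hubValidations the relationship step is what hubValidations::opt_check_tbl_horizon_timediff() is for.

```r
# Hypothetical config values for three task ids.
config_task_ids <- list(
  origin_date = as.Date(c("2024-01-06", "2024-01-13")),
  horizon = 0:1,
  target_end_date = as.Date(c("2024-01-06", "2024-01-13", "2024-01-20"))
)
derived <- "target_end_date" # task ids to leave out of the expansion

# Expand only the independent task ids; encode the excluded one as NA.
independent <- setdiff(names(config_task_ids), derived)
grid <- expand.grid(config_task_ids[independent])
grid[[derived]] <- as.Date(NA)

# Hypothetical submission; the last row has an inconsistent target_end_date.
submission <- data.frame(
  origin_date = as.Date(c("2024-01-06", "2024-01-13", "2024-01-13")),
  horizon = c(1L, 1L, 0L),
  target_end_date = as.Date(c("2024-01-13", "2024-01-20", "2024-01-20"))
)

# Step 1: unique values of the excluded task id are checked against the
# config on their own, not as part of value combinations.
values_ok <- all(unique(submission[[derived]]) %in% config_task_ids[[derived]])

# Step 2: a custom check validates the expected relationship
# (assumption: target_end_date = origin_date + 7 * horizon days).
expected <- submission$origin_date + 7 * submission$horizon
relationship_ok <- submission$target_end_date == expected

nrow(grid)      # 4 rows instead of 12 for the full expansion
values_ok       # TRUE
relationship_ok # TRUE TRUE FALSE
```

The third row shows why both steps are needed: its target_end_date is a valid value according to the config, but it does not match its origin_date and horizon.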