Improve validation performance

Background

While our current validation functionality works well enough on smaller files and hubs with simpler config files, it can be much slower on larger files and more complex configs. The community has also noted and reported this (e.g. #86).

The most time-consuming checks are those that verify that each combination of values is valid and that all required value combinations have been submitted.

There are a number of reasons/bottlenecks:

- The expanded grid of all possible value combinations in a complex hub can be very large.
- The grid also expands value combinations that are effectively invalid because the value of one task id depends on the values of others (e.g. target_end_date, which has only a single valid value dependent on origin_date and horizon; see #38). Expanding the values of such task ids unnecessarily increases the size of the expanded value grid, while the actual validation is performed via the optional validation check hubValidations::opt_check_tbl_horizon_timediff().
- Creating an index via conc_rows, which is then used to split the submitted table and check for required values in check_tbl_values_required, is slow.
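
To give a rough sense of the first two points, here is a minimal base R sketch. The task id values are made up for illustration (they are not taken from any real hub config), and horizon is assumed to be in weeks.

```r
# Hypothetical task id values, for illustration only.
origin_date <- as.Date("2024-01-06") + 7 * (0:19)  # 20 weekly origin dates
horizon     <- 0:3                                  # assumed to be in weeks
location    <- sprintf("%02d", 1:50)
target      <- c("inc hosp", "inc death")

# target_end_date is fully determined by origin_date and horizon.
# These are the distinct values it can take.
target_end_date <- sort(unique(
  rep(origin_date, times = length(horizon)) +
    7 * rep(horizon, each = length(origin_date))
))

# Number of genuinely valid combinations (target_end_date adds no new rows).
n_valid <- length(origin_date) * length(horizon) *
  length(location) * length(target)

# Number of rows if target_end_date is expanded like an independent task id.
n_expanded <- n_valid * length(target_end_date)

c(n_valid = n_valid, n_expanded = n_expanded)
```

With these made-up values the grid grows from 8,000 to 184,000 rows purely because target_end_date is expanded, even though every extra row is invalid.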

These are likely the most effective areas to direct effort to improve performance.

Specific Actions

[ ] Perform memory-intensive validations in a piecemeal way: Once https://github.com/hubverse-org/hubData/issues/39 is implemented, we should refactor any checks that use expand_model_out_val_grid() to perform the checks one output type at a time. This way we avoid holding the full expanded grid in memory at any one time.
[ ] Memoise expand_model_out_val_grid(): As this function is called a number of times but always returns the same result for the same config, it's a good candidate for memoisation (#85).
[ ] Optimise conc_rows: I've already tried and failed to improve the performance of this function, but given that it's the main bottleneck in check_tbl_values_required, it feels important to revisit it and try again.
[ ] Introduce a mechanism for excluding task ids from the expanded grid of valid values: This relates to task ids like target_end_date, for which expanding values is meaningless yet can be very memory consuming. For such task ids, validation would involve:
- Validating that the unique values in the task id column are valid with respect to the config (instead of checking them as part of combinations).
- Using custom/optional functions to validate expected properties/relationships of such variables.
- In expanded grids, such task ids would likely be encoded as NAs.
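
As a rough illustration of the first two actions (piecemeal checks and memoisation), here is a base R sketch. The `config` list, `expand_one_output_type()`, `expand_cached()` and `row_key()` are all hypothetical stand-ins invented for this example. They are not the real config structure or the real expand_model_out_val_grid() interface, and the hand-rolled cache only stands in for whatever memoisation approach is eventually chosen.

```r
# Hypothetical, heavily simplified config (not the real hub config structure).
config <- list(
  task_ids = list(location = c("01", "02"), horizon = 0:1),
  output_types = list(
    quantile = c("0.25", "0.5", "0.75"),
    mean = NA_character_
  )
)

# Hypothetical helper: expand the grid for a single output type only.
expand_one_output_type <- function(config, output_type) {
  expand.grid(
    c(config$task_ids,
      list(output_type = output_type,
           output_type_id = config$output_types[[output_type]])),
    stringsAsFactors = FALSE
  )
}

# Simple cache keyed on the deparsed inputs, so repeated calls with the
# same config and output type reuse the previously computed grid.
expand_cached <- local({
  cache <- list()
  function(config, output_type) {
    key <- paste(deparse(list(config, output_type)), collapse = "")
    if (is.null(cache[[key]])) {
      cache[[key]] <<- expand_one_output_type(config, output_type)
    }
    cache[[key]]
  }
})

# Collapse the selected columns of each row into a single string key.
row_key <- function(df, cols) do.call(paste, c(df[cols], sep = "|"))

# Made-up submission; the third row has an invalid horizon.
tbl <- data.frame(
  location = c("01", "02", "01"),
  horizon = c(0L, 1L, 2L),
  output_type = c("quantile", "mean", "quantile"),
  output_type_id = c("0.5", NA, "0.25"),
  stringsAsFactors = FALSE
)

# Check one output type at a time, so only that output type's grid
# is needed for each chunk of the submission.
valid <- logical(nrow(tbl))
for (ot in unique(tbl$output_type)) {
  idx <- tbl$output_type == ot
  grid <- expand_cached(config, ot)
  cols <- names(grid)
  valid[idx] <- row_key(tbl[idx, , drop = FALSE], cols) %in% row_key(grid, cols)
}
valid
```

Because each iteration only looks up the grid for one output type, peak memory is bounded by the largest single output type rather than by the full grid.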
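
The last action could look something like the following base R sketch. All values and object names here are hypothetical, horizon is assumed to be in weeks, and the relationship check is only a simplified stand-in for what hubValidations::opt_check_tbl_horizon_timediff() does; it does not reproduce that function's interface.

```r
# Hypothetical valid values per task id, for illustration only.
valid_values <- list(
  origin_date = c("2024-01-06", "2024-01-13"),
  horizon = 0:1,
  target_end_date = c("2024-01-06", "2024-01-13", "2024-01-20")
)
derived <- "target_end_date"  # task id to exclude from expansion

# 1. Expand only the non-derived task ids and encode the derived one as NA.
grid <- expand.grid(
  valid_values[setdiff(names(valid_values), derived)],
  stringsAsFactors = FALSE
)
grid[derived] <- NA_character_

# Made-up submission; the third row has an inconsistent target_end_date.
tbl <- data.frame(
  origin_date = c("2024-01-06", "2024-01-13", "2024-01-13"),
  horizon = c(1L, 1L, 0L),
  target_end_date = c("2024-01-13", "2024-01-20", "2024-01-20"),
  stringsAsFactors = FALSE
)

# 2. Validate the unique values of the derived column against the config,
#    instead of checking them as part of combinations.
bad_values <- setdiff(unique(tbl$target_end_date), valid_values$target_end_date)

# 3. Validate the expected relationship separately
#    (assumption: target_end_date = origin_date + 7 * horizon days).
expected <- as.character(as.Date(tbl$origin_date) + 7 * tbl$horizon)
mismatch <- tbl$target_end_date != expected

nrow(grid)   # rows in the reduced grid
bad_values   # values not listed in the config
mismatch     # rows failing the relationship check
```

Here the reduced grid has 4 rows rather than the 12 a full expansion would produce, and the inconsistent third row is still caught by the separate relationship check.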
