Problem: The leading and trailing periods are selected by an assumed number of observations (debias_pref_sample_size_leading) rather than by time deltas. This can give bad results if a large (seasonal) gap falls within the leading or trailing period.
Proposal:
Use the leading/trailing minimum and maximum settings as time deltas (so at most 30 days ahead), and apply the minimum criterion as a minimum number of observations.
# Select all leading and all trailing obs
leading_period = obs[obs["datetime"] < gap.startgap]
trailing_period = obs[obs["datetime"] > gap.endgap]
logger.debug(f' {leading_period.shape[0]} leading records, {trailing_period.shape[0]} trailing records.')
# some derived integers
poss_shrinkage_leading = leading_period.shape[0] - debias_min_sample_size_leading
poss_shrinkage_trailing = trailing_period.shape[0] - debias_min_sample_size_trailing
poss_extention_leading = leading_period.shape[0] - debias_pref_sample_size_leading
poss_extention_trailing = (
    trailing_period.shape[0] - debias_pref_sample_size_trailing
)
# check if desired sample sizes for leading and trailing are possible
if (leading_period.shape[0] >= debias_pref_sample_size_leading) and (
    trailing_period.shape[0] >= debias_pref_sample_size_trailing
):
    logger.debug("leading and trailing periods are both available for debiasing.")
    # both periods are OK
    leading_df = leading_period[-debias_pref_sample_size_leading:]
    trailing_df = trailing_period[:debias_pref_sample_size_trailing]
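For comparison with the count-based selection above, the proposed time-delta selection could look roughly like the sketch below. This is only an illustration, not the package's implementation: the function name select_debias_periods and the parameters max_lead, max_trail, min_leading_obs and min_trailing_obs are hypothetical, and it assumes obs has a "datetime" column of timestamps as in the snippet above.

```python
import pandas as pd


def select_debias_periods(
    obs: pd.DataFrame,
    gap_start: pd.Timestamp,
    gap_end: pd.Timestamp,
    max_lead: pd.Timedelta = pd.Timedelta(days=30),   # hypothetical setting
    max_trail: pd.Timedelta = pd.Timedelta(days=30),  # hypothetical setting
    min_leading_obs: int = 10,                        # hypothetical setting
    min_trailing_obs: int = 10,                       # hypothetical setting
):
    """Select leading/trailing records by time window, then check counts."""
    times = obs["datetime"]

    # Leading window: records in [gap_start - max_lead, gap_start)
    leading = obs[(times >= gap_start - max_lead) & (times < gap_start)]
    # Trailing window: records in (gap_end, gap_end + max_trail]
    trailing = obs[(times > gap_end) & (times <= gap_end + max_trail)]

    # The minimum criterion stays a minimum number of observations
    leading_ok = leading.shape[0] >= min_leading_obs
    trailing_ok = trailing.shape[0] >= min_trailing_obs
    return leading, trailing, leading_ok, trailing_ok
```

With this approach, records on the far side of a seasonal gap never enter the leading or trailing sample, because they fall outside the time window. If too few records remain inside the window, the minimum check fails explicitly instead of the sample being silently filled with observations from months earlier.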
Proposal by @amberJ99