insightsengineering / rbmi

Reference based multiple imputation R package
https://insightsengineering.github.io/rbmi/

Add conservative adjustment if get_jackknife_draws fails for a leave-one-subject-out dataset. #238

Closed by wolbersm 2 years ago

wolbersm commented 2 years ago

For method_condmean(type="jackknife"), it may occur that the estimator fails for only one or very few leave-one-subject-out datasets. Currently, the method fails altogether in this setting, which is fine in 99.9% of cases, as such failures should be extremely rare. However, it still seems reasonable to build in a relatively conservative handling of this "worst case scenario" to avoid a complete failure.

My proposal is to:

a) Add a tolerance threshold which by default is 0 but allows up to x% jackknife failures.
b) If the MMRM/estimator fails on a leave-one-subject-out dataset, treat it as follows:

nociale commented 2 years ago

@wolbersm I have a theoretical question. Let's denote by theta the estimate of the parameter of interest using all subjects, and by theta_m the average of all estimates from the leave-one-out samples. In general, theta and theta_m differ.

The jackknife variance estimator is (N-1)/N * sum( (theta_i - theta_m)^2 ). If we have failures, we are missing the estimates from some of the leave-one-out samples; that is, we cannot compute theta_m, but we still have theta. In this case, the jackknife variance estimator would be computed as follows:

  1. Set all failed estimates equal to theta_max = max(abs(theta_i - theta)), where i runs over the non-failed estimates.
  2. Now we have estimates for all leave-one-out samples, so we can apply the formula (N-1)/N * sum( (theta_i - mean(theta_i))^2 ), where the index i now also includes the "estimates" derived in point 1 (see the sketch below).
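For concreteness, here is a minimal R sketch (not rbmi code) of the standard jackknife variance and of the adjustment described in the two steps above; failed leave-one-out fits are assumed to be encoded as NA in theta_i. Note that step 1 as written here is corrected in the follow-up comments.

```r
# Standard jackknife variance estimator: theta_i is the vector of
# leave-one-subject-out estimates (one per subject, no failures).
jackknife_var <- function(theta_i) {
  n <- length(theta_i)
  (n - 1) / n * sum((theta_i - mean(theta_i))^2)
}

# Conservative adjustment as described in steps 1-2 above (see the later
# comments for a correction of step 1). theta is the full-data estimate;
# failed leave-one-out fits are encoded as NA in theta_i.
jackknife_var_adjusted <- function(theta, theta_i) {
  failed <- is.na(theta_i)
  # Step 1: replace failed estimates by theta_max = max(abs(theta_i - theta))
  # over the non-failed estimates.
  theta_max <- max(abs(theta_i[!failed] - theta))
  theta_i[failed] <- theta_max
  # Step 2: apply the usual formula to the completed vector of estimates.
  jackknife_var(theta_i)
}
```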

Questions:

Thanks!

wolbersm commented 2 years ago

@nociale I think in step 1, one would have to set the estimates in the failed datasets to theta_max = theta + max(abs(theta_i - theta)) instead. Does this make sense?

nociale commented 2 years ago

@wolbersm Sure, that's my mistake! Maybe more precisely it should be theta_max = theta_j where j = argmax_i(abs(theta_i - theta)).
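A small sketch of this corrected replacement rule (again not rbmi code, with failed fits encoded as NA as before): the failed estimates are set to the existing leave-one-out estimate that deviates most from theta, i.e. theta_j with j = argmax_i |theta_i - theta|.

```r
# Corrected step 1: impute failed leave-one-out estimates with theta_j,
# the non-failed estimate with the largest absolute deviation from theta.
impute_failed_estimates <- function(theta, theta_i) {
  failed <- is.na(theta_i)
  ok <- theta_i[!failed]
  theta_j <- ok[which.max(abs(ok - theta))]
  theta_i[failed] <- theta_j
  theta_i
}
```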

Last question: is it true that this estimator is reasonably conservative only for a low number of failures? If the acceptable number of failures is high, then (theta_i - mean(theta_i))^2 might shrink, since many theta_i would essentially share the same value as set in step 1.

wolbersm commented 2 years ago

@nociale @gowerc As discussed at our meeting today, we decided not to implement this, because we felt that if a jackknife sample fails, this may indicate some underlying data problems and we should not gloss over this with an ad-hoc fix.

@gowerc also suggested that, if the jackknife fails, the error message should list the patient IDs which cause a failure when left out of the corresponding jackknife sample; this will be implemented.
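As an illustration of this suggestion (purely a sketch, not the rbmi implementation; the helper fit_one_leave_out() and the subjid column are assumptions made for the example), all leave-one-subject-out fits would be attempted and the IDs of all failing subjects collected into the error message:

```r
# Hypothetical sketch: attempt every leave-one-subject-out fit and collect the
# IDs of the subjects whose removal causes a failure, then report them all.
failing_ids <- character(0)
for (id in unique(data$subjid)) {
  fit <- try(fit_one_leave_out(data, id), silent = TRUE)
  if (inherits(fit, "try-error")) {
    failing_ids <- c(failing_ids, id)
  }
}
if (length(failing_ids) > 0) {
  stop(
    "jackknife failed for the leave-one-subject-out datasets excluding subjects: ",
    paste(failing_ids, collapse = ", ")
  )
}
```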

nociale commented 2 years ago

@gowerc @wolbersm I am wondering: if the error message should include all patient IDs that cause a failure, then rbmi cannot stop running until the MMRM has been fitted on all leave-one-out samples. This may be very inefficient, e.g. when a failure occurs in one of the first leave-one-out samples and the sample size is large. Would it be better to include in the error message only the patient ID left out in the first sample that caused a failure (and thus stop the execution after observing the first failure)?

wolbersm commented 2 years ago

@nociale I agree that including in the error message only the patient ID left out in the first sample that caused a failure (and thus stopping the execution after observing the first failure) should be sufficient.

gowerc commented 2 years ago

@nociale, yeah, I think we should fail fast: just throw the error, list the ID of the first sample to fail, and then abort any further processing.
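For comparison, a fail-fast version of the earlier sketch (same assumed helper and column names as above, not the rbmi implementation) would stop at the first failing leave-one-subject-out dataset and report only that subject's ID:

```r
# Hypothetical fail-fast sketch: abort on the first leave-one-subject-out
# dataset whose fit fails and report the ID of the subject that was left out.
for (id in unique(data$subjid)) {
  fit <- try(fit_one_leave_out(data, id), silent = TRUE)
  if (inherits(fit, "try-error")) {
    stop("jackknife failed for the leave-one-subject-out dataset excluding subject: ", id)
  }
}
```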