Closed wolbersm closed 2 years ago
@wolbersm I have a theoretical question. Let's denote by `theta` the estimate of the parameter of interest using all subjects, and by `theta_m` the average of all estimates from the leave-one-out samples; `theta` and `theta_m` differ in general. The jackknife variance estimator is `(N-1)/N * sum( (theta_i - theta_m)^2 )`. If we have failures, the estimates from some leave-one-out samples are missing; that is, we cannot compute `theta_m`, but we do have `theta`. The jackknife variance estimator in this case is:

1. `theta_max = max(abs(theta_i - theta))`, with `i` in the set of non-failed estimates, and the failed estimates set to this value.
2. `(N-1)/N * sum( (theta_i - mean(theta_i))^2 )` (this time the index `i` also includes the "estimates" derived at point 1).

Questions: at point 1, the maximum deviation is taken from `theta` and not from `mean(theta_i)` over the non-failed estimates. Is this the correct way to go? Does this maximization provide the same `theta_max` as `max(abs(theta_i - theta_m))`, `i` in the set of non-failed estimates, if I had observed `theta_m`? Thanks!
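For concreteness, the standard jackknife variance estimator described above can be sketched as follows (Python used purely for illustration; the numbers are hypothetical and this is not rbmi code):

```python
def jackknife_variance(loo_estimates):
    """Standard jackknife variance: (N-1)/N * sum((theta_i - theta_m)^2),
    where theta_m is the mean of the leave-one-out estimates theta_i."""
    n = len(loo_estimates)
    theta_m = sum(loo_estimates) / n
    return (n - 1) / n * sum((t - theta_m) ** 2 for t in loo_estimates)

# Hypothetical leave-one-out estimates theta_i (one per left-out subject)
loo = [2.1, 1.9, 2.0, 2.2, 1.8]
print(jackknife_variance(loo))
```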
@nociale I think in step 1, one would have to set the estimates in the failed datasets to `theta_max = theta + max(abs(theta_i - theta))` instead. Does this make sense?
@wolbersm Sure, that's my mistake! Maybe more precisely it should be `theta_max = theta_j` where `j = argmax_i(abs(theta_i - theta))`.
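The corrected substitution rule can be sketched like this (an illustrative Python sketch with hypothetical data, not rbmi's implementation): failed leave-one-out estimates are replaced by `theta_j`, the non-failed estimate farthest from the full-data estimate `theta`.

```python
def substitute_failures(non_failed, theta, n_failed):
    """Replace each failed leave-one-out estimate with theta_j, where
    j = argmax_i |theta_i - theta| over the non-failed estimates."""
    theta_j = max(non_failed, key=lambda t: abs(t - theta))
    return non_failed + [theta_j] * n_failed

theta = 2.0                        # hypothetical full-data estimate
non_failed = [2.1, 1.9, 2.0, 2.3]  # hypothetical non-failed leave-one-out estimates
print(substitute_failures(non_failed, theta, 2))  # two failed samples
```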
Last question: is it true that this estimator is reasonably conservative only for a low number of failures? If the acceptable number of failures is high, then `(theta_i - mean(theta_i))^2` might shrink, since many `theta_i` would essentially all have the same value as computed at step 1.
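The shrinking effect asked about here can be checked numerically (hypothetical numbers; not rbmi code): as more failed samples are substituted with the same value `theta_max`, the mean of the estimates is pulled toward `theta_max`, so each substitute's own squared deviation from the mean shrinks.

```python
def jk_var(estimates):
    """(N-1)/N * sum((theta_i - mean)^2) with N = len(estimates)."""
    n = len(estimates)
    m = sum(estimates) / n
    return (n - 1) / n * sum((t - m) ** 2 for t in estimates)

non_failed = [0.9, 1.0, 1.1, 1.05, 0.95]  # hypothetical leave-one-out estimates
theta_max = 1.4                           # hypothetical substitute for failed samples

for k in (1, 5, 50):  # number of failed samples, all substituted with theta_max
    est = non_failed + [theta_max] * k
    m = sum(est) / len(est)
    per_substitute = (theta_max - m) ** 2  # one substitute's squared deviation
    print(k, round(jk_var(est), 4), round(per_substitute, 6))
```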
@nociale @gowerc As discussed at our meeting today, we decided not to implement this, because we felt that if a jackknife sample fails, this may indicate some underlying data problems and we should not gloss over this with an ad-hoc fix.
@gowerc also suggested that if the jackknife fails, the error message should include a list of the patient IDs whose removal causes a jackknife sample to fail. This will be added to the error message.
@gowerc @wolbersm I am wondering: if the error message should include all the patient IDs that cause a failure, then rbmi cannot stop running until the MMRM has been fitted on all the leave-one-out samples. This may be very inefficient, e.g. if a failure occurs in one of the first leave-one-out samples and the sample size is large. Would it be better to include in the error message only the patient ID left out in the first sample that caused a failure (and thus stop execution after observing the first failure)?
@nociale I agree that including in the error message only the patient ID left out in the first sample that caused the failure (and thus stopping execution after observing the first failure) should be sufficient.
@nociale, yes, I think we should fail fast: just throw the error listing the ID of the first sample to fail, then abort any further processing.
For `method_condmean(type = "jackknife")`, it may occur that the estimator fails for only one or very few leave-one-subject-out datasets. Currently, the method fails altogether in this setting. This is fine for 99.9% of cases, as such failures should be extremely rare. However, it still seems reasonable to build in a relatively conservative handling of this "worst case scenario" to avoid complete failure.
My proposal is to: a) Add a tolerance threshold which by default is 0 but allows up to x% jackknife failures. b) If the MMRM/estimator fails on a leave-one-subject-out dataset, treat it as follows: