HDembinski opened this issue 4 years ago
Hmm, my first thought is to accept that this would require someone to repeat the same calculation rather than adding complexity to the API. It is hard to say whether this would be burdensome enough, and common enough, to need a workaround built into the library. Especially since the generators for jackknife and bootstrap samples will be exposed: at worst, someone would have to know the formulas and apply them manually to precomputed samples instead of using the built-in functionality.
The formulas for the bootstrap bias and variance are trivial, but the jackknife formulas are not. The library should not be designed so that I have to type formulas by hand; the whole point of a library is to avoid the kind of errors that manual typing of formulas introduces. The interface should support both the casual user and the power user.
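To make the point concrete, the jackknife formulas in question can be sketched in a few lines. This is a minimal illustration, not the resample implementation: `theta_full` is the estimate on the full sample and `thetas` are the leave-one-out replicas.

```python
# Jackknife bias and variance from precomputed leave-one-out replicas.
# Minimal sketch, not the resample implementation.

def jackknife_bias(theta_full, thetas):
    """Bias estimate: (n - 1) * (mean of replicas - full-sample estimate)."""
    n = len(thetas)
    return (n - 1) * (sum(thetas) / n - theta_full)

def jackknife_variance(thetas):
    """Variance estimate: (n - 1) / n * sum of squared replica deviations."""
    n = len(thetas)
    m = sum(thetas) / n
    return (n - 1) / n * sum((t - m) ** 2 for t in thetas)
```

For the sample mean these reduce to zero bias and the familiar s²/n, which makes them easy to check; for a general estimator, getting the (n - 1) prefactors right by hand is exactly the error-prone step a library should own.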
I am all for having the simplest possible interfaces within the design constraints, but the interfaces should cover the relevant use cases. I think this is really essential. Computing the bootstrap replicas for an estimate can take minutes or much longer. We should not design a library that does not allow a user to reuse the same sample to compute bias, variance, and perhaps a confidence interval (I just had this case). This is in line with exposing the generators. The exposed generators allow the power-user to compute the estimator response to bootstrap replicas in parallel or on a cluster. The second step is then to allow the user to use these computed samples in the various statistical methods.
```python
from concurrent.futures import ProcessPoolExecutor as Pool
from resample.jackknife import resample, bias, variance

x = ...  # some large data set

def my_fn(x): ...

with Pool() as p:
    fn_replicas = list(p.map(my_fn, resample(x)))

# the formulas applied here are not trivial
fn_bias = bias(None, fn_replicas)
fn_variance = variance(None, fn_replicas)
```
Now the great thing about having a unified interface for bootstrap and jackknife is that I can switch to using the bootstrap instead of the jackknife by just changing one line to `from resample.bootstrap import ...`.
The complexity added to the API is minor: you just pass `None` instead of the function.
Maybe you feel uncomfortable that I am suggesting so many changes, but I think this one is really important to make the library complete. After that, I think we are through with the overhaul.
I started to use the library in practice and found a caveat in our current design. Let's say I want to compute a bias correction and the variance of an estimator. If I naively call `resample.jackknife.variance` and `resample.jackknife.bias_corrected`, the jackknife estimates are computed twice (which is expensive). The interface should allow me to reuse precomputed jackknife estimates (I am talking about the jackknife, but the same is true for the bootstrap). I am not sure yet how to best achieve this. Here is my idea so far.
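To make the cost concrete before the proposal itself: a toy jackknife instrumented to count estimator calls shows the replicas being computed twice. These are hypothetical minimal helpers with the same shape as the resample functions, not the library's implementation.

```python
# Toy illustration of the caveat: two helpers that each regenerate the
# leave-one-out replicas call the estimator roughly 2*n times.
# Hypothetical minimal sketch, not the resample implementation.

calls = 0

def my_fn(sample):
    global calls
    calls += 1  # count how often the (expensive) estimator runs
    return sum(sample) / len(sample)

def jackknife_replicas(fn, sample):
    n = len(sample)
    return [fn(sample[:i] + sample[i + 1:]) for i in range(n)]

def variance(fn, sample):
    thetas = jackknife_replicas(fn, sample)  # n estimator calls
    n = len(thetas)
    m = sum(thetas) / n
    return (n - 1) / n * sum((t - m) ** 2 for t in thetas)

def bias(fn, sample):
    thetas = jackknife_replicas(fn, sample)  # another n estimator calls
    n = len(thetas)
    return (n - 1) * (sum(thetas) / n - fn(sample))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
variance(my_fn, x)
bias(my_fn, x)
# calls is now 2 * len(x) + 1: the same replicas were computed twice
```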
Currently, we have in `resample.jackknife` the signature `def variance(fn, sample)`. It expects two mandatory arguments, and I think that should not change. However, we could make it so that if one passes `None` for `fn`, then `sample` is interpreted as the precomputed replicas. This is not ambiguous, because `fn` is never `None` under normal circumstances.

This approach works for all jackknife tools, but `resample.bootstrap.confidence_interval` adds further complications. More precisely, when the "student" and "bca" methods are used, the baseline idea does not work: the "student" method also needs `fn(sample)` in addition to the replicas, and "bca" needs `fn(sample)` and jackknife replicas on top. I think the basic idea can still work if we adapt the call to `confidence_interval` accordingly.

Any thoughts?
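The `fn=None` dispatch described above could look roughly like this. This is a sketch of the dispatch idea only, using a toy list-based resampler rather than the real resample internals.

```python
# Sketch of the proposed dispatch: fn=None means `sample` already holds
# the precomputed replicas. Toy implementation, not resample's.

def variance(fn, sample):
    if fn is None:
        thetas = list(sample)  # precomputed replicas passed directly
    else:
        n = len(sample)
        thetas = [fn(sample[:i] + sample[i + 1:]) for i in range(n)]
    n = len(thetas)
    m = sum(thetas) / n
    return (n - 1) / n * sum((t - m) ** 2 for t in thetas)

def mean(s):
    return sum(s) / len(s)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
replicas = [mean(x[:i] + x[i + 1:]) for i in range(len(x))]
# both call styles give the same answer:
#   variance(mean, x) and variance(None, replicas)
```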
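Purely as an illustrative strawman (the parameter names `theta` and `jackknife_replicas` and this signature are my assumptions, not the actual proposal and not the resample API), the method-specific inputs could become optional keyword arguments that are only required by the methods that need them:

```python
# Strawman only: hypothetical signature, not resample's API.
# "percentile" needs just the bootstrap replicas; "student" additionally
# needs theta = fn(sample); "bca" needs theta and jackknife replicas too.

def confidence_interval(fn, sample, *, cl=0.95, ci_method="percentile",
                        theta=None, jackknife_replicas=None):
    if fn is not None:
        raise NotImplementedError("only the precomputed-replica path is sketched")
    replicas = sorted(sample)  # fn is None: sample holds the replicas
    if ci_method == "percentile":
        # simplistic nearest-rank percentile interval
        alpha = 0.5 * (1 - cl)
        n = len(replicas)
        return (replicas[round(alpha * (n - 1))],
                replicas[round((1 - alpha) * (n - 1))])
    if ci_method in ("student", "bca"):
        if theta is None:
            raise ValueError(f"{ci_method!r} needs theta = fn(sample)")
        if ci_method == "bca" and jackknife_replicas is None:
            raise ValueError("'bca' additionally needs jackknife replicas")
        raise NotImplementedError
    raise ValueError(f"unknown ci_method {ci_method!r}")
```

The design point is that the errors are raised eagerly: a user who asks for "bca" with only bootstrap replicas gets told exactly which precomputed input is missing, instead of silently recomputing it.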