Need a way to pass precalculated replicas

HDembinski commented 4 years ago

I started to use the library in practice and I found a caveat of our current design. Let's say I want to compute a bias correction and the variance of an estimator. If I naively call resample.jackknife.variance and resample.jackknife.bias_corrected, it computes the jackknife estimates twice (which is expensive). The interface should allow me to reuse precomputed jackknife estimates (I am talking about the jackknife but the same is true for the bootstrap).

I am not sure yet how to best achieve this. Here is my idea so far.

Currently, we have in resample.jackknife the signature def variance(fn, sample). It expects two mandatory arguments and I think that should not change. However, we could make it so that if one passes None for fn, then sample is interpreted as the precomputed replicas. This is not ambiguous, because fn is never None under normal circumstances.

This approach works for all jackknife tools, but resample.bootstrap.confidence_level adds further complications. More precisely, when the "student" and "bca" methods are used, the baseline idea does not work. The "student" method also needs fn(sample) in addition to the replicas, and "bca" also needs fn(sample) and jackknife replicas on top.

I think the basic idea can still work, if we make the call to confidence_interval like this

thetas = bootstrap(my_fn, data)
theta = my_fn(data)
j_thetas = jackknife(my_fn, data)
confidence_interval(None, thetas, ci_method="percentile") # ok, works
confidence_interval(None, (thetas, theta), ci_method="student") # ok, additional information passed as tuple
confidence_invertal(None, (thetas, theta, j_thetas), ci_method="bca") # ok, same

Any thoughts?

dsaxton commented 4 years ago

Hmm, first thought is to accept that this would require someone to repeat the same calculation rather than adding complexity to the API. It's hard to say whether this would be so burdensome to users and also common enough that we would need a workaround built into the library (especially since the generators for jackknife and bootstrap samples will be exposed, meaning at worst someone would have to know the formulas and perform them manually on precalculated samples instead of using built-in functionality).

HDembinski commented 4 years ago

The formulas for the bootstrap bias and variance are trivial, but not for the jackknife. The library should not be designed so that I have to type formulas by hand. The whole point of a library is to avoid the kind of errors which manual typing of the formulas introduces. The interface should support both the simple user and the power-user.

I am all for having the simplest possible interfaces within the design constraints, but the interfaces should cover the relevant use cases. I think this is really essential. Computing the bootstrap replicas for an estimate can take minutes or much longer. We should not design a library that does not allow a user to reuse the same sample to compute bias, variance, and perhaps a confidence interval (I just had this case). This is in line with exposing the generators. The exposed generators allow the power-user to compute the estimator response to bootstrap replicas in parallel or on a cluster. The second step is then to allow the user to use these computed samples in the various statistical methods.

from concurrent.futures import ProcessPoolExecutor as Pool
from resample.jackknife import resample, bias, variance
x = ... # some large data set

def my_fn(x): ...

with Pool() as p:
    fn_replicas = list(p.map(my_fn, resample(x)))

# not trivial formulas
fn_bias = bias(None, fn_replicas)         
fn_variance = variance(None, fn_replicas)

Now the great thing about having a unified interface for bootstrap and jackknife is that I can switch to using the bootstrap instead of the jackknife by just changing one line to from resample.bootstrap import ...

HDembinski commented 4 years ago

The complexity added to the API is minor. You just pass None instead of the function.

HDembinski commented 4 years ago

Maybe you feel uncomfortable that I am suggesting so many changes, but I think this one is really important to make the library complete. After that, I think we are through with the overhaul.

scikit-hep / resample

Need a way to pass precalculated replicas #34