raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0
1.64k stars 139 forks source link

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

Open raphaelvallat opened 3 years ago

raphaelvallat commented 3 years ago

pingouin.compute_bootci could leverage the newly-added scipy.stats.bootstrap function (SciPy 1.7.0+) to calculate the bootstrapped confidence intervals.

Pros

Cons

FlorinAndrei commented 2 years ago

I think return_dist could be simulated with a wrapper function that calls np.mean() or whatever, and appends the partial results to a global list at each step. The wrapper function is then passed as a value to statistic. At the end, the global list becomes the bootstrapped distribution. If that sounds right, could you add this suggestion to the docs please?

The BCa being only available for univariate statistics is not a deal breaker, but kinda annoying. I've hit that issue a few times. Maybe I'll stick a feature request into Scipy.

EDIT: https://github.com/scipy/scipy/issues/16433

Thank you for all the improvements.

raphaelvallat commented 2 years ago

Thanks for opening the issue on SciPy! I think that scipy's bootstrap module is still under development so I'll stick to Pingouin's internal implementation for now.

I'm not sure to understand your method to get the bootstrapped distribution, could you please provide some example code?

Thanks, Raphael

FlorinAndrei commented 2 years ago

Sure. I did this recently with Scipy:

Data set:

BirdPecks.csv

Code:

import pandas as pd
import numpy as np
from scipy import stats

BirdPecks = pd.read_csv("BirdPecks.csv")
Nboot = 10000

pg1 = BirdPecks[BirdPecks["group"] == 1]["pecks"].to_numpy()
pg2 = BirdPecks[BirdPecks["group"] == 2]["pecks"].to_numpy()

bdist = []
global bdist

def boot_mean(x, y):
    global bdist
    m = np.mean(x) - np.mean(y)
    bdist.append(m)
    return m

bs_out = stats.bootstrap(
    data=(pg1, pg2),
    statistic=boot_mean,
    vectorized=False,
    paired=False,
    confidence_level=0.9,
    n_resamples=Nboot,
    random_state=123,
    method="basic"
)

print(bdist)
FlorinAndrei commented 2 years ago

@raphaelvallat There are proposals for significant improvements to scipy.stats.bootstrap().

https://github.com/scipy/scipy/pull/16454

https://github.com/scipy/scipy/pull/16455

They're asking for extra comments, but right now I'm under a big time crunch, so I'm not sure I can help.

mdhaber commented 2 years ago

I'd be happy to add other enhancements to scipy.stats.bootstrap if it would make it more attractive!

As @FlorinAndrei mentioned, domain knowledge review of PRs would help - the main reason I haven't made enhancements already is shortage of reviewer time.

We can definitely return the bootstrap distribution in 1.10, and I now see no technical impediment to extending BCa to the multi-sample case.

I decided against returning a normal approximation confidence interval because Efron does not seem to recommended them. Of course, that would be super easy to add, either in a Pingouin wrapper or SciPy itself (if there's a good case made for it).

I'm currently writing a SciPy tutorial for all the resampling methods we offer (permutation_test and monte_carlo_test, too). Let me know if you'd like a preview!

raphaelvallat commented 2 years ago

Hi @mdhaber,

Thanks for all your work on the bootstrap module. I'd be happy to review a PR and/or tutorial 👍

I think that returning the bootstrapped distribution in SciPy would be great. If so, the current pingouin.compute_bootci would essentially become a wrapper around scipy.stats.bootstrap.

Two other potential ideas:

1) I think it might also be helpful to have public API for the lower-level confidence intervals estimation function, such that users can also calculate the CI from their own bootstrapped distribution. This is something that we can implement in pingouin.compute_bootci: the ability to pass as input either the raw array(s), or the bootstrapped distribution (e.g. pg.compute_bootci(bootdist, method="BCa")). 2) I do love the vectorized argument in scipy.stats.bootstrap, but I think for the (many) Python users that are not familiar with vectorization, this can be a little difficult to understand and use ( it took me a few trials to understand how to configure the axes, etc). Do you think that there might be a way to automatically detect if a function is vectorizable? Like if there is an axis keyword-argumnt, like in most numpy/scipy functions?

mdhaber commented 2 years ago

I'd be happy to review a PR and/or tutorial

Great! If you'd like to look at doc additions in scipy/scipy#16454, which returns the bootstrap distribution, I'd appreciate it! If you approve, be sure to select that radio button when you submit your review. As a maintainer, it certainly improves my confidence in merging when the PR has the approval of others.

I think it might also be helpful to have public API for the lower-level confidence intervals estimation function

OK, I'll be thinking about that.

Do you think that there might be a way to automatically detect if a function is vectorizable?

Yup, I can do that. Good idea. Update: see https://github.com/scipy/scipy/pull/16651.

mdhaber commented 2 years ago

@raphaelvallat Now that scipy/scipy#16651 addresses your idea 2 above, here are some thoughts about your idea 1.

In retrospect, I would have made confidence_interval a method so that users could do this, but originally, bootstrap was going to return only confidence_interval. When we made it return a result object and standard_error, we didn't reconsider. Oops.

But perhaps it's for the best: a really flexible way of doing this is to add bootstrap_distribution as a parameter of bootstrap. The idea is that bootstrap would behave exactly as it does now, but it would append bootstrap_distribution to the bootstrap distribution it calculates. To calculate a differenct confidence interval, the user would just pass bootstrap_distribution and n_resamples=0 into bootstrap. But besides allowing the user to calculate different sorts of confidence intervals, they could also use this to iteratively increase the number of bootstrap replications. What do you think?

Another possibility is to make the new bootstrap_distribution an object with a method for doing the same sort of thing as above, but I think I'd rather not do this because it would be a new API to document (even if it is mostly a copy of the bootstrap API).

raphaelvallat commented 2 years ago

Awesome @mdhaber, I just reviewed https://github.com/scipy/scipy/pull/16651!

a really flexible way of doing this is to add bootstrap_distribution as a parameter of bootstrap

I think that having a dedicated Bootstrap class, which can be initialized either with data + statistic, or bootstrap_distribution, and that contains methods such as confidence_intervals and standard_error is probably the cleanest. But what you suggest here is a good compromise. Will it also require passing data and statistic as well as bootstrap_distribution? I could see the case where the users only have the bootstrap distribution and not the underlying data or function.

Another possibility is to make the new bootstrap_distribution an object with a method for doing the same sort of thing as above, but I think I'd rather not do this because it would be a new API to document

I agree.

mdhaber commented 2 years ago

Will it also require passing data and statistic as well as bootstrap_distribution

Most CI methods need to know the observed value of the statistic to compute the confidence interval. I suppose I could allow data to default to None, in which case statistic could be the observed value instead of a function. In an initial PR, I would probably not implement this, but I'll think about it.

mdhaber commented 2 years ago

Just thought I'd mention that https://github.com/scipy/scipy/pull/16455 would add BCa bootstrap for multi-sample statistics. I think it's correct now, and I'm working on unit tests for it. Reviews appreciated!

mdhaber commented 2 years ago

And scipy/scipy#16714 adds the functionality suggested in https://github.com/raphaelvallat/pingouin/issues/189#issuecomment-1189750880 to address @raphaelvallat's idea 1. With that, one can change confidence level, CI type, and add resamples (or not) like:

res = stats.bootstrap((x,), np.mean, confidence_level=0.95, method='percentile')
# without any new resampling, change the confidence level and method
res2 = stats.bootstrap((x,), np.mean, n_resamples=0, bootstrap_result=res, confidence_level=0.9, method='BCa')
raphaelvallat commented 2 years ago

@mdhaber I reviewed the second PR. Unfortunately, I don't think I have the stats skills nor time to do a full in-depth review of https://github.com/scipy/scipy/pull/16455 — @FlorinAndrei might be better able to help here. But please let me know how I can be helping. Thanks!

mdhaber commented 2 years ago

No problem. Thanks for taking a look at scipy/scipy#16714!

mdhaber commented 2 years ago

The multi-sample BCa PR merged. Hope it helps!