masurp / specr

Conducting and Visualizing Specification Curve Analyses
https://masurp.github.io/specr
GNU General Public License v3.0

Request: Perform joint test across specification curves #24

Open hp2500 opened 3 years ago

hp2500 commented 3 years ago

Hi there,

Some authors suggest running an additional inferential statistical test on the distribution of effect sizes and test statistics across specification curves, e.g., whether the mean/median effect size differs from zero, whether the share of significant results is higher than would be expected under H0, or whether the average test statistic differs from what would be expected under H0.

Source: Simonsohn, Uri, Joseph P. Simmons, and Leif D. Nelson. “Specification Curve Analysis.” Nature Human Behaviour 4, no. 11 (November 2020): 1208–14. https://doi.org/10.1038/s41562-020-0912-z.
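For illustration, the descriptive side of these statistics is easy to compute from any table of per-specification results (made-up data and column names below); the open question is how to obtain a reference distribution under H0:

# Hypothetical: one row per specification, with an estimate and a p-value
curve <- data.frame(estimate = rnorm(50, mean = 0.1, sd = 0.05),
                    p.value  = runif(50))

median(curve$estimate)      # median effect size across the curve
mean(curve$p.value < .05)   # share of "significant" specifications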

What is your opinion on this? Would it make sense to implement (some of) these tests in specr? Can you suggest any current workarounds? Your response would be greatly appreciated.

Best wishes, Heinrich

masurp commented 3 years ago

Hi Heinrich,

thanks for raising this point. We have been thinking a lot about implementing this "third" step as outlined by Simonsohn et al. At the time, we concluded that there is still too much uncertainty about when such joint inferences make sense (see, e.g., the very recent discussion by Del Giudice & Gangestad (2021) of the differences between truly arbitrary and non-arbitrary decisions and the limits of multiverse analyses: https://journals.sagepub.com/doi/full/10.1177/2515245920954925). To quote some of these concerns:

"By inflating the size of the analysis space, the combinatorial explosion of unjustified specifications may, ironically, exaggerate the perceived exhaustiveness and authoritativeness of the multiverse while greatly reducing the informative fraction of the multiverse. At the same time, the size of the specification space can make it harder to inspect the results for potentially relevant findings. If unchecked, multiverse-style analyses can generate analytic “black holes”: massive analyses that swallow true effects of interest but, because of their perceived exhaustiveness and sheer size, trap whatever information is present in impenetrable displays and summaries."

We were (and still are) a bit concerned that tools for such an inference test could lead to misuse or wrongly conducted robustness tests. That said, I will reconsider implementing such tools in future versions of the package. As a workaround, I would suggest inspecting the code by Simonsohn et al. (https://osf.io/9rvps/).
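The core idea of their "under-the-null" procedure for a linear specification can be sketched with made-up data (only an illustration of the principle, not the actual OSF code):

# Minimal sketch of the "under-the-null" idea (made-up data)
set.seed(42)
d <- data.frame(x = rnorm(100))
d$y <- 0.3 * d$x + rnorm(100)

# Estimate the effect, then remove it from the outcome so that H0 is true
b <- coef(lm(y ~ x, data = d))["x"]
d$y_null <- d$y - b * d$x

# Bootstrapping rows of d and refitting all specifications on y_null yields
# curves as they would look if the true effect were zero; the observed curve
# is then compared against these null curves.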

Best regards, Philipp

masurp commented 8 months ago

Some news!

I have worked on a preliminary solution to this. There is a new function called boot_null (see here), which creates a new specr.boot object that contains all relevant parameters and the refits under the null. Here is how you can use it:

library(specr)

1. Set up a specific extraction function that extracts the entire model object per specification:

# Requires keeping the full model object
tidy_full <- function(x) {
  fit <- broom::tidy(x, conf.int = TRUE)
  fit$res <- list(x)  # Store the full model object alongside the tidy output
  return(fit)
}

2. Set up your specifications and pass the function to fun1:

specs <- setup(data = example_data,
               y = c("y1", "y2"),
               x = c("x1", "x2"),
               model = "lm",
               controls = c("c1", "c2"),
               fun1 = tidy_full)

3. Run the standard analysis:

results <- specr(specs)

4. Refit the models under the null:

I am just using 10 bootstrap samples here to save time, but you should go for 1,000 or even 10,000. Careful: this requires quite some time, but you can parallelize!
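For example, specr parallelizes via the future framework, so setting a plan before the call should distribute the refits across cores (assuming boot_null respects the active plan, as the "Cores used" line in the output below suggests):

# Hedged sketch: register a parallel plan before bootstrapping
library(future)
plan(multisession, workers = 4)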

# Run bootstrapping
boot_models <- boot_null(results, specs, n_samples = 10) # better 1,000!
boot_models
## Results of bootstrapping 'under-the-null' procedure
## -------------------
## Technical details:
## 
##   Class:                           specr.boot -- version: 1.0.1 
##   Cores used:                      1 
##   Duration of fitting process:     20.839 sec elapsed 
##   Number of bootstrapped samples:  10 
## 
## Descriptive summary of the specification curves 'under-the-null' (head):
## 
##           id median   min  max
##  Bootstrap01  -0.02 -0.11 0.04
##  Bootstrap02   0.03 -0.08 0.07
##  Bootstrap03   0.01 -0.11 0.10
##  Bootstrap04  -0.03 -0.09 0.03
##  Bootstrap05  -0.02 -0.11 0.07
##  Bootstrap06  -0.04 -0.10 0.03
## 
## 
## Overall median across all resamples (should be close to NULL):
## 
##  median   min   max
##   -0.02 -0.02 -0.02

The resulting object now includes 10 resampled versions of the specification curve under the assumption that the effect of interest is null. The short descriptive summary of each resample above clearly shows that they are all close to zero.

We can then use the summary function to get the inference statistics proposed by Simonsohn et al. (2020).

# Summarize findings
summary(boot_models)
##   median median p share positive share positive p share negative share negative p
## 1   0.14   < .001         8 / 16           < .001         6 / 16             .100

In this example, the median effect size across the observed curve (0.14) is significantly larger than would be expected under the null. You can also use the plot function to get the figure they suggest and show in the paper:

# Plot under-the-null curves on top of specification curve
plot(boot_models)

[Screenshot: observed specification curve plotted on top of the bootstrapped under-the-null curves]

I might play around with this function a bit more to make it more stable and the code more concise, etc. But for now, it seems to work well!

masurp commented 4 months ago

The function is now officially included in the development version. A corresponding tutorial is also on the website: https://masurp.github.io/specr/articles/inferences.html
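For anyone who wants to try it out, the development version can presumably be installed the usual way from GitHub:

# Install the development version from GitHub
# install.packages("remotes")
remotes::install_github("masurp/specr")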