guntherbensch / repframe

Stata package to calculate, tabulate and visualize Reproducibility and Replicability Indicators based on multiverse analyses.

REPFRAME v1.5.2

This is a Stata package to calculate, tabulate and visualize Reproducibility and Replicability Indicators. These indicators compare estimates from a multiverse of analysis paths of a robustness analysis — be they reproducibility or replicability analyses — to the original estimate in order to gauge the degree of reproducibility or replicability. The package comes with two commands: repframe is the main command, and repframe_gendata generates a dataset that is used in the help file of the command to show examples of how the command works.

The package can be installed in Stata by executing:

net install repframe, from("https://raw.githubusercontent.com/guntherbensch/repframe/main") replace

Once installed, please see

help repframe

for the syntax and the whole range of options.

As shown in the figure below and described in the following, the repframe command can be applied both to derive indicators at the level of individual studies and to pool these indicators across studies. At both levels, the command produces two outputs: a table with the main set of indicators (Reproducibility and Replicability Indicators table) as a .csv or .xlsx file, and a so-called Robustness Dashboard that visualizes a second set of indicators. At the study level, the command additionally produces a Stata .dta file as a third output. This study-level indicator data is ready to be re-introduced into the command to calculate the indicators across studies.

repframe outputs  

Defaults applied by the repframe command

The repframe command applies a few default assumptions. Use the following options in case your analysis makes different assumptions.

Tests for statistical significance: The command applies two-sided t-tests to define which p-values imply statistical significance. These tests may apply different significance levels to the original results (siglevel_orig(#)) and to the robustness results (siglevel(#)). If the related p-values retrieved via the options pval(varname) and pval_orig(varname) are based on one-sided tests, these p-values need to be multiplied by two so that they correspond to a two-sided test. If no information on p-values is available, the command derives the missing p-value information applying the t-test formula. Depending on which additional information is available, this may be done based on t/z-scores (zscore(varname) and zscore_orig(varname)), on standard errors (se(varname) and se_orig(varname)) and degrees of freedom (df(varname) and df_orig(varname)), or on standard errors assuming a normal distribution. Remember that the latter may not always be appropriate, for example with small samples, when estimations have few degrees of freedom because they account for survey sampling, e.g. via the Stata command svy:, or when p-values are derived using randomisation inference. Conversely, if input data on p-values is based on distributional assumptions other than normality, the formula may not correctly derive standard errors. It is therefore recommended to specify both the information on p-values and on standard errors, and to consider the implications if non-normality is assumed in either the original or robustness analysis.
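The normal-approximation fallback and the one-sided-to-two-sided conversion described above can be sketched as follows (an illustrative Python sketch, not repframe's internal code; the function names are assumptions):

```python
from statistics import NormalDist

def two_sided_p_normal(beta, se):
    """Two-sided p-value from a coefficient and its standard error,
    assuming a normally distributed test statistic (the caveat above
    applies: with few degrees of freedom a t-distribution is needed)."""
    z = abs(beta / se)
    return 2 * (1 - NormalDist().cdf(z))

def one_to_two_sided(p_one_sided):
    """Convert a one-sided p-value to its two-sided equivalent,
    as required for the pval() and pval_orig() inputs."""
    return min(2 * p_one_sided, 1.0)

print(round(two_sided_p_normal(1.96, 1.0), 3))  # ≈ 0.05
```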

[!IMPORTANT]
Replicators of the Robustness Reproducibility in Economics (R2E) project should always apply a 5% significance level to the robustness results, i.e. siglevel(5).

Units in which effect sizes are measured: The command assumes that effect sizes of original results and robustness results are measured in the same unit. If this is not the case, for example because one is measured in log terms and the other is not, use the option sameunits(varname). This option requires a numerical variable varname containing the observation-specific binary information on whether the two are measured in the same unit (1=yes) or not (0=no).

Original analysis to be included as one robustness analysis path: The command assumes that the original analysis is not to be included as one analysis path in the multiverse robustness analysis. Otherwise specify the option orig_in_multiverse(1). Then, the original result is incorporated in the computation of three of the variation indicators ($I_{4}$, $I_{5}$, and $I´_{3}$). Irrespective of whether the original analysis is included as one robustness analysis path or not, the dataset should only include the information on the original analysis in the variables imported via the options ending with _orig, and not as a separate robustness analysis path.

Required input data structure

Data structure for analyses at study level

The input data at study level needs to be in a specific format for repframe to be able to calculate the indicators and dashboards. Each observation should represent one analysis path, that is, one combination of choices across the analytical decisions in the multiverse robustness analysis. In the toy example with one main outcome represented in the below figure, two alternative choices are assessed for one analytical decision (analytical_decision_1, e.g. a certain adjustment of the outcome variable) and three alternative choices are assessed for two other analytical decisions (analytical_decision_2 and analytical_decision_3, e.g. the set of covariates and the sample used). This gives a multiverse of 3^2*2^1 = 18 analysis paths, if all combinations are to be considered. The number of observations is therefore 18 in this example.
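The size of such a multiverse can be checked by enumerating all combinations, as in this minimal sketch (the choice labels are made up for illustration):

```python
from itertools import product

# hypothetical choice sets for the three analytical decisions in the toy example
analytical_decision_1 = ["outcome adjustment A", "outcome adjustment B"]           # 2 choices
analytical_decision_2 = ["covariate set X", "covariate set Y", "covariate set Z"]  # 3 choices
analytical_decision_3 = ["sample S1", "sample S2", "sample S3"]                    # 3 choices

# one analysis path = one combination of choices across decisions
paths = list(product(analytical_decision_1,
                     analytical_decision_2,
                     analytical_decision_3))
print(len(paths))  # 3^2 * 2^1 = 18
```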

For each observation, the minimum requirement is that the variable mainlist (this is the outcome at the study level) is defined together with the coefficient information retrieved via the options beta(varname) and beta_orig(varname) and information to determine statistical significance. It is recommended to specify both the information on p-values and on standard errors, as outlined above in the sub-section on defaults applied by the repframe command. As noted in that same sub-section, the dataset should furthermore not include observations with information on the original analysis as robustness analysis paths but only in the variables imported via the options ending with _orig. Also note that the variable mainlist should be numeric with value labels.

toy example of repframe multiverse input data structure  

The Stata help file contains a simple example that uses the command repframe_gendata to build such a data structure.

Data structure for analyses across studies

The repframe command can also be used to compile Reproducibility and Replicability Indicators across studies. To do so, one only has to append the study-level indicator data that include the Reproducibility and Replicability Indicators of individual studies and then feed them back into a variant of the repframe command. The following steps need to be taken:

  1. run repframe multiple times with individual studies to create the study-level indicator data saved as repframedata[fileidentifier].dta — with [fileidentifier] as defined by the option fileidentifier(string)
  2. append the individual study-level indicator data, making sure that all individual studies applied the same significance level, which can be checked with the variable siglevel contained in the study-level indicator data
  3. run the following commands to compile a dataset with Reproducibility and Replicability Indicators across studies
. encode ref, gen(reflist)
. drop ref
. order reflist
. save "[filename].dta", replace

  where [filename] can be freely chosen for the dataset containing all appended study-level indicator data, potentially including the full path of the file.

  4. run repframe again, now using the option studypool(1) to request the calculation of indicators across studies.
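The append-and-encode logic of steps 2 and 3 can be mirrored with plain data structures to make it concrete (an illustrative sketch, not the repframe file format; the records below are made up):

```python
# Hypothetical study-level indicator records; real data would come
# from the repframedata[fileidentifier].dta files.
study_a = [{"ref": "Study A", "siglevel": 5, "I1": 0.8}]
study_b = [{"ref": "Study B", "siglevel": 5, "I1": 0.4}]

pooled = study_a + study_b  # step 2: append the study-level data

# all studies must have applied the same significance level
assert len({rec["siglevel"] for rec in pooled}) == 1

# analogue of -encode ref, gen(reflist)-: map study labels to 1-based integers
labels = sorted({rec["ref"] for rec in pooled})
for rec in pooled:
    rec["reflist"] = labels.index(rec["ref"]) + 1
```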

The Reproducibility and Replicability Indicators

The Reproducibility and Replicability Indicators table and the Robustness Dashboard present two separate sets of indicators. These indicators are primarily designed as easily and intuitively interpretable metrics for robustness tests. First of all, they are applicable to tests of robustness reproducibility, which asks to what extent results in original studies are robust to alternative plausible analytical decisions on the same data (Dreber and Johannesson 2023). This makes it plausible to assume that the tests of robustness reproducibility and the original study measure exactly the same underlying effect size, with no heterogeneity and no difference in statistical power.

For tests of replicability using new data or alternative research designs, more sophisticated indicators are required to account for potential heterogeneity and difference in statistical power (cf. Mathur and VanderWeele 2020, Pawel and Held 2022).

The indicators are meant to inform about three pieces of information on reproducibility and replicability, related to either statistical significance or effect sizes.

For situations in which an original study and a robustness analysis apply different classifications of what constitutes a statistically significant result, i.e. different levels of statistical significance ${\alpha}$, the Robustness Dashboard adds an indicator on non-agreement due to significance classification.

Moreover, the Robustness Dashboard includes the option extended(string), which allows incorporating an extended set of indicators.

Reproducibility and Replicability Indicators table

The following describes the main indicators presented in the Reproducibility and Replicability Indicators table as they are computed at the level of each assessed outcome within a single study. Aggregation across outcomes at the study level is simply done by averaging the indicators as computed at outcome level, separately for outcomes reported as originally significant and outcomes reported as originally insignificant. Similarly, aggregation across studies is simply done by averaging the indicators as computed at study level. An example of a Reproducibility and Replicability Indicators table at study level is provided at the end of this section.

  1. The statistical significance indicator as a significance agreement indicator measures for each outcome $j$ the share of the $n$ robustness analysis paths $i$ that are reported as statistically significant or insignificant in both the original study and the robustness analysis. Accordingly, the indicator is computed differently for outcomes where the original results were reported as statistically significant and those where the original results were found to be statistically insignificant. Statistical significance is defined by a two-sided test with $\alpha^{orig}$ being the significance level applied in the original study and $\alpha$ being the significance level applied in the robustness analysis. For statistically significant original results, the effects of the robustness analysis paths must also be in the same direction as the original result, as captured by coefficients having the same sign or, expressed mathematically, by $\mathbb{I}(\beta_i \times \beta^{orig}_j \ge 0)$.

$$ I_{1j} = mean(\mathbb{I}(pval_i \le \alpha) \times \mathbb{I}(\beta_i \times \beta^{orig}_j \ge 0)) \quad \text{if } pval^{orig}_j \le \alpha^{orig} $$

$$ I_{1j} = mean(\mathbb{I}(pval_i > \alpha)) \quad \text{if } pval^{orig}_j > \alpha^{orig} $$

:point_right: This share indicator is intended to capture whether statistical significance in a robustness analysis confirms statistical significance in an original study. The indicator reflects a combination of technical agreement of results (do estimates agree in terms of achieving a certain level of statistical significance?) and classification agreement as introduced above (do estimates agree in terms of whether they are classified as statistically significant, given a potentially more or less demanding level of statistical significance applied by original authors?).

Interpretation: An indicator $I_{1j}$ of 0.3 for an outcome $j$ reported as statistically significant in the original study, for example, implies that 30\% of robustness analysis paths for this outcome (i) are statistically significant according to the significance level adopted in the robustness analysis, and (ii) that their coefficients share the same sign as the coefficient in the original study. Conversely, 70\% of robustness analysis paths for this outcome are most likely statistically insignificant, while it cannot be excluded that part of these paths are statistically significant but in the opposite direction. Note also that robustness analysis paths for this outcome may be found statistically insignificant — and thus non-confirmatory — only because of a stricter significance level adopted in the robustness analysis compared to the original study. An indicator of 0.3 for outcomes reported as statistically insignificant in the original study implies that 30\% of robustness analysis paths for this outcome are also statistically insignificant according to the significance level adopted in the robustness analysis. Now, the remaining 70\% of robustness analysis paths are statistically significant (most likely with the same sign), while a less strict significance level applied in the robustness analysis could now affect this indicator.
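As a numerical illustration of $I_{1j}$ for an originally significant outcome (toy values, not repframe output; the function name is an assumption):

```python
def i1_significant_orig(pvals, betas, beta_orig, alpha):
    """Share of robustness paths that are significant at alpha and share
    the sign of the original coefficient (I_1j, originally significant case)."""
    hits = [(p <= alpha) and (b * beta_orig >= 0) for p, b in zip(pvals, betas)]
    return sum(hits) / len(hits)

# toy multiverse: two confirmatory paths out of four
pvals = [0.01, 0.04, 0.20, 0.60]
betas = [0.5, 0.6, 0.1, -0.2]
print(i1_significant_orig(pvals, betas, beta_orig=0.55, alpha=0.05))  # 0.5
```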

  2. The relative effect size indicator measures the mean of the coefficients $\beta_i$ of all robustness analysis paths for each outcome $j$ divided by the original coefficient $\beta^{orig}_j$. The indicator requires that the effect sizes of the original and robustness results are measured in the same units. It is furthermore only applied to outcomes reported as statistically significant in the original study, now — and for the following indicators as well — irrespective of whether in the same direction or not.

$$ I_{2j} = \frac{mean(\beta_i)} {\beta^{orig}_j} \quad \text{if } pval^{orig}_j \le \alpha^{orig} $$

$$ I_{2j} \text{ not applicable} \quad \text{if } pval^{orig}_j > \alpha^{orig} $$

:point_right: This ratio indicator is intended to capture how the size of robustness coefficients compares to the size of original coefficients.

Interpretation: An indicator $I_{2j}$ above 1 implies that the mean of the coefficients of all the robustness analysis paths for a statistically significant original result on outcome $j$ is — in absolute terms — higher than the original coefficient (while both show in the same direction), with a factor of $I_{2j}$ (e.g. 1.3). An indicator between 0 and 1 means that the mean coefficient in the robustness analysis paths is lower than the original coefficient (while both show in the same direction), again with a factor of $I_{2j}$ (e.g. 0.7). An indicator below 0 implies that the two compared parameters have different signs. Here, the absolute value of the mean coefficient in the robustness analysis paths is higher (lower) than the absolute value of the original coefficient if $I_{2j}$ is below (above) -1.
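A toy computation of $I_{2j}$ (illustrative values and function name):

```python
def i2(betas, beta_orig):
    """Mean robustness coefficient relative to the original coefficient (I_2j),
    applicable only when the original result is statistically significant."""
    return (sum(betas) / len(betas)) / beta_orig

print(i2([0.25, 0.5, 0.75], beta_orig=0.5))  # 1.0: robustness mean equals the original
```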

  3. The relative t/z-value indicator as a relative significance indicator measures for each outcome $j$ the mean of the t/z-values ($zscore_i$) of all the robustness analysis paths divided by the t/z-value of the original result. The indicator is also only derived for outcomes reported as statistically significant in the original study.

$$ I_{3j} = \frac{mean(zscore_i)} {zscore^{orig}_j} \quad \text{if } pval^{orig}_j \le \alpha^{orig} $$

$$ I_{3j} \text{ not applicable} \quad \text{if } pval^{orig}_j > \alpha^{orig} $$

:point_right: This ratio indicator is intended to capture how the statistical significance of robustness results compares to the statistical significance of original results.

Interpretation: An indicator $I_{3j}$ above (below) 1 means that the average t/z-value of all robustness analysis paths for outcome $j$ is — in absolute terms — higher (lower) than the original t/z-value, suggesting a higher (lower) level of statistical significance in the robustness analysis. An indicator below 0 additionally implies that the two compared parameters have different signs, where the absolute value of the mean t/z-value in the robustness analysis paths is higher (lower) than the absolute value of the original t/z-value if $I_{3j}$ is below (above) -1.
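A toy computation of $I_{3j}$ (illustrative values and function name):

```python
def i3(zscores, zscore_orig):
    """Mean robustness t/z-value relative to the original t/z-value (I_3j)."""
    return (sum(zscores) / len(zscores)) / zscore_orig

print(i3([2.0, 3.0, 4.0], zscore_orig=2.0))  # 1.5: significance higher than originally
```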

  4. The effect size variation indicator measures for each outcome $j$ the standard deviation $sd$ of all robustness coefficients divided by the standard error $se$ of the original coefficient. Here, the $\beta_i$ may incorporate the original result as one robustness analysis path. The indicator requires that the effect sizes of the original and robustness results are measured in the same units.

$$ I_{4j} = \frac{sd(\beta_i)}{se(\beta^{orig}_j)} $$

applied separately to $pval^{orig}_j \le \alpha^{orig}$ and $pval^{orig}_j > \alpha^{orig}$.

:point_right: This ratio indicator is intended to capture how the variation in coefficients of robustness results compares to the variation estimated for the original coefficient.

Interpretation: An indicator $I_{4j}$ above (below) 1 means that variation across all robustness analysis paths for outcome $j$ is higher (lower) than the variation estimated for the original result, with a factor of $I_{4j}$.
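A toy computation of $I_{4j}$ (illustrative values; the sample standard deviation is used, matching Stata's sd):

```python
from statistics import stdev  # sample standard deviation, as in Stata

def i4(betas, se_orig):
    """Standard deviation of robustness coefficients relative to the
    original standard error (I_4j)."""
    return stdev(betas) / se_orig

print(i4([0.4, 0.5, 0.6], se_orig=0.1))  # ≈ 1.0
```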

  5. The t/z-value variation indicator as a significance variation indicator measures the standard deviation of t/z-values of all the robustness analysis paths for each outcome $j$. Here, the $zscore_i$ may incorporate the original result as one robustness analysis path.

$$ I_{5j} = sd(zscore_i) $$

applied separately to $pval^{orig}_j \le \alpha^{orig}$ and $pval^{orig}_j > \alpha^{orig}$.

:point_right: This absolute indicator is intended to capture the variation in the statistical significance across robustness results.

Interpretation: $I_{5j}$ simply reports the standard deviation of t/z-values of all the robustness analysis paths for outcome $j$ as a measure of variation in statistical significance. Higher values indicate higher levels of variation.
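A toy computation of $I_{5j}$ (illustrative values; again the sample standard deviation):

```python
from statistics import stdev  # sample standard deviation, as in Stata

def i5(zscores):
    """Standard deviation of robustness t/z-values (I_5j)."""
    return stdev(zscores)

print(i5([1.0, 2.0, 3.0]))  # 1.0
```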

The following shows an example of the Reproducibility and Replicability Indicators table, indicating the five indicators as outlined above. The indicators are grouped by whether the original result for the respective outcome was reported as statistically significant or not. Each of these two sets of outcomes also includes the average across the respective outcomes.

repframe indicators table example    

Robustness Dashboard

A key feature of the dashboard is that indicators are tailored to specific sub-groups of analysis paths. It distinguishes between whether original and robustness results are statistically significant or not, and by whether significant robustness results have the same or an opposite sign when compared to the original result. For example, the relative effect size indicator is only derived for robustness analysis paths that are statistically significant and in the same direction as the original result. The idea is to restrict the indicator to those analysis paths for which it is most meaningful (instead of averaging the indicator across all analysis paths for the respective outcome), and to limit the information for the other analysis paths — those that are statistically insignificant or statistically significant but in the opposite direction — to a simpler and more parsimonious set of indicators.

The dashboard includes up to nine indicators. The core set is composed of four default indicators, $I´_1$ to $I´_4$, with two conditional indicators, $I´_5$ and $I´_6$, and an extended version of the dashboard additionally includes indicators $I´_7$ to $I´_9$.
A general difference to the indicators included in the Reproducibility and Replicability Indicators table is that the same level of statistical significance is applied to original and robustness results. The motivation is to separate technical and classification agreement of results as defined above and outlined in the description of the first two indicators.

In the same vein as for indicators presented in the Reproducibility and Replicability Indicators table, aggregation across outcomes at the study level (across studies) is simply done by averaging the indicators as computed at outcome (study) level, separately for originally significant and originally insignificant outcomes. An example of a Robustness Dashboard at study level is provided at the end of this section.

  1. The significance agreement indicator is derived for each outcome $j$ in a similar way as the statistical significance indicator from the Reproducibility and Replicability Indicators table. The only differences are that (i) the indicator is the same for statistically significant and insignificant robustness results and that (ii) the same significance level $\alpha$ is applied to the original results and to the robustness results. The indicator is expressed in \% of all robustness results on either statistically significant or insignificant original results and, hence, additionally multiplied by 100. For statistically significant robustness results with same sign, the indicator is calculated as follows:

$$ I´_{1j} = mean(\mathbb{I}(pval_i \le \alpha) \times \mathbb{I}(\beta_i \times \beta^{orig}_j \ge 0)) \times 100 $$

applied separately to $pval^{orig}_j \le \alpha$ and $pval^{orig}_j > \alpha$. The same indicator is also calculated for statistically significant robustness results with opposite sign, i.e. differing from the above formula through $\mathbb{I}(\beta_i \times \beta^{orig}_j < 0)$. For statistically insignificant robustness results, the indicator corresponds to 100 minus these two indicators on statistically significant results with same and opposite sign.

:point_right: This proportion indicator is intended to capture the technical agreement of results (are estimates robust in terms of achieving a certain level of statistical significance?).

Interpretation: An indicator $I´_{1j}$ of 30\% implies that 30\% of robustness analysis paths for outcome $j$ are statistically significant. Depending on which of the four sub-indicators of the Robustness Dashboard one is referring to, this refers to (i) statistically significant or insignificant original results and to (ii) original and robustness coefficients that share or do not share the same sign. For example, if $I´_{1j}$ is 30\% for results with the same sign and 3\% for results with opposite signs, the remaining 67\% of robustness analysis paths for this outcome are statistically insignificant. The significance levels applied to the original study and the robustness analysis are identical and correspond to the one defined in the robustness analysis.
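The three sub-shares of $I´_{1j}$ necessarily sum to 100%, as this toy computation shows (illustrative values and function name):

```python
def dashboard_shares(pvals, betas, beta_orig, alpha):
    """I'_1j sub-shares in %: significant with same sign, significant with
    opposite sign, and (as the remainder) insignificant robustness results."""
    n = len(pvals)
    sig_same = 100 * sum((p <= alpha) and (b * beta_orig >= 0)
                         for p, b in zip(pvals, betas)) / n
    sig_opp = 100 * sum((p <= alpha) and (b * beta_orig < 0)
                        for p, b in zip(pvals, betas)) / n
    return sig_same, sig_opp, 100 - sig_same - sig_opp

print(dashboard_shares([0.01, 0.03, 0.2, 0.7],
                       [0.5, -0.4, 0.3, 0.2],
                       beta_orig=0.6, alpha=0.05))  # (25.0, 25.0, 50.0)
```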

  2. The relative effect size indicator differs from $I_{2j}$ from the Reproducibility and Replicability Indicators table in that it is only derived for robustness analysis paths that are (i) statistically significant and (ii) in the same direction as the original result. In addition, the indicator takes the median of the robustness coefficients instead of the mean, in order to be less sensitive to outliers. Furthermore, one is subtracted from the ratio, in order to underscore the relative nature of the indicator. A ratio of 2/5 thus turns into -3/5, and multiplied by 100 to -60\%.

$$ I´_{2j} = (\frac{median(\beta_i)} {\beta^{orig}_j} - 1) \times 100 \quad \text{if } pval^{orig}_j \le \alpha \land pval_i \le \alpha \land \beta_i \times \beta^{orig}_j \ge 0 $$

$$ I´_{2j} \text{ not applicable otherwise} $$

:point_right: This ratio indicator is intended to capture how effect sizes of robustness analyses compare to the original effect sizes. The indicator focuses on the case where a comparison of effect sizes is most relevant and interpretable, that is when both the original and robustness results are statistically significant and in the same direction.

Interpretation: An indicator $I´_{2j}$ above (below) 0\% for outcome $j$ with an originally significant result means that the median of statistically significant robustness coefficients in the same direction as the original result is higher (lower) than the original coefficient, by $I´_{2j}$\% — e.g. +30\% (-30\%).
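A toy computation of $I´_{2j}$ (illustrative values and function name):

```python
from statistics import median

def i2_dash(betas_sig_same, beta_orig):
    """I'_2j: median of qualifying robustness coefficients relative to the
    original coefficient, minus one, expressed in %."""
    return (median(betas_sig_same) / beta_orig - 1) * 100

print(i2_dash([0.25, 0.375, 0.5], beta_orig=0.5))  # -25.0
```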

The Robustness Dashboard does not include a relative significance indicator.

  3. The effect size variation indicator measures the mean absolute deviation of coefficients in robustness analysis paths from their median. Like $I´_{2j}$, it only considers robustness analysis paths for outcomes reported as statistically significant that are (i) statistically significant and (ii) in the same direction as the original result. The mean value is divided by the original coefficient and multiplied by 100 so that it is measured in the same unit as $I´_{2j}$. Here, the $\beta_i$ may incorporate the original result as one robustness analysis path. The indicator requires that the effect sizes of the original and robustness results are measured in the same units.

$$ I´_{3j} = \frac{mean(\mid \beta_i - median(\beta_i) \mid)} {\beta^{orig}_j} \times 100 \quad \text{if } pval^{orig}_j \le \alpha \land pval_i \le \alpha \land \beta_i \times \beta^{orig}_j \ge 0 $$

$$ I´_{3j} \text{ not applicable otherwise} $$

:point_right: This ratio indicator is intended to capture the variation in coefficients of robustness results relative to the size of the original coefficient. The indicator complements $I´_{2j}$, focusing on the case of original and robustness results that are statistically significant and in the same direction.

Interpretation: An indicator $I´_{3j}$ of, for example, 10\% means that variation across robustness results for outcome $j$ is equivalent to 10\% of the original coefficient.
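A toy computation of $I´_{3j}$ (illustrative values and function name):

```python
from statistics import median

def i3_dash(betas, beta_orig):
    """I'_3j: mean absolute deviation of qualifying robustness coefficients
    from their median, relative to the original coefficient, in %."""
    m = median(betas)
    mad = sum(abs(b - m) for b in betas) / len(betas)
    return mad / beta_orig * 100

print(i3_dash([0.0, 0.25, 0.75, 1.0], beta_orig=0.5))  # 75.0
```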

  4. The significance variation indicator measures the mean of the deviations between p-values from the robustness analysis paths and the original p-value. This indicator is always derived, except for robustness and original results that are both statistically significant, since the deviation is known to be small in that case.

$$ I´_{4j} = mean(\mid pval_i - pval^{orig}_j \mid) $$

applied separately to (i) $pval^{orig}_j \le \alpha \land pval_i > \alpha$, (ii) $pval^{orig}_j > \alpha \land pval_i > \alpha$, and (iii) $pval^{orig}_j > \alpha \land pval_i \le \alpha$.

:point_right: This absolute indicator is intended to capture the variation in statistical significance across robustness results that are or turned statistically insignificant.

Interpretation: An indicator $I´_{4j}$ of 0.2, for example, implies that p-values among certain robustness analysis paths for outcome $j$ on average differ by 0.2 from the original p-value. Depending on which of the three sub-indicators of the Robustness Dashboard one is referring to, this refers to the case of (i) a significant original result and insignificant robustness results, (ii) an insignificant original result and insignificant robustness results, or (iii) an insignificant original result and significant robustness results. Like p-values themselves, this deviation may assume values between 0 (very small deviation) and 1 (maximum deviation).
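A toy computation of $I´_{4j}$ within one sub-group (illustrative values and function name):

```python
def i4_dash(pvals, pval_orig):
    """I'_4j: mean absolute deviation of robustness p-values from the
    original p-value, computed within one of the three sub-groups."""
    return sum(abs(p - pval_orig) for p in pvals) / len(pvals)

# sub-group (i): significant original result, insignificant robustness paths
print(i4_dash([0.26, 0.51], pval_orig=0.01))  # 0.375
```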

  5. The indicator on non-agreement due to significance classification is an indicator that focuses on classification robustness of results as defined above. It applies only in situations in which an original study applied a different — more or less stringent — classification of what constitutes a statistically significant result than the robustness analysis. Specifically, it identifies those originally significant (insignificant) results that have statistically insignificant (significant) robustness analysis paths only because a more (less) stringent significance level definition is applied in the robustness analysis than in the original study. The indicator is also expressed in \% and therefore includes the multiplication by 100. For the case where a more stringent significance level definition is applied in the robustness analysis, the indicator is calculated as follows.

$$ I´_{5j} = mean(\mathbb{I}(\alpha < pval_i \le \alpha^{orig})) \times 100 \quad \text{if } \alpha < pval^{orig}_j \le \alpha^{orig} $$

$$ I´_{5j} \text{ not applicable otherwise} $$

In the opposite case, with a less stringent significance level definition applied in the robustness analysis, the same formula applies with opposite signs.

:point_right: This proportion indicator is intended to capture non-robustness of results reported as (in)significant in original studies that is due to differences in the classification of statistical significance.

Interpretation: Consider the case where the robustness analysis paths apply a significance level of 5\% and the original analysis applied a less strict significance level of 10\%. In this case, robustness results with $0.05 < pval_i \le 0.10$ are categorized as insignificant, and thus as having a non-agreeing significance level, only because of differing definitions of statistical significance. An indicator $I´_{5j}$ of 10\%, for example, implies that this holds true for 10\% of robustness analysis paths for outcome $j$.
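The stricter-robustness-level case of $I´_{5j}$ can be computed as follows (illustrative values and function name):

```python
def i5_dash(pvals, alpha, alpha_orig):
    """I'_5j for a stricter robustness-level alpha: share (in %) of paths
    that are insignificant only because alpha < pval_i <= alpha_orig."""
    return 100 * sum(alpha < p <= alpha_orig for p in pvals) / len(pvals)

print(i5_dash([0.03, 0.07, 0.09, 0.20], alpha=0.05, alpha_orig=0.10))  # 50.0
```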

  6. The significance classification agreement indicator aggregates the information on significance classification agreement between the robustness results and the original results.

$$ I´_{6j} = I´_{1j}^{ssign} \times 100 \quad \text{if } pval^{orig}_j \le \alpha^{orig} $$

$$ I´_{6j} = (1 - I´_{1j}^{ssign} - I´_{1j}^{nsign}) \times 100 \quad \text{if } pval^{orig}_j > \alpha^{orig} $$

where ssign refers to $I´_{1j}$ when derived for robustness results with the same sign, and nsign when derived for robustness results with the opposite sign.

The indicator presented in the Robustness Dashboard is the average across all outcomes or studies.

$$ I´_{6} = mean(I´_{6j}) $$

This is different from the other indicators, as it is not differentiated by whether outcomes are originally significant or insignificant.

In cases where the robustness analysis and original study or studies applied different significance levels, the Robustness Dashboard additionally shows this indicator when applying a uniform significance level, that is when the formulae include $\alpha$ instead of $\alpha^{orig}$. Both indicators have their advantages and disadvantages. Consider the example with $pval^{orig}_j$=0.07, $\alpha$=0.05, and $\alpha^{orig}$=0.10. Here, the former indicator would categorize robustness analysis paths with equal p-values of $pval_i$=0.07 as non-confirmatory, whereas the latter indicator would categorize robustness analysis paths with lower p-values of $pval_i$=0.04 as non-confirmatory, both of which can be seen as contrary to common intuition. It is therefore generally recommended to use the same significance level in a robustness analysis as in the original study or studies (if the latter differ among each other, the less stringent significance level is to be chosen).

:point_right: This proportion indicator is intended to capture to which degree statistical significance as reported in original studies is confirmed through the robustness analyses — where the classification of statistical significance may differ from that of the original study or not.

Interpretation: An indicator $I´_{6}$ of 80\% implies that the classification into significant or insignificant in robustness analysis paths confirms the classification by original authors in 80\% of cases when averaged over individual outcomes (studies).

Extension of the Robustness Dashboard

The Robustness Dashboard additionally includes the option extended(string) to show two types of indicators in an extended set of indicators.

  7. The effect size agreement indicator measures the share of robustness coefficients that lie inside the bounds of the confidence interval of the original coefficient, $\beta(cilo)^{orig}_j$ and $\beta(ciup)^{orig}_j$. It only considers statistically insignificant robustness analysis paths for outcomes reported as statistically significant in the original study. The indicator requires that the effect sizes of the original and robustness results are measured in the same units.

$$ I´_{7j} = mean(\mathbb{I}(\beta(cilo)^{orig}_j \le \beta_i \le \beta(ciup)^{orig}_j)) \times 100 \quad \text{if } pval^{orig}_j \le \alpha \land pval_i > \alpha $$

$$ I´_{7j} \text{ not applicable otherwise} $$

:point_right: This proportion indicator is intended to complement the significance agreement indicator and thereby to capture technical agreement of results not only in terms of achieving a certain but arbitrary level of statistical significance, but also in terms of showing similarity of coefficients.

Interpretation: An indicator $I´_{7j}$ of 10\% implies that 10\% of robustness analysis paths for this outcome $j$ with originally significant results are insignificant according to the significance level adopted by the robustness analysis, but have coefficients that lie inside the confidence interval of the original result. The closer these 10\% are to the share of statistically insignificant robustness analysis paths for this outcome, the less this indicator confirms the statistical significance indicator. For example, if the share of statistically insignificant robustness analysis paths for this outcome is 15\%, two-thirds of these analysis paths are non-confirmatory according to the statistical significance indicator but confirmatory according to the effect size agreement indicator.
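The formula for $I´_{7j}$ can be illustrated with a short sketch on toy data. This is not repframe's internal code; the function name `effect_size_agreement` and the numbers are hypothetical, chosen only to exercise the definition above.

```python
# Illustrative sketch of the effect size agreement indicator I'_7j:
# among robustness paths that are insignificant while the original result
# was significant, the share of coefficients inside the original CI.

def effect_size_agreement(betas, pvals, ci_lo_orig, ci_up_orig,
                          pval_orig, alpha):
    # Not applicable when the original result is insignificant at alpha.
    if pval_orig > alpha:
        return None
    # Only statistically insignificant robustness paths are considered.
    eligible = [b for b, p in zip(betas, pvals) if p > alpha]
    if not eligible:
        return None
    inside = [ci_lo_orig <= b <= ci_up_orig for b in eligible]
    return 100 * sum(inside) / len(eligible)

# Toy multiverse: 4 insignificant paths, 2 of them inside the original CI.
betas = [0.25, 0.30, 0.40, 0.55, 0.30, 0.05]
pvals = [0.20, 0.30, 0.02, 0.60, 0.01, 0.70]
print(effect_size_agreement(betas, pvals, 0.20, 0.50, 0.03, 0.05))  # 50.0
```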

  2. The significance switch indicators include two sub-indicators for originally significant and insignificant results, respectively. For originally significant results, these indicators measure the share of robustness coefficients (standard errors) that are sufficiently small (large) to have turned the result insignificant when standard errors (coefficients) are held at their values in the original study. Whether absolute values of coefficients (standard errors) are sufficiently small (large) is determined based on the threshold values $\beta(tonsig)_j$ and $se(tonsig)_j$. The indicators require that the effect sizes of the original and robustness results are measured in the same units.

$$ I´_{8j} = mean(\mathbb{I}(\mid \beta_i \mid \le \beta(tonsig)_j)) \times 100 \quad \text{if } pval^{orig}_j \le \alpha \land pval_i > \alpha $$

$$ I´_{9j} = mean(\mathbb{I}(se_i \ge se(tonsig)_j)) \times 100 \quad \text{if } pval^{orig}_j \le \alpha \land pval_i > \alpha $$

The indicators for originally insignificant results are a mirror image of those for originally significant results: now the indicators measure the shares of robustness coefficients (standard errors) that are sufficiently large (small) to have turned results significant, applying threshold values $\beta(tosig)_j$ and $se(tosig)_j$, respectively.

$$ I´_{8j} = mean(\mathbb{I}(\mid \beta_i \mid > \beta(tosig)_j)) \times 100 \quad \text{if } pval^{orig}_j > \alpha \land pval_i \le \alpha $$

$$ I´_{9j} = mean(\mathbb{I}(se_i < se(tosig)_j)) \times 100 \quad \text{if } pval^{orig}_j > \alpha \land pval_i \le \alpha $$

:point_right: These proportion indicators are intended to capture the drivers behind changes in statistical significance between original study and robustness analysis.

Interpretation: An indicator $I´_{8j}$ of, for example, 30\% for an outcome $j$ with an originally significant result implies that 30\% of the statistically insignificant robustness analysis paths have coefficients that are sufficiently small for the path to be statistically insignificant even if the standard error were identical to that in the original study. The other (sub-)indicators can be interpreted analogously.
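For originally significant results, the pair $I´_{8j}$ and $I´_{9j}$ can be sketched as follows. This is a toy illustration, not repframe code: the function name `sig_switch` is hypothetical, and the thresholds $\beta(tonsig)_j$ and $se(tonsig)_j$ are taken as given inputs, as in the formulas above.

```python
# Illustrative sketch of the significance switch indicators I'_8j and I'_9j
# for an originally significant result: among insignificant robustness paths,
# the share with |beta| small enough (I'_8j), or with a standard error large
# enough (I'_9j), to have flipped the original result to insignificant.

def sig_switch(betas, ses, pvals, alpha, beta_tonsig, se_tonsig):
    # Only statistically insignificant robustness paths are considered.
    insig = [(b, s) for b, s, p in zip(betas, ses, pvals) if p > alpha]
    if not insig:
        return None, None
    i8 = 100 * sum(abs(b) <= beta_tonsig for b, _ in insig) / len(insig)
    i9 = 100 * sum(s >= se_tonsig for _, s in insig) / len(insig)
    return i8, i9

betas = [0.05, 0.30, 0.10, 0.45]
ses   = [0.20, 0.25, 0.08, 0.10]
pvals = [0.60, 0.25, 0.30, 0.01]   # three insignificant paths at alpha=0.05
print(sig_switch(betas, ses, pvals, 0.05, beta_tonsig=0.15, se_tonsig=0.15))
```

With these toy numbers, two of the three insignificant paths have $\vert\beta_i\vert \le \beta(tonsig)_j$ and two have $se_i \ge se(tonsig)_j$, so both sub-indicators come out at roughly 67\%.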

The Dashboard output

The following shows an example of the Robustness Dashboard, indicating where the indicators outlined above can be found in the figure. Indicators from the extended set are in lighter blue. The vertical axis of the dashboard shows individual outcomes, grouped into statistically significant and insignificant if aggregated. Note that this grouping may differ from the one in the Reproducibility and Replicability Indicators table, because that table applies to original results the significance level defined by original authors, whereas the dashboard applies the same significance level as adopted in the robustness analysis. The horizontal axis distinguishes between statistically significant and insignificant robustness results, additionally differentiating between statistically significant results in the same and in opposite direction. Circle sizes illustrate $I´_{1j}$, the significance agreement indicator. They are coloured in either darker blue for confirmatory results or in lighter blue for non-confirmatory results. As can be seen with Outcome 3 in the figure, this colouring also discriminates $I´_{1j}$ from $I´_{5j}$, the indicator on non-agreement due to significance classification.

When aggregating across outcomes or studies, the bottom of the dashboard additionally includes a histogram with the share of confirmatory results and absolute values of effect sizes.

In the results window of Stata, the repframe command provides as additional information the (minimum and maximum) number of specifications that have been used to derive the dashboard indicators.

repframe Robustness Dashboard example  

 

repframe Robustness Dashboard example, aggregated  

Summary

The following table summarizes which indicators are included in the Reproducibility and Replicability Indicators table and the Robustness Dashboard.

| Type of indicator | | Reproducibility and Replicability Indicators table | Robustness Dashboard | Symbol in Dashboard |
|---|---|---|---|---|
| significance (sig.) | sig. agreement | $I_{1}$ | $I´_{1}$ | (main figure in dashboard, no symbol) |
| | relative sig. | $I_{3}$ | - | - |
| | sig. variation | $I_{5}$ | $I´_{4}$ | $\overline{\Delta p}$ (mean abs. var. of p-value) |
| | sig. classification agreement | - | $I´_{5}$ (if different sig. levels) | p $\le$ ${\alpha}^o$ (less stringent sig. level applied in original study) or p > ${\alpha}^o$ (more stringent sig. level applied in original study) |
| | overall sig. (and sig. classification) agreement | - | $I´_{6}$ (if aggregated) | $\overline{\kappa}$ (mean share of confirmatory results) |
| | sig. switch | - | $I´_{8}$ & $I´_{9}$ (ext.) | high/ low $\vert\beta\vert$ (abs. value of ${\beta}$) and ${se}$ |
| effect size (e.s.) | e.s. agreement | - | $I´_{7}$ (ext.) | ${\beta}$ in ${CI}(\beta^o)$ (confidence interval of orig. ${\beta}$) |
| | relative e.s. | $I_{2}$ | $I´_{2}$ | $\widetilde{\beta}$ (median ${\beta}$) |
| | e.s. variation | $I_{4}$ | $I´_{3}$ | $\overline{\Delta\beta}$ (mean abs. var. of ${\beta}$) |

Update log

2024-06-03, v1.5.2:

2024-03-17, v1.5.1:

2024-03-05, v1.5:

2024-03-04, v1.4.2:

2024-02-29, v1.4.1:

2024-02-28, v1.4:

2024-02-13, v1.3.1:

2024-01-22, v1.3:

2024-01-19, v1.2:

2024-01-18, v1.1:

References

Dreber, A., & Johannesson, M. (2023). A Framework for Evaluating Reproducibility and Replicability in Economics. Available at SSRN.

Mathur, M. B., & VanderWeele, T. J. (2020). New statistical metrics for multisite replication projects. Journal of the Royal Statistical Society Series A: Statistics in Society, 183(3), 1145-1166.

Pawel, S., & Held, L. (2022). The sceptical Bayes factor for the assessment of replication success. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(3), 879-911.