py-econometrics / pyfixest

Fast High-Dimensional Fixed Effects Regression in Python following fixest-syntax
https://py-econometrics.github.io/pyfixest/
MIT License

Randomization inference, "ri" sampling_method in rwolf, gives too tight a sample of null t-statistics #717

Open marcandre259 opened 2 days ago

marcandre259 commented 2 days ago

Possible issue I noticed while working on #698.

The behavior was initially noticed when comparing the "wild-bootstrap" and "ri" sampling_method p-values in a setting where the parameter of interest has no association with the outcome.

Because the null t-distribution is too tight, the resulting p-value is too small.
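To make the mechanism concrete, here is a minimal sketch in plain numpy (the names are illustrative, not pyfixest API) of how an overly tight null distribution deflates a two-sided randomization p-value:

import numpy as np

def ri_pvalue(t_obs, t_null):
    # Two-sided randomization p-value: fraction of null t-statistics
    # at least as extreme as the observed one.
    return np.mean(np.abs(t_null) >= np.abs(t_obs))

rng = np.random.default_rng(0)
t_obs = 1.5
wide = rng.normal(0.0, 1.0, 9_999)   # null draws with the correct spread
tight = rng.normal(0.0, 0.5, 9_999)  # artificially tight null draws

print(ri_pvalue(t_obs, wide))   # ~0.13, close to the analytic two-sided p-value
print(ri_pvalue(t_obs, tight))  # ~0.003: spuriously significant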

To reproduce:

import pyfixest as pf
import numpy as np
import matplotlib.pyplot as plt

# Get data and permute X1 so that it has no true association with Y
data = pf.get_data()

rng = np.random.default_rng(232)
data["X1"] = rng.choice(data["X1"], size=data.shape[0], replace=False)

fit = pf.feols("Y ~ X1", data=data)

fit.summary()

Estimation: OLS, Dep. var.: Y, Fixed effects: 0, Inference: iid, Observations: 998

| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) |   2.5% |  97.5% |
|-------------|---------:|-----------:|--------:|---------:|-------:|-------:|
| Intercept   |   -0.160 |      0.119 |  -1.344 |    0.179 | -0.394 |  0.074 |
| X1          |    0.033 |      0.090 |   0.367 |    0.714 | -0.144 |  0.211 |

RMSE: 2.304  R2: 0.0

seed = 111

# Wild bootstrap, returning the full sample of bootstrapped t-statistics
df_wild, df_t_wild = fit.wildboottest(param="X1", reps=9999, return_bootstrapped_t_stats=True, seed=seed)

# Randomization inference with the randomization-t statistic, storing the null t-statistics
rng = np.random.default_rng(232)
fit.ritest(resampvar="X1", reps=9999, type="randomization-t", store_ritest_statistics=True, rng=rng)

# Plot the two empirical null t-distributions side by side
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
ax[0].hist(fit._ritest_statistics, label="RI t stats", alpha=0.4)
ax[0].axvline(x=fit._ritest_sample_stat, linestyle="--", label="Observed RI t stat", color="black")
ax[0].legend()
ax[1].hist(df_t_wild, label="Wild t stats", alpha=0.4, color="orange")
ax[1].axvline(df_wild["t value"], label="Observed Wild t stat", color="black", linestyle="--")
ax[1].legend()

[Figure: comparing the empirical null t-distributions - RI t-statistics (left) vs. wild bootstrap t-statistics (right)]
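For reference, a hand-rolled randomization-t loop, continuing from the snippet above (a sketch only - this is not pyfixest's internal implementation, which may differ e.g. in vectorization and degrees-of-freedom handling):

t_null = np.empty(999)
rng2 = np.random.default_rng(232)
perm_data = data.copy()
for b in range(t_null.size):
    # Re-randomize X1 and re-estimate, collecting the t-statistic under the null
    perm_data["X1"] = rng2.permutation(data["X1"].to_numpy())
    t_null[b] = pf.feols("Y ~ X1", data=perm_data).tstat()["X1"]

# Under the null, the spread of these t-statistics should be close to 1
print(np.std(t_null))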

s3alfisc commented 2 days ago

Yes, this looks wrong! I'll take a look later. Thanks for reporting!

s3alfisc commented 1 day ago

On second thought, this might not necessarily be a bug. I'll have to think about this more - I took a look at the code and it looked mostly fine, though I'll have to check again. The difference in the widths of the two sampling distributions does indeed look suspicious.
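One quick diagnostic, reusing the fit and df_t_wild objects from the reproduction above (a sketch, assuming both empirical null t-distributions should have a standard deviation near 1 under the null):

import numpy as np

print(np.std(fit._ritest_statistics))  # spread of the RI null t-statistics
print(np.std(df_t_wild))               # spread of the wild bootstrap null t-statistics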

marcandre259 commented 1 day ago

Hi @s3alfisc,

Based on testing the sharp hypothesis with randomization inference, I would expect the bootstrap approach to be the less conservative one <- edit: actually I'd expect the opposite, since it should be easier to reject for at least one i than for the average. Nevertheless, the paper below reports the counterintuitive result that the randomization (sharp) approach is less powerful (a paradox).

I'm quickly peeking into this paper, which confirms this with simulations in Table 1.

As for progress on #698, I'll get back to including RI for Westfall-Young now that this issue is open.