ncn-foreigners / nonprobsvy

An R package for modern methods for non-probability surveys
https://ncn-foreigners.github.io/nonprobsvy/
Other
29 stars 4 forks source link
lasso-regression nonprobability-sampling survey

nonprobsvy: an R package for modern statistical inference methods based on non-probability samples

R-CMD-check Codecov test
coverage DOI CRAN
status CRAN
downloads

Basic information

The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:

The package allows for:

Details on use of the package be found:

Installation

You can install the recent version of nonprobsvy package from main branch Github with:

remotes::install_github("ncn-foreigners/nonprobsvy")

or install the stable version from CRAN

install.packages("nonprobsvy")

or development version from the dev branch

remotes::install_github("ncn-foreigners/nonprobsvy@dev")

Basic idea

Consider the following setting where two samples are available: non-probability (denoted as $S_A$ ) and probability (denoted as $S_B$) where set of auxiliary variables (denoted as $\boldsymbol{X}$) is available for both sources while $Y$ and $\boldsymbol{d}$ (or $\boldsymbol{w}$) is present only in probability sample.

Sample Auxiliary variables $\boldsymbol{X}$ Target variable $Y$ Design ($\boldsymbol{d}$) or calibrated ($\boldsymbol{w}$) weights
$S_A$ (non-probability) 1 $\checkmark$ $\checkmark$ ?
$\checkmark$ $\checkmark$ ?
$n_A$ $\checkmark$ $\checkmark$ ?
$S_B$ (probability) $n_A+1$ $\checkmark$ ? $\checkmark$
$\checkmark$ ? $\checkmark$
$n_A+n_B$ $\checkmark$ ? $\checkmark$

Basic functionalities

Suppose $Y$ is the target variable, $\boldsymbol{X}$ is a matrix of auxiliary variables, $R$ is the inclusion indicator. Then, if we are interested in estimating the mean $\bar{\tau}_Y$ or the sum $\tau_Y$ of the of the target variable given the observed data set $(y_k, \boldsymbol{x}_k, R_k)$, we can approach this problem with the possible scenarios:

When unit-level data is available for non-probability survey only

Estimator Example code
Mass imputation based on regression imputation ``` r nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, pop_totals = c(`(Intercept)`= N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), method_outcome = "glm", family_outcome = "gaussian" ) ```
Inverse probability weighting ``` r nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, pop_totals = c(`(Intercept)` = N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), method_selection = "logit" ) ```
Inverse probability weighting with calibration constraint ``` r nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, pop_totals = c(`(Intercept)`= N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), method_selection = "logit", control_selection = controlSel(est_method_sel = "gee", h = 1) ) ```
Doubly robust estimator ``` r nonprob( selection = ~ x1 + x2 + ... + xk, outcome = y ~ x1 + x2 + …, + xk, pop_totals = c(`(Intercept)` = N, x1 = tau_x1, x2 = tau_x2, ..., xk = tau_xk), svydesign = prob, method_outcome = "glm", family_outcome = "gaussian" ) ```

When unit-level data are available for both surveys

Estimator Example code
Mass imputation based on regression imputation ``` r nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "glm", family_outcome = "gaussian" ) ```
Mass imputation based on nearest neighbour imputation ``` r nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "nn", family_outcome = "gaussian", control_outcome = controlOutcome(k = 2) ) ```
Mass imputation based on predictive mean matching ``` r nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "pmm", family_outcome = "gaussian" ) ```
Mass imputation based on regression imputation with variable selection (LASSO) ``` r nonprob( outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "pmm", family_outcome = "gaussian", control_outcome = controlOut(penalty = "lasso"), control_inference = controlInf(vars_selection = TRUE) ) ```
Inverse probability weighting ``` r nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, svydesign = prob, method_selection = "logit" ) ```
Inverse probability weighting with calibration constraint ``` r nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, svydesign = prob, method_selection = "logit", control_selection = controlSel(est_method_sel = "gee", h = 1) ) ```
Inverse probability weighting with calibration constraint with variable selection (SCAD) ``` r nonprob( selection = ~ x1 + x2 + ... + xk, target = ~ y, data = nonprob, svydesign = prob, method_outcome = "pmm", family_outcome = "gaussian", control_inference = controlInf(vars_selection = TRUE) ) ```
Doubly robust estimator ``` r nonprob( selection = ~ x1 + x2 + ... + xk, outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "glm", family_outcome = "gaussian" ) ```
Doubly robust estimator with variable selection (SCAD) and bias minimization ``` r nonprob( selection = ~ x1 + x2 + ... + xk, outcome = y ~ x1 + x2 + ... + xk, data = nonprob, svydesign = prob, method_outcome = "glm", family_outcome = "gaussian", control_inference = controlInf( vars_selection = TRUE, bias_correction = TRUE ) ) ```

Examples

Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]

library(survey)
library(nonprobsvy)

set.seed(1234567890)
N <- 1e6 ## 1000000
n <- 1000
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs)
base_w_bd <- N/sum(population$flag_bd1)

Declare svydesign object with survey package

sample_prob <- svydesign(ids= ~1, weights = ~ base_w_srs, 
                         data = subset(population, flag_srs == 1))

Estimate population mean of y1 based on doubly robust estimator using IPW with calibration constraints.

result_dr <- nonprob(
  selection = ~ x2,
  outcome = y1 ~ x1 + x2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)

Results

summary(result_dr)
#> 
#> Call:
#> nonprob(data = subset(population, flag_bd1 == 1), selection = ~x2, 
#>     outcome = y1 ~ x1 + x2, svydesign = sample_prob)
#> 
#> -------------------------
#> Estimated population mean: 2.95 with overall std.err of: 0.04195
#> And std.err for nonprobability and probability samples being respectively:
#> 0.000783 and 0.04195
#> 
#> 95% Confidence inverval for popualtion mean:
#>    lower_bound upper_bound
#> y1    2.867789     3.03224
#> 
#> 
#> Based on: Doubly-Robust method
#> For a population of estimate size: 1025063
#> Obtained on a nonprobability sample of size: 693011
#> With an auxiliary probability sample of size: 1000
#> -------------------------
#> 
#> Regression coefficients:
#> -----------------------
#> For glm regression on outcome variable:
#>             Estimate Std. Error z value P(>|z|)    
#> (Intercept) 0.996282   0.002139   465.8  <2e-16 ***
#> x1          1.001931   0.001200   835.3  <2e-16 ***
#> x2          0.999125   0.001098   910.2  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> -----------------------
#> For glm regression on selection variable:
#>              Estimate Std. Error z value P(>|z|)    
#> (Intercept) -0.498997   0.003702  -134.8  <2e-16 ***
#> x2           1.885629   0.005303   355.6  <2e-16 ***
#> -------------------------
#> 
#> Weights:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.071   1.313   1.479   1.798   2.647 
#> -------------------------
#> 
#> Residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.99999  0.06603  0.23778  0.26046  0.44358  0.62222 
#> 
#> AIC: 1010622
#> BIC: 1010645
#> Log-Likelihood: -505309 on 694009 Degrees of freedom

Mass imputation estimator

result_mi <- nonprob(
  outcome = y1 ~ x1 + x2,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob
)

Results

summary(result_mi)
#> 
#> Call:
#> nonprob(data = subset(population, flag_bd1 == 1), outcome = y1 ~ 
#>     x1 + x2, svydesign = sample_prob)
#> 
#> -------------------------
#> Estimated population mean: 2.95 with overall std.err of: 0.04203
#> And std.err for nonprobability and probability samples being respectively:
#> 0.001227 and 0.04201
#> 
#> 95% Confidence inverval for popualtion mean:
#>    lower_bound upper_bound
#> y1    2.867433    3.032186
#> 
#> 
#> Based on: Mass Imputation method
#> For a population of estimate size: 1e+06
#> Obtained on a nonprobability sample of size: 693011
#> With an auxiliary probability sample of size: 1000
#> -------------------------
#> 
#> Regression coefficients:
#> -----------------------
#> For glm regression on outcome variable:
#>             Estimate Std. Error z value P(>|z|)    
#> (Intercept) 0.996282   0.002139   465.8  <2e-16 ***
#> x1          1.001931   0.001200   835.3  <2e-16 ***
#> x2          0.999125   0.001098   910.2  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> -------------------------

Inverse probability weighting estimator

result_ipw <- nonprob(
  selection = ~ x2,
  target = ~y1,
  data = subset(population, flag_bd1 == 1),
  svydesign = sample_prob)

Results

summary(result_ipw)
#> 
#> Call:
#> nonprob(data = subset(population, flag_bd1 == 1), selection = ~x2, 
#>     target = ~y1, svydesign = sample_prob)
#> 
#> -------------------------
#> Estimated population mean: 2.925 with overall std.err of: 0.05
#> And std.err for nonprobability and probability samples being respectively:
#> 0.001586 and 0.04997
#> 
#> 95% Confidence inverval for popualtion mean:
#>    lower_bound upper_bound
#> y1     2.82679    3.022776
#> 
#> 
#> Based on: Inverse probability weighted method
#> For a population of estimate size: 1025063
#> Obtained on a nonprobability sample of size: 693011
#> With an auxiliary probability sample of size: 1000
#> -------------------------
#> 
#> Regression coefficients:
#> -----------------------
#> For glm regression on selection variable:
#>              Estimate Std. Error z value P(>|z|)    
#> (Intercept) -0.498997   0.003702  -134.8  <2e-16 ***
#> x2           1.885629   0.005303   355.6  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> -------------------------
#> 
#> Weights:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.000   1.071   1.313   1.479   1.798   2.647 
#> -------------------------
#> 
#> Residuals:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.99999  0.06603  0.23778  0.26046  0.44358  0.62222 
#> 
#> AIC: 1010622
#> BIC: 1010645
#> Log-Likelihood: -505309 on 694009 Degrees of freedom

Funding

Work on this package is supported by the National Science Centre, OPUS 22 grant no. 2020/39/B/HS4/00941.

References (selected)

Chen, Yilin, Pengfei Li, and Changbao Wu. 2020. “Doubly Robust Inference With Nonprobability Survey Samples.” *Journal of the American Statistical Association* 115 (532): 2011–21. .
Kim, Jae Kwang, Seho Park, Yilin Chen, and Changbao Wu. 2021. “Combining Non-Probability and Probability Survey Samples Through Mass Imputation.” *Journal of the Royal Statistical Society Series A: Statistics in Society* 184 (3): 941–63. .
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” *Journal of Statistical Software* 9 (1): 1–19.
———. 2023. “Survey: Analysis of Complex Survey Samples.”
Wu, Changbao. 2023. “Statistical Inference with Non-Probability Survey Samples.” *Survey Methodology* 48 (2): 283–311. .
Yang, Shu, Jae Kwang Kim, and Youngdeok Hwang. 2021. “Integration of Data from Probability Surveys and Big Found Data for Finite Population Inference Using Mass Imputation.” *Survey Methodology* 47 (1): 29–58. .
Yang, Shu, Jae Kwang Kim, and Rui Song. 2020. “Doubly Robust Inference When Combining Probability and Non-Probability Samples with High Dimensional Data.” *Journal of the Royal Statistical Society Series B: Statistical Methodology* 82 (2): 445–65. .