IPW and variable selection

BERENZ commented 12 months ago

Currently, the nonprobSel function applied to IPW estimator iterates over target parameter, i.e. variable selection is applied as many times as the number of variables specified in the target parameter. Here's an example

library(nonprobsvy)
library(sampling)

set.seed(2023-20-22)
N <- 10000
n_A <- 500
p <- 50
alpha_vec1 <- c(-2, 1, 1, 1,1, rep(0, p-5))
alpha_vec2 <- c(0,0,0,3,3,3,3, rep(0, p-7))
beta_vec <- c(1,0,0,1,1,1,1, rep(0, p-7))
X <- cbind("(Intercept)"=1, matrix(rnorm(N*(p-1)), nrow=N, byrow=T, 
                                   dimnames = list(NULL, paste0("X",1:(p-1)))))
X_formula  <- as.formula(paste("~", paste0("X",1:(p-1), collapse = " + ")))
X_totals <- colSums(X)
X_means <- colMeans(X[,-1])
Y_11 <- 1 + as.numeric(X %*% beta_vec) +   rnorm(N) ## OM I: linear model
Y_12 <- 1 + exp(3*sin(as.numeric(X %*% beta_vec))) + X[, "X5"] + X[, "X6"] + rnorm(N) ## OM II: nonlinear model
pi_Y_21 <- plogis(as.numeric(X %*% beta_vec)) ## OM III: linear model for binary Y
pi_Y_22 <- plogis(2 - log((X %*% beta_vec)^2) - 2*X[,"X5"] + 2*X[, "X6"]) ## OM IV: nonlinear model for binary Y
Y_21 <- rbinom(N, 1, prob = pi_Y_21)
Y_22 <- rbinom(N, 1, prob = pi_Y_22)
pi_A <- inclusionprobabilities(0.25 + abs(X[, "X1"]) + 0.03*abs(Y_11), n_A) ## inclusion based on Y_11 only 
pi_B1 <- plogis(as.numeric(X %*% alpha_vec1)) ## PSM I: linear probability
pi_B2 <- plogis(3.5 + as.numeric(log(X^2) %*% alpha_vec2) - 
                  sin(X[, "X3"] + X[, "X4"]) - X[,"X5"] + X[, "X6"]) ## PSM II: nonlinear 
flag_B1 <- rbinom(N, 1, prob = pi_B1)
flag_B2 <- rbinom(N, 1, prob = pi_B2)
flag_A <- UPpoisson(pik = pi_A)
pop_data <- data.frame(pi_A, flag_A, flag_B1, flag_B2, Y_11, Y_12, Y_21, Y_22, X[, 2:p])
sample_B1 <- subset(pop_data, flag_B1 == 1)
sample_B2 <- subset(pop_data, flag_B2 == 1)
sample_A <- subset(pop_data, flag_A == 1)
X_totals <- colSums(X)
X_means <- colMeans(X[,-1])
sample_A_svy <- svydesign(ids = ~ 1, 
                          probs = ~ pi_A, 
                          pps = poisson_sampling(sample_A$pi_A), 
                          data = sample_A)
sample_A_svy_cal <- calibrate(sample_A_svy, 
                              formula = as.formula(paste0("~", paste(names(X_totals)[2:p], collapse = "+"))),
                              population = X_totals, 
                              calfun = cal.raking)

## ipw estimator
y11_corr_one <- nonprob(selection = ~ X1 + X2 + X3 + X4,
                        target = ~ Y_11 + Y_12,
                        data = sample_B1,
                        svydesign = sample_A_svy_cal,
                        method_selection = "logit", 
                        control_inference = controlInf(vars_selection = TRUE), 
                        verbose = T)
## verbose
# Starting CV fold #1
# Starting CV fold #2
# Starting CV fold #3
# Starting CV fold #4
# Starting CV fold #5
# Starting CV fold #6
# Starting CV fold #7
# Starting CV fold #8
# Starting CV fold #9
# Starting CV fold #10
# Starting CV fold #1
# Starting CV fold #2
# Starting CV fold #3
# Starting CV fold #4
# Starting CV fold #5
# Starting CV fold #6
# Starting CV fold #7
# Starting CV fold #8
# Starting CV fold #9
# Starting CV fold #10

BERENZ commented 12 months ago

Furthermore, if I run the following code

y11_corr_one <- nonprob(selection = ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
                        target = ~ Y_11 + Y_12,
                        data = sample_B1,
                        svydesign = sample_A_svy_cal,
                        method_selection = "logit", 
                        control_inference = controlInf(vars_selection = TRUE), 
                        control_selection = controlSel(nfolds = 5, est_method_sel = "gee", h = 1),
                        verbose = T)

and then summary(y11_corr_one) variable names are missing

> y11_corr_one

Call:
nonprob(data = sample_B1, selection = ~X1 + X2 + X3 + X4 + X5 + 
    X6 + X7 + X8 + X9 + X10, target = ~Y_11 + Y_12, svydesign = sample_A_svy_cal, 
    method_selection = "logit", control_selection = controlSel(nfolds = 5, 
        est_method_sel = "gee", h = 1), control_inference = controlInf(vars_selection = TRUE), 
    verbose = T)

Estimated population mean with overall std.err and confidence interval:

         mean        SE lower_bound upper_bound
Y_11 2.133006 0.2454810    1.651872    2.614140
Y_12 6.624884 0.4378084    5.766795    7.482973
> summary(y11_corr_one)

Call:
nonprob(data = sample_B1, selection = ~X1 + X2 + X3 + X4 + X5 + 
    X6 + X7 + X8 + X9 + X10, target = ~Y_11 + Y_12, svydesign = sample_A_svy_cal, 
    method_selection = "logit", control_selection = controlSel(nfolds = 5, 
        est_method_sel = "gee", h = 1), control_inference = controlInf(vars_selection = TRUE), 
    verbose = T)

-------------------------
Estimated population mean: 2.1336.625 with overall std.err of: 0.2455
And std.err for nonprobability and probability samples being respectively:
0.07868 and 0.4378

95% Confidence inverval for popualtion mean:
     lower_bound upper_bound
Y_11    1.651872    2.614140
Y_12    5.766795    7.482973

Based on: Inverse probability weighted method
For a population of estimate size: 
Obtained on a nonprobability sample of size: 2267
With an auxiliary probability sample of size: 510
-------------------------

Regression coefficients:
-----------------------
For glm regression on selection variable:
     Estimate Std. Error z value P(>|z|)    
[1,] -1.98191    0.01218 -162.67  <2e-16 ***
[2,]  1.03636    0.01191   87.01  <2e-16 ***
[3,]  1.13838    0.01312   86.75  <2e-16 ***
[4,]  0.98006    0.01214   80.76  <2e-16 ***
[5,]  1.01880    0.01160   87.83  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
-------------------------

Weights:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.003   1.320   1.932   4.411   3.467 209.363

It seems that this is the problem with h=1 and h=2.

LukaszChrostowski commented 12 months ago

DONE

ncn-foreigners / nonprobsvy

IPW and variable selection #15