ipums / ipumsr

Request, download, and read IPUMS data in R
https://tech.popdata.org/ipumsr/
Mozilla Public License 2.0

survey weights (created May 14, 2019 by @gergness on mnpopcenter/ipumsr) #14

Closed: dtburk closed 4 months ago

dtburk commented 2 years ago

May 14, 2019 @gergness:

When I was first writing ipumsr I did some work translating the Stata code on the static pages of ipums.org to explain how to use survey weight variables in R. It's always been on my to-do list to help the projects update those pages, but I never got around to it.

Yesterday, two IPUMS users on Twitter were talking about this: https://twitter.com/surlyurbanist/status/1127968834902605825

To make sure it doesn't get lost, here's the translation of CPS, USA & NHIS user notes on weights for R.


CPS - Replicate Weights

Adapted from https://cps.ipums.org/cps/repwt.shtml

IS THERE ANY WAY TO DO THIS AUTOMATICALLY IN MAJOR STATISTICAL PACKAGES?

In R, the survey package (and the srvyr package, which is built on top of it) sets up an object that holds the survey weighting information for you.

R (survey package)

# If not installed already: install.packages("survey")
library(survey)
# scale = 4/160 matches the 160 replicate weights listed in rscales
svy <- svrepdesign(data = data, weights = ~WTSUPP, repweights = "REPWTP[0-9]+", type = "JK1", scale = 4/160, rscales = rep(1, 160), mse = TRUE)

R (srvyr package)

# If not installed already: install.packages("srvyr")
library(srvyr)
# scale = 4/160 matches the 160 replicate weights listed in rscales
svy <- as_survey(data, weights = WTSUPP, repweights = matches("REPWTP[0-9]+"), type = "JK1", scale = 4/160, rscales = rep(1, 160), mse = TRUE)

After setting up the svy object, we can use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:

R (survey package)

svymean(~VAR1, svy)

R (srvyr package)

svy %>% 
  summarize(mn = survey_mean(VAR1))
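
To make the scale and rscales arguments less opaque, here is a rough base-R sketch of the calculation a replicate-weight design like the one above is set up to perform: one estimate per replicate weight, squared deviations from the full-sample estimate (that is what mse = TRUE asks for), summed and multiplied by the scale factor. The data are simulated stand-ins for a CPS extract, and the 4/160 factor assumes 160 replicate weights. Treat it as an illustration of the idea, not as the survey package's own code.

```r
# Simulated stand-in for a CPS extract. VAR1, WTSUPP and REPWTP1..REPWTP160
# are filled with made-up numbers purely for illustration.
set.seed(1)
n <- 200
n_rep <- 160
data <- data.frame(
  VAR1 = rnorm(n, mean = 50, sd = 10),
  WTSUPP = runif(n, 500, 1500)
)
rep_wts <- sapply(seq_len(n_rep), function(r) data$WTSUPP * runif(n, 0.5, 1.5))
colnames(rep_wts) <- paste0("REPWTP", seq_len(n_rep))
data <- cbind(data, rep_wts)

# Full-sample estimate uses the main weight
est <- weighted.mean(data$VAR1, data$WTSUPP)

# One estimate per replicate weight column
rep_cols <- grep("^REPWTP[0-9]+$", names(data), value = TRUE)
rep_est <- vapply(
  rep_cols,
  function(col) weighted.mean(data$VAR1, data[[col]]),
  numeric(1)
)

# Deviations are taken around the full-sample estimate (mse = TRUE),
# every replicate gets rscale 1, and the sum is multiplied by the scale
variance <- (4 / n_rep) * sum((rep_est - est)^2)
se <- sqrt(variance)
```

With a real extract, se should be close to the standard error reported by svymean() for the same variable, assuming the design was declared with the same scale.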

When subsetting, subset the svy object rather than the raw data, so that the replicate weights are subset along with everything else. For example, to restrict the analysis to persons aged 25-64, we would run:

R (survey package)

svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)

R (srvyr package)

svy %>% 
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))

USA - Replicate weights

Adapted from: https://usa.ipums.org/usa/repwt.shtml

IS THERE ANY WAY TO DO THIS AUTOMATICALLY IN MAJOR STATISTICAL PACKAGES?

In R, the survey package (and the srvyr package, which is built on top of it) sets up an object that holds the survey weighting information for you.

R (survey package)

# If not installed already: install.packages("survey")
library(survey)
svy <- svrepdesign(data = data, weights = ~PERWT, repweights = "REPWTP[0-9]+", type = "Fay", rho = 0.5, mse = TRUE)

R (srvyr package)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, weights = PERWT, repweights = matches("REPWTP[0-9]+"), type = "Fay", rho = 0.5, mse = TRUE)

After setting up the svy object, we can use it to perform weighted calculations. For example, to calculate the mean of a variable named VAR1:
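
A note on why type = "Fay" with rho = 0.5 appears here: as far as I understand it, Fay's method scales the sum of squared replicate deviations by 1 / (R * (1 - rho)^2), where R is the number of replicates. With rho = 0.5 that works out to 4 / R, the same multiplier as the successive difference replication formula. The short check below assumes R = 80 replicate weights (REPWTP1 to REPWTP80); the equality itself does not depend on R.

```r
# Number of replicate weights (assumed: REPWTP1..REPWTP80) and Fay's rho
n_rep <- 80
rho <- 0.5

# Scale factor implied by Fay's method
fay_scale <- 1 / (n_rep * (1 - rho)^2)

# Scale factor in the successive difference replication formula
sdr_scale <- 4 / n_rep

# The two factors agree, so both declarations give the same variance
scales_match <- isTRUE(all.equal(fay_scale, sdr_scale))
```

Under that assumption, an equivalent declaration would be type = "JK1" with scale = 4/80 and rscales = rep(1, 80), in the same style as the CPS example.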

R (survey package)

svymean(~VAR1, svy)

R (srvyr package)

svy %>% 
  summarize(mn = survey_mean(VAR1))

When subsetting, subset the svy object rather than the raw data, so that the replicate weights are subset along with everything else. For example, to restrict the analysis to persons aged 25-64, we would run:

R (survey package)

svy_subset <- subset(svy, AGE >= 25 & AGE < 65)
svymean(~VAR1, svy_subset)

R (srvyr package)

svy %>% 
  filter(AGE >= 25 & AGE < 65) %>%
  summarize(mn = survey_mean(VAR1))

IPUMS NHIS

Adapted from https://nhis.ipums.org/nhis/userNotes_variance.shtml

General Syntax to Account for Sample Design

The following general syntax allows users to account for sampling weights and design variables when using Stata, SAS, SAS-callable SUDAAN, or R (through the survey or srvyr package) to estimate, for example, means using IPUMS NHIS data.

...

R (survey)

# If not installed already: install.packages("survey")
library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)

svymean(~VAR1, svy)

R (srvyr)

# If not installed already: install.packages("srvyr")
library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)

svy %>% 
  summarize(mn = survey_mean(VAR1))

Subsetting IPUMS NHIS Data

...

R (survey)

library(survey)
svy <- svydesign(data = data, ids = ~PSU, strata = ~STRATA, weights = ~PERWEIGHT, nest = TRUE)

svy_subset <- subset(svy, AGE >= 65)
svymean(~VAR1, svy_subset)

R (srvyr)

library(srvyr)
svy <- as_survey(data, ids = PSU, strata = STRATA, weights = PERWEIGHT, nest = TRUE)

svy %>% 
  filter(AGE >= 65) %>%
  summarize(mn = survey_mean(VAR1))
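
To illustrate why the subset is applied to the design object rather than to the raw data, here is a rough base-R sketch of a linearized (with-replacement) standard error for a weighted domain mean. Everything in it is hypothetical: the toy data, the helper domain_mean_se(), and the simplified formula are only meant to show the idea, not to reproduce the survey package's internals. Rows outside the domain keep their place in the strata/PSU structure with zero weight; deleting them first can leave a stratum with a single PSU.

```r
# Toy data: 2 strata with 2 PSUs each; all values are made up
data <- data.frame(
  STRATA    = c(1, 1, 1, 1, 2, 2, 2, 2),
  PSU       = c(1, 1, 2, 2, 1, 1, 2, 2),
  PERWEIGHT = c(10, 12, 9, 11, 8, 10, 12, 9),
  AGE       = c(70, 40, 30, 45, 66, 72, 68, 50),
  VAR1      = c(3, 5, 4, 6, 2, 7, 5, 4)
)

# Hypothetical helper: linearized SE of a weighted mean within a domain
domain_mean_se <- function(d, in_domain) {
  w <- d$PERWEIGHT * in_domain            # zero weight outside the domain
  est <- sum(w * d$VAR1) / sum(w)
  z <- w * (d$VAR1 - est) / sum(w)        # linearized contributions
  # PSUs are nested within strata (nest = TRUE), so label them jointly
  psu_id <- interaction(d$STRATA, d$PSU, drop = TRUE)
  psu_tot <- as.vector(tapply(z, psu_id, sum))
  psu_strata <- as.vector(tapply(d$STRATA, psu_id, function(s) s[1]))
  # Between-PSU variation within each stratum
  v <- sum(tapply(psu_tot, psu_strata, function(t) {
    n_h <- length(t)
    n_h / (n_h - 1) * sum((t - mean(t))^2)
  }))
  c(mean = est, se = sqrt(v))
}

keep <- data$AGE >= 65

# Domain handled inside the design: all four PSUs still count
full <- domain_mean_se(data, keep)

# Rows deleted first: stratum 1 is left with one PSU
dropped <- domain_mean_se(data[keep, ], rep(TRUE, sum(keep)))
```

In this toy example the point estimate is the same either way, but the standard error is only defined when the full design is kept; after deleting rows, stratum 1 has a lone PSU and the sketch returns NaN for the SE.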

dtburk commented 2 years ago

Sep 1, 2020 @mhut214:

Hey Greg, thanks for providing this. Any insight into the Fay method used in the IPUMS USA srvyr example? Is this mimicking the successive difference method? When I attempt the method you have outlined, I get very large standard errors. Any help would be appreciated!

dtburk commented 2 years ago

Sep 3, 2020 @gergness:

Hm, nothing stands out. I'm pretty sure I checked that the results matched between R and Stata when making that, but I no longer have access to Stata. I'd recommend posting on the IPUMS forum or emailing ipums@umn.edu with a small example of something you're trying to calculate and seeing if it matches what they get in another statistical package.

dtburk commented 4 months ago

These pages have now all been updated to include example R code:

IPUMS CPS replicate weights: https://cps.ipums.org/cps/repwt.shtml
IPUMS USA replicate weights: https://usa.ipums.org/usa/repwt.shtml
IPUMS NHIS variance estimation: https://nhis.ipums.org/nhis/userNotes_variance.shtml