Open ylelkes opened 1 year ago
The recodings of the education to factors were incorrect. It should have been that for this partition,
Codes ending in: associated education level
_006, _011: Bachelor's degree or higher
_004, _009: High school graduate (includes equivalency)
_005, _010: Some college or associate's degree
Below is a demo without the get_acs_cces wrapper.
I will try to fix it soon in dev, but I might need to be creative, since the way education is grouped here differs from the other partition, acscodes_age_sex_educ, that we were using in the paper. Apologies.
Variables
library(ccesMRPprep)
library(tidycensus)
library(dplyr)
acs5_vars <- load_variables("acs5", year = 2021)
acs5_vars |>
  filter(name %in% acscodes_sex_educ_race)
#> # A tibble: 48 × 4
#> name label concept geography
#> <chr> <chr> <chr> <chr>
#> 1 C15002B_004 Estimate!!Total:!!Male:!!High school graduate … SEX BY… tract
#> 2 C15002B_005 Estimate!!Total:!!Male:!!Some college or assoc… SEX BY… tract
#> 3 C15002B_006 Estimate!!Total:!!Male:!!Bachelor's degree or … SEX BY… tract
#> 4 C15002B_009 Estimate!!Total:!!Female:!!High school graduat… SEX BY… tract
#> 5 C15002B_010 Estimate!!Total:!!Female:!!Some college or ass… SEX BY… tract
#> 6 C15002B_011 Estimate!!Total:!!Female:!!Bachelor's degree o… SEX BY… tract
#> 7 C15002C_004 Estimate!!Total:!!Male:!!High school graduate … SEX BY… tract
#> 8 C15002C_005 Estimate!!Total:!!Male:!!Some college or assoc… SEX BY… tract
#> 9 C15002C_006 Estimate!!Total:!!Male:!!Bachelor's degree or … SEX BY… tract
#> 10 C15002C_009 Estimate!!Total:!!Female:!!High school graduat… SEX BY… tract
#> # ℹ 38 more rows
Created on 2023-05-08 with reprex v2.0.2
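The code-to-level mapping listed above can be written as a small lookup. This is only a sketch in base R: `suffix_to_educ` and `educ_from_acscode` are hypothetical names, not part of ccesMRPprep, and the labels are copied from the list in this comment.

```r
# Hypothetical lookup: digits after the underscore -> education level.
# Labels are taken from the mapping listed in the comment above.
suffix_to_educ <- c(
  "004" = "High school graduate (includes equivalency)",
  "009" = "High school graduate (includes equivalency)",
  "005" = "Some college or associate's degree",
  "010" = "Some college or associate's degree",
  "006" = "Bachelor's degree or higher",
  "011" = "Bachelor's degree or higher"
)

# Illustrative helper (not a package function)
educ_from_acscode <- function(acscode) {
  suffix <- sub(".*_", "", acscode)   # keep the part after the underscore
  unname(suffix_to_educ[suffix])      # NA for suffixes not in the lookup
}

educ_from_acscode(c("C15002B_004", "C15002B_011", "C15002C_010"))
```

Codes outside the six suffixes (for example the `_001` totals) come back as `NA`, which makes missing levels easy to spot.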
Thank you!
From: Shiro Kuriwaki. Date: Monday, May 8, 2023 at 10:33 PM. To: kuriwaki/ccesMRPprep. Cc: ylelkes. Subject: Re: [kuriwaki/ccesMRPprep] missing education codes (Issue #12)
In case it's helpful, here is what I did, combining your code with some of my own:
library(ccesMRPprep)
library(tidycensus)
library(tidyverse)
library(glue)
# categories in ACS
ages <- c("18 to 24 years",
          "25 to 34 years",
          "35 to 44 years",
          "45 to 64 years",
          "65 years and over",
          "18 and 19 years",
          "20 to 24 years",
          "25 to 29 years",
          "30 to 34 years",
          "35 to 44 years",
          "45 to 54 years",
          "55 to 64 years",
          "65 to 74 years",
          "75 to 84 years",
          "85 years and over")
education <- c("Less than high school diploma",
               "High school graduate \\(includes equivalency\\)",
               "Some college or associate's degree",
               "Bachelor's degree or higher")
races <- c("White alone, not Hispanic or Latino",
           "Hispanic or Latino",
           "Black or African American alone",
           "American Indian and Alaska Native alone",
           "Asian alone",
           "Native Hawaiian and Other Pacific Islander alone",
           "Some other race alone",
           "Two or more races" #,
           # "Two or more races!!Two races including Some other race",
           # "Two or more races!!Two races excluding Some other race, and three or more races"
)
# collapse each set of labels into a single alternation regex
ages_regex <- as.character(glue("({str_c(ages, collapse = '|')})"))
edu_regex <- as.character(glue("({str_c(education, collapse = '|')})"))
races_regex <- as.character(glue("({str_c(races, collapse = '|')})"))
# get vars ----
acs5_vars <- load_variables("acs5", year = 2021)
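vars <- acs5_vars |>
  # also pull the "Less than high school diploma" rows: _003 (male) and _008 (female)
  filter(name %in% c(acscodes_sex_educ_race,
                     paste0("C15002", rep(LETTERS[2:9], each = 2), c("_003", "_008"))))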
# format these and recode
# to strings ----
vars <- vars %>%
  mutate(variable = name) %>%
  separate(name, sep = "_", into = c("table", "num")) %>%
  select(variable, table, concept, num, label, everything()) %>%
  filter(str_detect(label, "Total")) %>%
  mutate(label = str_remove(label, "Estimate!!Total")) %>%
  mutate(gender = str_extract(label, "(Male|Female)"),
         age = str_extract(label, ages_regex),
         educ = str_extract(label, edu_regex),
         race = coalesce(str_extract(label, regex(races_regex, ignore_case = TRUE)),
                         str_extract(concept, regex(races_regex, ignore_case = TRUE))))
acs_tab <- tidycensus::get_acs(geography = "congressional district",
                               variables = vars$variable,
                               year = 2021)
out <- acs_tab %>% left_join(vars, by = "variable")
out <- out %>%
  mutate(race_new = case_when(
           race == "WHITE ALONE, NOT HISPANIC OR LATINO" ~ "White",
           race == "BLACK OR AFRICAN AMERICAN ALONE" ~ "Black",
           race == "ASIAN ALONE" ~ "Asian",
           race == "AMERICAN INDIAN AND ALASKA NATIVE ALONE" ~ "Native American",
           race == "HISPANIC OR LATINO" ~ "Hispanic",
           TRUE ~ "All Other"),
         educ_new = case_when(
           educ == "Less than high school diploma" ~ "HS or Less",
           educ == "High school graduate (includes equivalency)" ~ "HS or Less",
           TRUE ~ educ))
out$educ_race <- interaction(out$race_new, out$educ_new)
post_strat <- out %>%
  group_by(GEOID, gender, educ = educ_new, educ_race, race = race_new) %>%
  summarise(n = sum(estimate)) %>%
  group_by(GEOID) %>%
  mutate(cd_total = sum(n), cd_per = n / cd_total) %>%
  drop_na()
# create a vector of state abbreviations and FIPS codes
state_info <- data.frame(
  state_abbrev = c(
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA',
    'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD',
    'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
    'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC',
    'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'DC'
  ),
  state_fips = c(
    '01', '02', '04', '05', '06', '08', '09', '10', '12', '13',
    '15', '16', '17', '18', '19', '20', '21', '22', '23', '24',
    '25', '26', '27', '28', '29', '30', '31', '32', '33', '34',
    '35', '36', '37', '38', '39', '40', '41', '42', '44', '45',
    '46', '47', '48', '49', '50', '51', '53', '54', '55', '56', '11'
  )
)
# create a function to convert a vector of geoids to state abbreviations and district numbers
convert_geoids <- function(geoids) {
  # create an empty vector to store the converted geocodes
  converted <- vector(mode = "character", length = length(geoids))
  for (i in seq_along(geoids)) {
    # extract the state FIPS code and congressional district number from the geoid
    state_fips <- substr(geoids[i], 1, 2)
    district <- substr(geoids[i], 3, 4)
    # convert the state FIPS code to a state abbreviation
    if (state_fips == "11") {
      # Washington DC
      state_abbrev <- "DC"
    } else {
      # find the state abbreviation corresponding to the state FIPS code
      state_index <- which(state_info$state_fips == state_fips)
      state_abbrev <- state_info$state_abbrev[state_index]
    }
    # convert district "00" to "01"
    if (district == "00") {
      district <- "01"
    }
    # convert district "98" to "01"
    if (district == "98") {
      district <- "01"
    }
    # combine the state abbreviation and district number into a single string
    converted[i] <- paste(state_abbrev, district, sep = "-")
  }
  return(converted)
}
# create a function to extract state FIPS codes from converted geocodes
extract_state_fips <- function(converted_geocodes) {
  # create an empty vector to store the state FIPS codes
  state_fips <- vector(mode = "character", length = length(converted_geocodes))
  for (i in seq_along(converted_geocodes)) {
    # split the converted geocode into state abbreviation and district number
    parts <- strsplit(converted_geocodes[i], "-")[[1]]
    state_abbrev <- parts[1]
    if (state_abbrev == "DC") {
      # Washington DC
      state_fips[i] <- "11"
    } else {
      # find the state FIPS code corresponding to the state abbreviation
      state_index <- which(state_info$state_abbrev == state_abbrev)
      if (length(state_index) == 0) {
        # no matching state abbreviation found
        state_fips[i] <- NA
      } else {
        state_fips[i] <- state_info$state_fips[state_index]
      }
    }
  }
  return(state_fips)
}
post_strat$cd <- convert_geoids(post_strat$GEOID)
post_strat$STATEFIP <- extract_state_fips(post_strat$cd)
post_strat$GEOID[is.na(post_strat$STATEFIP)]
post_strat$pct_trump <- cd_info_2020$pct_trump[match(post_strat$cd,cd_info_2020$cd)]
post_strat$trumpvote5050 <- abs(post_strat$pct_trump-.50)
save(post_strat,file = "poststrat_cd.RData")
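For what it's worth, the GEOID conversion above could also be done without a loop. The following is a sketch only: `convert_geoids_vec` and the four-entry `state_lookup` are hypothetical stand-ins for the function and the full `state_info` table above. One deliberate difference, which is my assumption about what is wanted: a FIPS code missing from the lookup gives `NA` rather than a string with an empty state part.

```r
# Small illustrative lookup (FIPS code -> abbreviation); in practice this
# would be built from the full state_info table above.
state_lookup <- c("01" = "AL", "02" = "AK", "11" = "DC", "56" = "WY")

# Hypothetical vectorized variant of convert_geoids()
convert_geoids_vec <- function(geoids, lookup = state_lookup) {
  state_fips <- substr(geoids, 1, 2)
  district <- substr(geoids, 3, 4)
  # districts coded "00" or "98" are relabelled "01", as in the loop version
  district[district %in% c("00", "98")] <- "01"
  state_abbrev <- unname(lookup[state_fips])  # NA when the FIPS code is not in the lookup
  ifelse(is.na(state_abbrev), NA_character_,
         paste(state_abbrev, district, sep = "-"))
}

convert_geoids_vec(c("0101", "0200", "1198", "7298"))
```

Named-vector indexing replaces the per-row `which()` search, so the whole column is converted in one pass.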
I started some edits in https://github.com/kuriwaki/ccesMRPprep/tree/fix_iss12 which I will update and merge.
@ylelkes thanks for the suggested code. It seems that in this case education uses the three-way coding, where BA and post-grad form one level. This means the CES survey data will need to be recoded to match that coding. This goes to what I am contemplating: whether to distinguish between 4-way education and educ_3, which is this three-way coding in the output of get_acs_cces, OR whether to coalesce to one (so the user will need to know how to recode education on the CES side).
The edits in the branch currently do the latter.
I changed plans and will do the former: formally distinguish between educ (4-way) and educ_3 (3-way). This way there is no ambiguity between the two types when obtaining data. I added educ_3 as a standard demographic in ccc_std_demographics().
library(ccesMRPprep)
library(dplyr)
packageVersion("ccesMRPprep")
#> [1] '0.1.11.9999'
acs_tab <- get_acs_cces(
  varlab_df = acscodes_df,
  varlist = acscodes_sex_educ_race,
  year = 2021,
  dataset = "acs5"
)
#> Getting data from the 2017-2021 5-year ACS
# no educ, only educ_3
acs_tab
#> # A tibble: 28,096 × 9
#> acscode year cd gender female educ_3 race count count_moe
#> <chr> <dbl> <chr> <fct> <int> <fct> <fct> <dbl> <dbl>
#> 1 C15002B_003 2021 AL-01 Male 0 HS or Less Black 9804 713
#> 2 C15002B_004 2021 AL-01 Male 0 HS or Less Black 25034 1256
#> 3 C15002B_005 2021 AL-01 Male 0 Some College Black 14180 995
#> 4 C15002B_006 2021 AL-01 Male 0 4-Year or Post-G… Black 6767 804
#> 5 C15002B_008 2021 AL-01 Female 1 HS or Less Black 8304 734
#> 6 C15002B_009 2021 AL-01 Female 1 HS or Less Black 23756 1237
#> 7 C15002B_010 2021 AL-01 Female 1 Some College Black 23514 945
#> 8 C15002B_011 2021 AL-01 Female 1 4-Year or Post-G… Black 13255 840
#> 9 C15002C_003 2021 AL-01 Male 0 HS or Less Nati… 520 200
#> 10 C15002C_004 2021 AL-01 Male 0 HS or Less Nati… 624 168
#> # ℹ 28,086 more rows
count(acs_tab, educ_3)
#> # A tibble: 3 × 2
#> educ_3 n
#> <fct> <int>
#> 1 HS or Less 14048
#> 2 Some College 7024
#> 3 4-Year or Post-Grad 7024
# This one gets you four-way education
get_acs_cces(
  varlab_df = acscodes_df,
  varlist = acscodes_age_sex_educ,
  year = 2021,
  dataset = "acs5"
)
#> Getting data from the 2017-2021 5-year ACS
#> # A tibble: 30,730 × 9
#> acscode year cd gender female educ age count count_moe
#> <chr> <dbl> <chr> <fct> <int> <fct> <fct> <dbl> <dbl>
#> 1 B15001_004 2021 AL-01 Male 0 HS or Less 18 to 24 y… 329 146
#> 2 B15001_005 2021 AL-01 Male 0 HS or Less 18 to 24 y… 4670 591
#> 3 B15001_006 2021 AL-01 Male 0 HS or Less 18 to 24 y… 12135 801
#> 4 B15001_007 2021 AL-01 Male 0 Some College 18 to 24 y… 9791 841
#> 5 B15001_008 2021 AL-01 Male 0 Some College 18 to 24 y… 1507 361
#> 6 B15001_009 2021 AL-01 Male 0 4-Year 18 to 24 y… 1379 324
#> 7 B15001_010 2021 AL-01 Male 0 Post-Grad 18 to 24 y… 65 44
#> 8 B15001_012 2021 AL-01 Male 0 HS or Less 25 to 34 y… 1009 281
#> 9 B15001_013 2021 AL-01 Male 0 HS or Less 25 to 34 y… 4920 742
#> 10 B15001_014 2021 AL-01 Male 0 HS or Less 25 to 34 y… 15597 1116
#> # ℹ 30,720 more rows
# ccc_std_demographics provides BOTH educ and educ_3
ccc_std_demographics(ccc_samp) |>
  count(educ, educ_3)
#> age variable modified to bins. Original age variable is now in age_orig.
#> # A tibble: 4 × 3
#> educ educ_3 n
#> <dbl+lbl> <dbl+lbl> <int>
#> 1 1 [HS or Less] 1 [HS or Less] 293
#> 2 2 [Some College] 2 [Some College] 355
#> 3 3 [4-Year] 3 [4-Year or Post-Grad] 244
#> 4 4 [Post-Grad] 3 [4-Year or Post-Grad] 108
Created on 2023-06-19 with reprex v2.0.2
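For survey data that has not gone through ccc_std_demographics(), the 4-way to 3-way collapse shown in the last table could be sketched by hand as below. This is a minimal base R illustration: `collapse_educ` is a hypothetical helper, and it assumes education is stored as a factor or character vector with the four labels shown above, whereas the package output is a labelled numeric.

```r
# Hypothetical 4-way education values, labelled as in the output above
educ4_levels <- c("HS or Less", "Some College", "4-Year", "Post-Grad")
educ <- factor(c("HS or Less", "4-Year", "Post-Grad", "Some College"),
               levels = educ4_levels)

# Illustrative helper (not a package function): merge 4-Year and Post-Grad
collapse_educ <- function(x) {
  x <- as.character(x)
  x[x %in% c("4-Year", "Post-Grad")] <- "4-Year or Post-Grad"
  factor(x, levels = c("HS or Less", "Some College", "4-Year or Post-Grad"))
}

educ_3 <- collapse_educ(educ)
table(educ, educ_3)  # cross-tab to check the mapping
```

The cross-tab mirrors the `count(educ, educ_3)` check above: each 4-way level should land in exactly one 3-way level.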
Hi there, any idea why acscodes_age_sex_educ only returns rows for "HS or Less" and "4-Year"?
acs_tab <- get_acs_cces(
  varlist = acscodes_sex_educ_race,
  varlab_df = acscodes_df,
  year = 2021,
  dataset = "acs5"
)
thanks!