ozanj / rclass

adding socioeconomic/demographic measures to western washington prospect list data #9

Open ozanj opened 6 years ago

ozanj commented 6 years ago

@cyouh95

for the prospect list from western washington university, the measures of median income and race are at the zip-code level, correct?

how hard would it be to add city-level and state-level measures of?:

this would be used for a problem set I have to create by Friday

cyouh95 commented 6 years ago

@ozanj yup, they are zip-code data! State-level shouldn't be hard to get. And it looks like city-level should be possible as well, if 312 here is the correct geographic unit. Just hope merging in the city data won't be too big an issue, if that needs to be done by city name/state?
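One way to gauge the merge concern before committing to city-level data is to build a normalized city/state key and check the match rate. This is only a sketch in Python (the repo scripts are in R); the field names `city` and `state`, the `city_key` helper, and all rows and values below are made up for illustration.

```python
def city_key(city, state):
    """Build a normalized (city, state) merge key; None if either part is missing."""
    if not city or not state:
        return None
    # Collapse repeated whitespace and ignore case in the city name
    return (" ".join(city.split()).casefold(), state.strip().upper())


# Hypothetical city-level table; the income values are placeholders, not real data
city_data = {
    city_key("Bellingham", "WA"): {"median_income": 1.0},
    city_key("Seattle", "WA"): {"median_income": 2.0},
}

# Hypothetical prospect rows
prospects = [
    {"city": "bellingham ", "state": "wa"},
    {"city": "SEATTLE", "state": "WA"},
    {"city": "Seatle", "state": "WA"},  # misspelling, will not match
    {"city": "", "state": "WA"},        # missing city, cannot be keyed
]

matched = [p for p in prospects if city_key(p["city"], p["state"]) in city_data]
match_rate = len(matched) / len(prospects)
print(f"matched {len(matched)} of {len(prospects)} rows ({match_rate:.0%})")
```

A low match rate on the real data would be a sign that the quality of a name-based merge is too poor to be worth it.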

ozanj commented 6 years ago

@cyouh95 @mpatricia01

Thanks Crystal, can you add the state-level variables? Let me know when that is done and we'll start working those vars into the problem set.

after that is done, try adding city vars but if it looks like it will take a long time or if it looks like quality of merge will be bad then don't add them.

thank you!

cyouh95 commented 6 years ago

@ozanj @mpatricia01 Here is the CSV w/ the state-level variables: 45f9ebf

cyouh95 commented 6 years ago

@ozanj @mpatricia01 Ah, just realized there is a STAT_CODE, ZIP, HS_STATE, and HS_CITY in the original data. The home state (STAT_CODE) and HS_STATE might be different (But I think mostly because many rows are missing the HS_STATE/HS_CITY fields).

Currently, state-level data is merged to STAT_CODE. Zip-code-level data was merged to the (home) ZIP as well, but I guess city-level data would have to be merged to the HS's city?
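For reference, the merge keys described above could look roughly like this. It is a sketch in Python rather than the R used in `create_prospect_list.R`, and it is not the actual merge code. Only the column names STAT_CODE, ZIP, and HS_STATE come from the data; the `left_join` helper, the variable names `state_med_inc` and `zip_med_inc`, and every value are hypothetical.

```python
# Hypothetical lookup tables; values are placeholders, not real census figures
state_vars = {"WA": {"state_med_inc": 1.0}, "OR": {"state_med_inc": 2.0}}
zip_vars = {"00001": {"zip_med_inc": 3.0}}

# Hypothetical prospect rows
prospects = [
    {"STAT_CODE": "WA", "ZIP": "00001", "HS_STATE": "WA"},
    {"STAT_CODE": "OR", "ZIP": "00002", "HS_STATE": None},
]


def left_join(rows, lookup, key):
    """Left join: keep every row, add lookup columns (None when the key has no match)."""
    columns = sorted({col for rec in lookup.values() for col in rec})
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{col: match.get(col) for col in columns}})
    return joined


# State-level data joins on home state, zip-level data on home zip
merged = left_join(left_join(prospects, state_vars, "STAT_CODE"), zip_vars, "ZIP")
for row in merged:
    print(row)
```

Keeping both joins as left joins means no prospect is dropped when a key has no match; the added columns are simply left empty.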

ozanj commented 6 years ago

@cyouh95 @mpatricia01

Prior to seeing note you just sent, I just pulled changes to wwlist_merged_state.csv

then I modified create_prospect_list.R and modified wwlist_merged.Rdata and pushed those changes to github.

why don't you check out whether stuff I did still works in light of note you just sent.

ozanj commented 6 years ago

@cyouh95 @mpatricia01

Separate request: can you provide Patricia and me with information about how the prospect list defines race/ethnicity [var=ethn_code] and how ACS data defines race/ethnicity?

Patricia will take the lead in developing a draft of the problem set, and one set of questions will compare the race/ethnicity of prospects purchased [at the zip-code level or state level] to the overall race/ethnicity composition at the zip-code level or state level.

a potential concern is that prospect list definition of race/ethnicity may differ from ACS definition, so would be helpful for Patricia to see the definitions so that she can make decisions about what is possible to ask students to do.
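As a rough illustration of the comparison the problem set might ask for, here is a sketch in Python (not the R used in class). It assumes the prospect categories have already been aligned with the ACS categories, which is exactly the open question above. The `shares` helper and all counts are made up.

```python
from collections import Counter


def shares(counts):
    """Convert category counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Hypothetical prospects in one state, one (already aligned) category per prospect
prospect_groups = ["White", "White", "Black or African American", "White"]

# Placeholder population counts for the same state, not real ACS figures
acs_counts = {"White": 60, "Black or African American": 40}

prospect_share = shares(Counter(prospect_groups))
acs_share = shares(acs_counts)

# Positive values mean the group is over-represented among purchased prospects
diff = {g: prospect_share.get(g, 0.0) - acs_share[g] for g in acs_share}
for group, gap in diff.items():
    print(f"{group}: {gap:+.2f}")
```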

cyouh95 commented 6 years ago

@ozanj @mpatricia01 Everything should still work fine - I think we do want the state-level census data merged to STAT_CODE instead of HS_STATE, right? (since the latter field is missing for more obs) But I don't think there are many cases where the states would differ!

But if we can get city-level data, then it'd have to be merged to HS_CITY, whereas zip-code (and state-level) are to home location, if that's okay?
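A quick check along these lines could confirm how often HS_STATE is missing and how often it differs from STAT_CODE. This is a sketch in Python rather than R; the rows and the `is_missing` helper are invented, and it assumes missing values show up as empty strings or nulls.

```python
# Hypothetical rows; only the column names come from the prospect list
rows = [
    {"STAT_CODE": "WA", "HS_STATE": "WA"},
    {"STAT_CODE": "WA", "HS_STATE": "OR"},
    {"STAT_CODE": "CA", "HS_STATE": None},
    {"STAT_CODE": "WA", "HS_STATE": ""},
]


def is_missing(value):
    """Treat None and blank strings as missing (an assumption about the raw data)."""
    return value is None or not value.strip()


missing_hs = sum(is_missing(r["HS_STATE"]) for r in rows)

# Only compare rows where both fields are present
both = [r for r in rows
        if not is_missing(r["STAT_CODE"]) and not is_missing(r["HS_STATE"])]
differ = sum(r["STAT_CODE"] != r["HS_STATE"] for r in both)

print(f"HS_STATE missing: {missing_hs} of {len(rows)}")
print(f"states differ: {differ} of {len(both)} rows with both fields")
```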

cyouh95 commented 6 years ago

@ozanj @mpatricia01 Here are the definitions according to CollegeBoard [x][x]

There should be 2 questions in the above questionnaire (race and ethnicity), but it looks like they may be combined in the wwlist's ETHN_CODE field:

Cuban
Mexican/Mexican American
Puerto Rican
Other Spanish/Hispanic

American Indian or Alaska Native
Asian or Native Hawaiian or Other Pacific Islander
Asian or Native Hawaiian or Other Pacific IslanderH [typo of above - and this might be using the pre-2016 version? but RECEIVE_DATE is in 2016]
Black or African American
White

Other-2 or more
Not reported

And here are the ACS variables/definitions [x][x]
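Since ACS asks about Hispanic origin separately from race, one possible way to make ETHN_CODE comparable is to collapse the Hispanic-origin values into a single group and fold the typo value into the correct label. The sketch below is in Python (not R) and assumes ETHN_CODE stores the labels listed above as text. The `recode_ethn` helper and the "Hispanic (any origin)" group name are my own choices, not CollegeBoard or ACS terms.

```python
# Values of ETHN_CODE that come from the Hispanic-origin question
HISPANIC_ORIGIN = {
    "Cuban",
    "Mexican/Mexican American",
    "Puerto Rican",
    "Other Spanish/Hispanic",
}

# The value with a trailing "H" is treated as a typo of the label without it
TYPO_FIXES = {
    "Asian or Native Hawaiian or Other Pacific IslanderH":
        "Asian or Native Hawaiian or Other Pacific Islander",
}


def recode_ethn(value):
    """Map a raw ETHN_CODE label to a harmonized group (hypothetical scheme)."""
    if value is None or not value.strip():
        return "Not reported"
    label = value.strip()
    label = TYPO_FIXES.get(label, label)
    if label in HISPANIC_ORIGIN:
        return "Hispanic (any origin)"
    return label


for raw in ["Cuban", "Asian or Native Hawaiian or Other Pacific IslanderH", "White", ""]:
    print(repr(raw), "->", recode_ethn(raw))
```

Note that the combined field cannot recover the race of Hispanic prospects, so any comparison to ACS race tables would need to account for that.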

ozanj commented 6 years ago

sounds good.

yes, for merging state-level data to prospect-level data, it seems conceptually best to merge to state_code rather than hs_state because conceptually I think we may be more focused on where student lives rather than state of HS attended. this is holding aside the data completeness issue.

In examples in class, I've generally been using state_code rather than HS state.

let's hold off on adding any city_level variables. I think zip_code level analyses and state_level analyses are sufficient. adding more would make problem set too long.

thanks for all this Crystal!