kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

name.clean --- 2010 data + changing argument name from voter #11

Closed soodoku closed 7 years ago

soodoku commented 7 years ago

Dear All,

A few minor suggestions:

See here for 2010: http://www.census.gov/topics/population/genealogy/data/2010_surnames.html

Say there is a last name 'canes wrone', and we match 'canes' and then 'wrone'. Potentially you can have two matches in the census data, so it may be a good idea to produce multiple left_joins.
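One way to picture the multiple-join idea is the sketch below. It is only an illustration in base R, not the package's implementation: the `census` table, its `p_whi` values, and the `part1`/`part2` column names are made up here, and `merge(..., all.x = TRUE)` stands in for a left join.

```r
# Toy surname table; the probabilities are invented for illustration only.
census <- data.frame(
  surname = c("CANES", "WRONE", "SMITH"),
  p_whi = c(0.70, 0.90, 0.71),
  stringsAsFactors = FALSE
)
voters.toy <- data.frame(
  id = 1:3,
  surname = c("canes wrone", "smith", "doe"),
  stringsAsFactors = FALSE
)

# Split each surname on spaces or hyphens and keep the first two parts.
parts <- strsplit(toupper(voters.toy$surname), "[ -]+")
voters.toy$part1 <- vapply(parts, function(p) p[1], character(1))
voters.toy$part2 <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)

# First left join: match the first part of the surname.
m1 <- merge(voters.toy, census, by.x = "part1", by.y = "surname", all.x = TRUE)

# Second left join: match the second part; suffixes keep both sets of columns.
m2 <- merge(m1, census, by.x = "part2", by.y = "surname", all.x = TRUE,
            suffixes = c(".1", ".2"))
m2 <- m2[order(m2$id), ]

m2[, c("id", "surname", "p_whi.1", "p_whi.2")]
```

Rows with no match in either join keep `NA` in the corresponding column, so how to combine two matches (or fall back when there are none) remains a separate decision.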

kosukeimai commented 7 years ago

@HJ08003 Can you check out the update posted in the update-surname-handling branch? In particular, if you can do some testing to make sure that there are no bugs, that would be great. If you find any issues, please post them here. Thanks.

kkprinceton commented 7 years ago

Thanks for the suggestions. I'm working on incorporating them in this new branch: https://github.com/kosukeimai/wru/tree/update-surname-handling.

HJ08003 commented 7 years ago

Hi Kabir and Kosuke,

I ran the tests; here are my findings:

We need to decide on the solution for race.pred. After that, we can update all the documentation accordingly.

Thanks,

Hubert

soodoku commented 7 years ago

@HJ08003 you may want to edit to remove the census API key. @kosukeimai caught me doing that once for a pull request also

kosukeimai commented 7 years ago

Yes, @HJ08003, please do not post the census key in a public place like this. Also, please use Markdown formatting so that your comments are easier to read.

HJ08003 commented 7 years ago

Hi Kabir,

When surname.only is set to TRUE, the following is done:

# Surname-Only Predictions
if (surname.only == TRUE) {
  for (k in 1:length(eth)) {
    voter.file[paste("pred", eth[k], sep = ".")] <- voter.file[paste("p", eth[k], sep = "_")]
  }
  pred <- paste("pred", eth, sep = ".")
  return(voter.file[c(vars.orig, pred)])
}

Now, can this be done only for the portion of the data that does not have a corresponding census object?
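A minimal sketch of that row-wise fallback, under stated assumptions: the toy `voter.file`, the `p_whi`/`p_bla` values, and the empty per-state entries in `census.data` are all hypothetical, and the geolocation-based step that would fill the covered rows is not shown.

```r
# Toy voter file with surname-only probabilities already attached.
voter.file <- data.frame(
  surname = c("Khanna", "Imai", "Jin"),
  state = c("NY", "NJ", "DC"),
  p_whi = c(0.05, 0.02, 0.03),
  p_bla = c(0.01, 0.01, 0.02),
  stringsAsFactors = FALSE
)

# Stand-in census object: a list named by state, here missing NJ.
census.data <- list(NY = list(), DC = list())

eth <- c("whi", "bla")

# TRUE for rows whose state has an entry in the census object.
covered <- voter.file$state %in% names(census.data)

for (k in seq_along(eth)) {
  p.col <- paste("p", eth[k], sep = "_")
  pred.col <- paste("pred", eth[k], sep = ".")
  # Leave covered rows empty for the geolocation step (not shown here).
  voter.file[[pred.col]] <- NA_real_
  # Fall back to surname-only probabilities where census data is missing.
  voter.file[!covered, pred.col] <- voter.file[!covered, p.col]
}

voter.file
```

Here only the NJ row receives surname-only predictions; the NY and DC rows stay `NA` until the census-based prediction is applied to them.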

Alternatively, we could have the following logic built-in for parameters:

(1) All the default settings will guarantee that the code runs for race.pred(voters).
(2) If the census information is incomplete or conflicting, the code will issue a warning and exit (do nothing). The census information is complete if (a) a valid census key is provided, or (b) a census object is provided that covers all the states involved in the data (voters).
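The completeness rule could be checked roughly as follows. This is a hedged sketch, not code from the branch: `check_census_inputs` is a hypothetical helper, it uses `NULL` rather than `NA` as the "not provided" value, it assumes the census object is a list named by state, and it does not verify that a supplied key is actually valid.

```r
# Hypothetical helper, not part of wru: TRUE when the census inputs are
# "complete" in the sense of rules (a) and (b).
check_census_inputs <- function(voter.file, census.key = NULL, census.data = NULL) {
  # Rule (a): a key was supplied. Confirming that it is valid would need a
  # call to the Census API, which is not shown here.
  if (!is.null(census.key)) {
    return(TRUE)
  }
  # Rule (b): a census object was supplied and covers every state in the data.
  needed <- unique(as.character(voter.file$state))
  missing.states <- setdiff(needed, names(census.data))
  if (length(missing.states) > 0) {
    warning("Census data missing for: ", paste(missing.states, collapse = ", "))
    return(FALSE)
  }
  TRUE
}

voters.toy <- data.frame(state = c("NY", "NJ", "DC"), stringsAsFactors = FALSE)
partial <- list(NY = list(), DC = list())
full <- list(NY = list(), NJ = list(), DC = list())

check_census_inputs(voters.toy, census.data = full)
```

A caller could stop early when this returns `FALSE`, which matches the "issue a warning and exit" behavior proposed in (2).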

At this moment, the input parameters of race.pred are: function(voter.file, races = c("white", "black", "latino", "asian", "other"), census.surname = TRUE, surname.only = FALSE, census.geo, census.key, demo = FALSE, census.data = NA, party)

Could you suggest the default values for

census.geo (be county?)
census.key (be NULL?)
party (?)

Thanks,

-Hubert

From: Hubert Jin. Date: Tuesday, February 14, 2017 at 3:24 PM. To: Kabir Khanna. Subject: Re: [kosukeimai/wru] name.clean --- 2010 data + changing argument name from voter (#11)

Hi Kabir and Kosuke,

I ran the tests; here are my findings:

(0) The code works when the parameters are provided correctly, based on the testing I did. The script is included below.
(1) Some documentation needs to be updated accordingly: README.md, and the documentation and examples for the external functions.
(2) In race.pred, when the census key is not provided, the code issues a warning and sets surname.only <- T. It does this even when a census object is provided and its contents cover the states involved in the data, so the results will differ. I think this decision (issue a warning and set surname.only <- T) should be postponed to a later stage, when it is clear that the census object does not cover a particular state, and it should be made case by case. For example, the test sample data has DC/NJ/NY. If a census object covers only NY/DC, surname.only <- T should be applied only to the NJ portion of the data.

We need to decide on a solution for (2). After that, we can update everything in (1) accordingly.

Thanks,

-Hubert

k <- "your key"

y <- getCensusData(k, state = c("NY", "DC"), demo = FALSE)
yy <- getCensusData(k, state = c("DE", "NJ"), demo = FALSE)

yyy <- y
yyy[["NJ"]] <- yy[["NJ"]]

yyyy <- getCensusData(k, state = "FL", demo = FALSE)
yyyy[["NJ"]] <- yy[["NJ"]]
yyyy[["NY"]] <- y[["NY"]]
yyyy[["DC"]] <- y[["DC"]]

data(voters)
as.character(unique(voters$state))

names(y)
names(yy)
names(yyy)
names(yyyy)

# Testing the tract-related code (x4 vs. x5 shows the effect of not providing the census key when it is not needed)

x0 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, party = "PID")
x1 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = y, party = "PID")
x2 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = yy, party = "PID")
x3 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = yyy, party = "PID")
x4 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = yyyy, party = "PID")
x5 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.data = yyyy, party = "PID")

# Testing the block-related code (x4 vs. x5 shows the effect of not providing the census key when it is not needed)

x0 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, party = "PID")
x1 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = y, party = "PID")
x2 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = yy, party = "PID")
x3 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = yyy, party = "PID")
x4 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = yyyy, party = "PID")
x5 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.data = yyyy, party = "PID")

# The following compares each result in a group above against x0. An output of (0, 165, 170, 0) means the results agree both in the predicted probabilities and in which entries are NA.

x <- x1
sum(x != x0, na.rm = TRUE)
sum(x == x0, na.rm = TRUE)
sum(is.na(x) == is.na(x0), na.rm = TRUE)
sum(is.na(x) != is.na(x0), na.rm = TRUE)

x <- x2
sum(x != x0, na.rm = TRUE)
sum(x == x0, na.rm = TRUE)
sum(is.na(x) == is.na(x0), na.rm = TRUE)
sum(is.na(x) != is.na(x0), na.rm = TRUE)

x <- x3
sum(x != x0, na.rm = TRUE)
sum(x == x0, na.rm = TRUE)
sum(is.na(x) == is.na(x0), na.rm = TRUE)
sum(is.na(x) != is.na(x0), na.rm = TRUE)

x <- x4
sum(x != x0, na.rm = TRUE)
sum(x == x0, na.rm = TRUE)
sum(is.na(x) == is.na(x0), na.rm = TRUE)
sum(is.na(x) != is.na(x0), na.rm = TRUE)

x <- x5
sum(x != x0, na.rm = TRUE)
sum(x == x0, na.rm = TRUE)
sum(is.na(x) == is.na(x0), na.rm = TRUE)
sum(is.na(x) != is.na(x0), na.rm = TRUE)
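The four counts in the comparison above can be wrapped in a small helper so that each result is checked with one call. This is a sketch only: `compare_preds` is a hypothetical name, and the two toy data frames stand in for real `race.pred` output, which needs a census key to produce.

```r
# Hypothetical helper: the same four counts as in the script above, returned
# as a named vector (cells that differ, cells that match, NA pattern matches,
# NA pattern mismatches).
compare_preds <- function(x, x0) {
  c(
    differ = sum(x != x0, na.rm = TRUE),
    same = sum(x == x0, na.rm = TRUE),
    na.same = sum(is.na(x) == is.na(x0)),
    na.differ = sum(is.na(x) != is.na(x0))
  )
}

# Toy stand-ins for two prediction outputs with identical shape.
a <- data.frame(surname = c("A", "B"), pred.whi = c(0.5, NA),
                stringsAsFactors = FALSE)
b <- data.frame(surname = c("A", "B"), pred.whi = c(0.5, 0.2),
                stringsAsFactors = FALSE)

res <- compare_preds(a, b)
res
```

With real output, something like `lapply(list(x1, x2, x3, x4, x5), compare_preds, x0 = x0)` would produce all five comparisons at once, assuming every result has the same dimensions as `x0`.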

From: kkprinceton. Date: Monday, February 13, 2017 at 2:40 AM. To: kosukeimai/wru. Subject: Re: [kosukeimai/wru] name.clean --- 2010 data + changing argument name from voter (#11)

Thanks for the suggestions. I'm working on incorporating them in this new branch: https://github.com/kosukeimai/wru/tree/update-surname-handling.


HJ08003 commented 7 years ago

Hello Kabir and Kosuke,

I am in the process of revising the handling of input parameters for the update-surname-handling branch. Once it is done, should I submit the changes directly to the update-surname-handling branch? Thanks,

-Hubert


kosukeimai commented 7 years ago

Create a new branch off of it and then make a pull request.

kkprinceton commented 7 years ago

We have incorporated @soodoku's suggestions. The 2010 surname list is now the default, but there is an option to use the 2000 list. There is a new clean.surnames option (default = TRUE) that cleans surnames before matching them to the list. And the old voters argument is now just called voter.file. Additionally, we have tried to improve error handling and simplify print messages.
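To give a sense of what cleaning surnames before matching can involve, here is a rough sketch. It is not the package's actual cleaning routine: `clean_surname_sketch` is a hypothetical function, and the specific steps (uppercasing, dropping punctuation, removing a few generational suffixes) are assumptions chosen for illustration.

```r
# Hypothetical cleaning function, for illustration only.
clean_surname_sketch <- function(x) {
  x <- toupper(trimws(x))                  # normalize case and outer whitespace
  x <- gsub("[^A-Z -]", "", x)             # drop punctuation and digits
  x <- gsub("\\s+", " ", x)                # collapse repeated whitespace
  x <- sub(" (JR|SR|II|III|IV)$", "", x)   # strip a trailing suffix, if any
  x
}

cleaned <- clean_surname_sketch(c(" O'Brien ", "smith jr.", "Canes-Wrone"))
cleaned
```

Hyphenated names are left intact here, so a compound surname would still need the split-and-match handling discussed earlier in the thread.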