soodoku closed this issue 7 years ago
@HJ08003 Can you check out the update posted in the update-surname-handling branch? In particular, it would be great if you could do some testing to make sure there are no bugs. If you find any issues, please post them here. Thanks.
Thanks for the suggestions. I'm working on incorporating them in this new branch: https://github.com/kosukeimai/wru/tree/update-surname-handling.
Hi Kabir and Kosuke,
I ran the tests; here are my findings:
We need to decide on a solution for race.pred. After that, we can update all the documentation accordingly.
Thanks,
Hubert
@HJ08003 you may want to edit your comment to remove the census API key. @kosukeimai once caught me doing the same thing in a pull request.
Yes, @HJ08003, please do not post the census key in a public place like this. Also, please use Markdown syntax so that your comments are easier to read.
Hi Kabir,
When surname.only is set to TRUE, the following is done:
if (surname.only == TRUE) {
  for (k in 1:length(eth)) {
    voter.file[paste("pred", eth[k], sep = ".")] <- voter.file[paste("p", eth[k], sep = "_")]
  }
  pred <- paste("pred", eth, sep = ".")
  return(voter.file[c(vars.orig, pred)])
}
Can this be applied only to the portion of the data that has no corresponding census object?
Alternatively, we could build the following logic into the parameter handling:
(1) The default settings guarantee that race.pred(voters) runs.
(2) If the census information is incomplete or conflicting, the code issues a warning and exits without doing anything.
The census information is complete if (a) a valid census key is provided, or (b) a census object is provided that covers all the states in the data (voters).
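The completeness rule above could be sketched roughly as follows. This is only an illustration: `census_info_complete` is a hypothetical helper, not a function in wru, and it assumes the census object is a list named by state abbreviation (as the `names(y)` calls in the test script suggest). It does not verify that the key is actually accepted by the Census API.

```r
# Hypothetical helper (not part of wru): TRUE when the census inputs are
# sufficient for every state that appears in the voter file.
census_info_complete <- function(voter.file, census.key = NULL, census.data = NULL) {
  # (a) A non-empty key is supplied. Whether the Census API accepts it
  #     is NOT checked here.
  has_key <- is.character(census.key) && length(census.key) == 1 &&
    !is.na(census.key) && nzchar(census.key)
  if (has_key) return(TRUE)

  # (b) Otherwise the census object must cover every state in the data.
  if (!is.list(census.data)) return(FALSE)
  states <- unique(toupper(as.character(voter.file$state)))
  all(states %in% names(census.data))
}
```

A caller could use the result to issue a warning and exit early, or to decide which rows need a surname-only fallback.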
At the moment, the input parameters of race.pred are:
function(voter.file,
         races = c("white", "black", "latino", "asian", "other"),
         census.surname = TRUE, surname.only = FALSE,
         census.geo, census.key, demo = FALSE,
         census.data = NA, party)
Could you suggest default values for the following?
census.geo ("county"?)
census.key (NULL?)
party (?)
Thanks,
-Hubert
On Tuesday, February 14, 2017 at 3:24 PM, Hubert Jin wrote:
Hi Kabir and Kosuke,
I ran the tests; here are my findings:
(0) Based on my testing, the code works when the parameters are provided correctly. The script is included below.
(1) Some documentation needs to be updated accordingly: README.Md and the documentation/examples for the external functions.
(2) When the census key is not provided, race.pred issues a warning and sets surname.only <- T. It does so even when a census object is provided and covers every state in the data, so the result differs from the one obtained with a key. I think this decision (issue a warning and set surname.only <- T) should be postponed to a later stage, when it is clear that the census object does not cover a particular state, and it should be made state by state. For example, the test sample data has DC/NJ/NY. If a census object covers only NY/DC, surname.only <- T should be applied only to the NJ portion of the data.
We need to decide on a solution for (2). After that, we can update everything in (1) accordingly.
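One way to picture the state-by-state fallback in (2) is to split the voter file by whether each row's state appears in the census object. The sketch below is only an illustration under assumptions: `split_by_census_coverage` is a hypothetical helper, not part of wru, and the census object is assumed to be a list named by state abbreviation.

```r
# Hypothetical helper (not part of wru): separate rows whose state is covered
# by the census object from rows that would need the surname-only fallback.
split_by_census_coverage <- function(voter.file, census.data) {
  covered <- as.character(voter.file$state) %in% names(census.data)
  list(
    geo          = voter.file[covered, , drop = FALSE],
    surname.only = voter.file[!covered, , drop = FALSE]
  )
}

# Toy data: the census object below is a placeholder covering only NY and DC.
voters_demo <- data.frame(
  surname = c("Khanna", "Imai", "Jin", "Lopez"),
  state   = c("NJ", "NY", "DC", "NJ"),
  stringsAsFactors = FALSE
)
census_demo <- list(NY = list(), DC = list())

parts <- split_by_census_coverage(voters_demo, census_demo)
parts$surname.only$state  # only the NJ rows end up in the fallback group
```

The two parts could then be predicted separately and recombined; how race.pred should do that internally is still the open question here.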
Thanks,
-Hubert
# Replace with your own Census API key; never post a real key publicly.
k <- "your key"
y <- getCensusData(k, state = c("NY", "DC"), demo = FALSE)
yy <- getCensusData(k, state = c("DE", "NJ"), demo = FALSE)
yyy <- y
yyy[["NJ"]] <- yy[["NJ"]]
yyyy <- getCensusData(k, state = "FL", demo = FALSE)
yyyy[["NJ"]] <- yy[["NJ"]]
yyyy[["NY"]] <- y[["NY"]]
yyyy[["DC"]] <- y[["DC"]]
data(voters)
as.character(unique(voters$state))
names(y)
names(yy)
names(yyy)
names(yyyy)
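# Tract-level predictions: x0 uses only the key, x1-x4 add census objects with different state coverage, x5 uses a census object without a key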
x0 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, party = "PID")
x1 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = y, party = "PID")
x2 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = yy, party = "PID")
x3 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = yyy, party = "PID")
x4 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.key = k, census.data = yyyy, party = "PID")
x5 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "tract", census.data = yyyy, party = "PID")
# Block-level predictions for the same six cases (these overwrite the tract-level x0-x5)
x0 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, party = "PID")
x1 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = y, party = "PID")
x2 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = yy, party = "PID")
x3 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = yyy, party = "PID")
x4 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.key = k, census.data = yyyy, party = "PID")
x5 <- race.pred(voter.file = voters, races = c("white", "black", "latino", "asian", "other"), census.geo = "block", census.data = yyyy, party = "PID")
# Compare each result against the key-only baseline x0
for (x in list(x1, x2, x3, x4, x5)) {
  print(sum(x != x0, na.rm = TRUE))                # cells that differ
  print(sum(x == x0, na.rm = TRUE))                # cells that agree
  print(sum(is.na(x) == is.na(x0), na.rm = TRUE))  # cells with the same NA status
  print(sum(is.na(x) != is.na(x0), na.rm = TRUE))  # cells whose NA status differs
}
On Monday, February 13, 2017 at 2:40 AM, kkprinceton wrote:
Thanks for the suggestions. I'm working on incorporating them in this new branch: https://github.com/kosukeimai/wru/tree/update-surname-handling.
Hello Kabir and Kosuke,
I am revising the handling of the input parameters in the update-surname-handling branch. Once that is done, should I submit the changes directly to the update-surname-handling branch? Thanks,
-Hubert
Create a new branch off of it and then open a pull request.
We have incorporated @soodoku's suggestions. The 2010 surname list is now the default, but there is an option to use the 2000 list. There is a new clean.surnames option (default = TRUE) that cleans surnames before matching them to the list. And the old voters argument is now just called voter.file. Additionally, we have tried to improve error handling and simplify print messages.
Dear All,
A few minor suggestions:
See here for 2010: http://www.census.gov/topics/population/genealogy/data/2010_surnames.html
Say there is a last name like 'canes wrone': we match 'canes' and then we match 'wrone', so there can potentially be two matches in the census data. It may be a good idea to perform multiple left_joins.
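A rough sketch of the multiple-join idea, using base R `merge(..., all.x = TRUE)` as the left join. Everything here is illustrative: the lookup table is a toy stand-in for the census surname list, the probabilities are made up, and the column names (`prob`, `part1`, `part2`) are hypothetical rather than wru's.

```r
# Toy stand-in for the census surname list; the probabilities are made up.
surname_list <- data.frame(
  surname = c("CANES", "WRONE", "SMITH"),
  prob    = c(0.70, 0.90, 0.71),
  stringsAsFactors = FALSE
)
voters_demo <- data.frame(
  id      = 1:3,
  surname = c("canes wrone", "Smith", "Nguyen"),
  stringsAsFactors = FALSE
)

# Split each surname on spaces or hyphens and keep the first two parts.
parts <- strsplit(toupper(voters_demo$surname), "[ -]+")
voters_demo$part1 <- vapply(parts, function(p) p[1], character(1))
voters_demo$part2 <- vapply(
  parts, function(p) if (length(p) > 1) p[2] else NA_character_, character(1)
)

# One left join per name part; unmatched parts get NA.
lookup1 <- setNames(surname_list, c("part1", "prob_1"))
lookup2 <- setNames(surname_list, c("part2", "prob_2"))
matched <- merge(voters_demo, lookup1, by = "part1", all.x = TRUE)
matched <- merge(matched, lookup2, by = "part2", all.x = TRUE)
matched <- matched[order(matched$id), c("id", "surname", "prob_1", "prob_2")]
matched
```

How two matches for the same person should be combined (first match, average, or something else) is left open in this sketch.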