kosukeimai / wru

Who Are You? Bayesian Prediction of Racial Category Using Surname and Geolocation

Error result when voter file populations are located in zero population tracts #151

Open csfowler opened 3 months ago

csfowler commented 3 months ago

There are special tracts (often with tract id codes in the 98---- range) that denote low population areas like airports, water bodies, parks, etc (census documentation here https://www2.census.gov/geo/pdfs/partnerships/psap/G-650.pdf). People can still live here (and they do), but especially with differential privacy there is a good chance these tracts show as zero population. I am running a voter file that breaks because the assigned race probability for other is NaN. Essentially the zero value gets passed all through the other racial categories and then breaks when the pr.other value is being calculated. To be clear, these are people located within valid census geographies, so skip.bad.geos doesn't move past them. I have to believe that this happens with some regularity at the smaller geographies as well. I would hope that the non-exclusion component that gets added in the fBISG framework would handle this possibility, but it seems to break on this use case. I am running into the problem using North Carolina data from L2 which I think was used to test the package initially anyway, so maybe a solution already exists?

1beb commented 3 months ago

I've also run into this one. I excluded them from the sample I was trying to predict race for, then added them back in with the state (or closest geography) race distribution for the extremely small sample of people who were living there. You should know too that the census data will not be robust in those locations because of differential privacy. There's a recent paper that looks at this a bit too: https://www.science.org/doi/10.1126/sciadv.adl2524. You'll have to make a judgement call.

Understanding this, how would you have expected the package to behave?

csfowler commented 3 months ago

I am shifting people to the nearest populated tract, but not sure how I feel about that. What you are doing sounds equally defensible. I tend to think of population sorting at the sub-county level as pretty important, so I don't like jumping up a scale, but it is certainly no better or worse than what I am doing. I am aware of the differential privacy stuff; I work with individual census responses and small geographies all the time and it is a mess at these scales. As for expected behaviour... if there is no information about race that can be generated from the census tract data, then I would default to the model prediction made without the benefit of the census data (e.g. name only) and report the number of records adjudicated in this way. Not sure how reasonable this is within the workflow of calculations though. Thanks for the quick response today and yesterday.
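
To make the "default to name only and report the count" idea concrete, here is a minimal sketch in base R. It is not wru code: the matrices `p_name` and `p_geo` and all of their values are made up for illustration, and the real package stores these quantities differently.

```r
# Toy illustration only; none of these objects come from wru.
races <- c("whi", "bla", "his", "asi", "oth")

# Hypothetical name-only probabilities for two records.
p_name <- matrix(c(0.70, 0.10, 0.10, 0.05, 0.05,
                   0.20, 0.50, 0.20, 0.05, 0.05),
                 nrow = 2, byrow = TRUE, dimnames = list(NULL, races))

# Hypothetical geographic term for each record's tract.
# Record 2 sits in a tract with zero recorded population, so its row is all zeros.
p_geo <- matrix(c(0.50, 0.20, 0.20, 0.05, 0.05,
                  0,    0,    0,    0,    0),
                nrow = 2, byrow = TRUE, dimnames = list(NULL, races))

unnorm <- p_name * p_geo
no_geo_info <- rowSums(p_geo) == 0

# Fall back to the name-only probabilities where the tract carries no information.
unnorm[no_geo_info, ] <- p_name[no_geo_info, , drop = FALSE]
post <- unnorm / rowSums(unnorm)

# Report how many records were adjudicated by the fallback.
n_fallback <- sum(no_geo_info)
message(n_fallback, " record(s) predicted from names only")
```

Without the fallback line, record 2 would be 0/0 in every column and come out as NaN.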

1beb commented 3 months ago

fBISG defaults to the national race distribution (which it uses as a prior). I don't know if that would be desirable behavior in some of these higher tract ranges. I think some of them also cover military installations and national parks. We will have to think about this a bit. @solivella do you have an opinion/experience here?

solivella commented 3 months ago

Equation 6 of https://www.science.org/doi/10.1126/sciadv.adc9824 implies that, in instances in which the census counts are zero (for whatever reason), the predicted probability will at least default to the name-only race probability. If the name is not in the dictionary, then the software (is expected to) default to the national race distribution, as @1beb mentioned. Can you confirm whether the NaN's occur for records whose names are/are not in the name-race dictionary you are using, @csfowler?

csfowler commented 3 months ago

I can't confirm explicitly as I don't see match status in the output while stepping through in debug mode. A few look like they might have one or two of their names missing, but most should have all three in the register. Not comfortable putting the example names here for all the world to see.

solivella commented 3 months ago

Sorry, I should have been clearer, @csfowler. I wouldn't expect you to post names here. If you are not providing dictionaries of your own, you can

  1. Filter the results of predict_race() by something like %>% filter(!is.finite(pred.oth))
  2. Manually check whether all of the filtered records have the names you are using for prediction (i.e., surname only; surname and first; or surname, first, and middle) present in the internal name-race dictionaries, which can be found here for surnames, here for first, and here for middle. You can do this easily with a call to %in%, and summarize with a table(). Again, no need to share any names.

predict_race() should always return a finite value if impute.missing=TRUE (the default). So the behavior you are seeing is not expected. The steps above can help determine where the NaN is coming from. If all records with NaN in the Other category are names that are not in the name-race dictionary, that's an easy fix. If they are not, then the issue will be a bit more lower level. What you find through the steps above would help us figure out where to start looking.
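
The two steps above could look roughly like the sketch below, written in base R so it does not depend on dplyr. Both `preds` and `surname_dict` are toy stand-ins: in practice `preds` would be the predict_race() output and `surname_dict` the surname column of whichever dictionary is in use. The names shown are invented.

```r
# Toy stand-ins for the prediction output and the surname dictionary.
preds <- data.frame(
  surname  = c("SMITH", "GARCIA", "ZZYZX", "NGUYEN"),
  pred.oth = c(0.02, 0.01, NaN, NaN)
)
surname_dict <- c("SMITH", "GARCIA", "NGUYEN")

# Step 1: keep only the records whose "other" probability is not finite.
bad <- preds[!is.finite(preds$pred.oth), ]

# Step 2: check whether those surnames appear in the dictionary, then summarize.
in_dict <- bad$surname %in% surname_dict
table(in_dict)
```

Only the counts from table() need to be shared, not the names themselves.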

csfowler commented 3 months ago

I think I can see where the expected behavior is not emerging. I can't do what you asked since predict_race isn't finishing to give me the result. Let me walk through it.

The error is triggered by NA's in the race.init run of the BISG model where impute.missing is set to true. The stop message indicates it is probably bad geography, but I have checked for that. Here is the workflow I have so far:

Prior to running the model I concatenate county and tract in my voter file (vf) to create a new column FIPS. I do the same thing with county and tract in 'cen' the object returned by cen <- get_census_data(key, states="NC", year=2020, census.geo='tract')

summary( vf$FIPS %in% cen$FIPS) shows that all of my geographic units in the voter file can also be found in the census geography file I will use in my function call.
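
A minimal sketch of that check, extended to flag voters who sit in zero-population tracts. The column names `county`, `tract`, and `pop`, and the flat `cen_tracts` data frame, are assumptions for illustration; the object returned by get_census_data() is structured differently and is not reproduced here.

```r
# Toy voter file and tract table with hypothetical column names and values.
vf <- data.frame(county = c("001", "001", "183"),
                 tract  = c("020100", "980100", "052000"))
cen_tracts <- data.frame(county = c("001", "001", "183"),
                         tract  = c("020100", "980100", "052000"),
                         pop    = c(4210, 0, 3875))

# Build the combined identifier on both sides.
vf$FIPS <- paste0(vf$county, vf$tract)
cen_tracts$FIPS <- paste0(cen_tracts$county, cen_tracts$tract)

# Every voter geography is present in the census table...
all_matched <- all(vf$FIPS %in% cen_tracts$FIPS)

# ...but presence alone does not reveal tracts with no recorded population.
zero_pop_fips <- cen_tracts$FIPS[cen_tracts$pop == 0]
in_zero_pop <- vf$FIPS %in% zero_pop_fips
table(in_zero_pop)
```

This is why a membership check passes even though some records have nothing to condition on.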

Then I call: predictions <- predict_race(voter.file = vf, names.to.use = 'surname, first, middle', census.geo = 'tract', census.data = cen, impute.missing = TRUE, year=2020, model='fBISG' )

Stepping through that function call in debug mode, I can see that the vector returned from the internal call to predict_race (race.init) has 57 NA's in it. To be clear, this is the call to the BISG model, but it is run with impute.missing hard coded to TRUE, so that may be some lower level functionality not working as expected. Looking into the NA values, they are all generated by tracts that have zero population in cen. From here the code triggers a stop because of the NA's in race.init. If, still in debug mode, I manually change the 57 values to '1', then the function seems to proceed as expected (still running). Hope this helps.

solivella commented 3 months ago

@csfowler this is super helpful. Thank you! Can I ask that you use trace(wru:::predict_race_new, at=19) before you call predict_race() in your workflow, and do two things right as you step through the browser that comes up?

  1. Do a table(miss_ind) right after it gets a value in the line evaluated immediately after the browser begins. Compare that to table(!is.finite(preds$c_oth_last)), and let us know if the two tables are returning the same numbers.
  2. Check the head(voter.file[miss_ind, grep("_last$", names(voter.file))]); do you see missing values in that print out?

csfowler commented 3 months ago

I get 62 TRUE on table(miss_ind) and 119 TRUE on table(!is.finite(preds$c_oth_last)). The difference is the 57 problem values. FWIW, I can see the Inf values attached to voter.file when it comes out of census_helper_new.

solivella commented 3 months ago

Thank you, @csfowler! This is exactly the information I needed to pinpoint the issue. I will have an update for you on this tomorrow (along with a PR to fix it). Please stand by.

1beb commented 3 months ago

Thank you both. This is on its way to CRAN. For now, you can use remotes::install_github("kosukeimai/wru"), as that will install the bugfix version. Closing this now :+1:

csfowler commented 3 months ago

Close, but not quite. The race.init vector still triggers the stop condition because the results in pred.oth are not NA but Inf. (Haven't figured that out yet.) When you use dplyr::coalesce() to get rid of the NA values, it does not treat Inf as a missing value, and so the Inf is retained. Can I suggest a step such as the one here (https://stackoverflow.com/questions/12188509/cleaning-inf-values-from-an-r-dataframe) that converts Inf to NA first?
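
A small base R illustration of the point, with made-up values. The `fallback` number is hypothetical, and ifelse() stands in for whatever NA-filling step is used; this is a sketch of the idea, not the package's code.

```r
x <- c(0.2, NA, Inf, NaN)

is.na(x)      # FALSE  TRUE FALSE  TRUE : Inf is not counted as missing
is.finite(x)  #  TRUE FALSE FALSE FALSE : Inf, NA and NaN all fail this test

# Convert every non-finite entry to NA first, so NA-based imputation sees it.
x_clean <- x
x_clean[!is.finite(x_clean)] <- NA

fallback <- 0.05  # hypothetical replacement value
x_filled <- ifelse(is.na(x_clean), fallback, x_clean)
```

Testing with is.finite() rather than is.na() catches Inf, NA, and NaN in one pass.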

solivella commented 3 months ago

@csfowler just confirming that your original statement '...that breaks because the assigned race probability for other is NaN' is now incorrect, as you are seeing an Inf instead of an NaN. Is that right?

csfowler commented 3 months ago

Yes. At the very end of the function it was breaking because of a returned NaN, but within predict_race, and specifically as an output of the function named earlier in the thread (get_census-new? I am on my phone and can't see it), the value assigned to oth was Inf. So when you use coalesce, it accepts Inf as a numeric value.

csfowler commented 2 months ago

Pretty sure I have found the problem, and it might actually be a pretty important one with broader implications. Look in the function census_helper_new at the line where geoPopulations is created (118 in my browser):

  geoPopulations <- rowSums(census[, names(census) %in% vars])

vars is a list. When I run names(census) %in% vars by itself, the only columns that match are the first three (white, black, and hispanic), each of which has only one variable in it. Asian and other both return FALSE. This means that geoPopulations will be zero for a tract that only has asian and other people in it, setting up the conditions to return Inf. Changing the line to read

  geoPopulations <- rowSums(census[, names(census) %in% unlist(vars_)])

solves the problem. Sorry, I am not literate enough to just do this as a git thing.
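
The behavior can be reproduced in a few lines of base R. The variable codes and counts below are made up, and the final division is only an illustration of how a zero total leads to Inf, not a copy of what the package computes.

```r
# Made-up variable codes; the real census variable names differ.
vars <- list(whi = "P_001", bla = "P_002", asi = c("P_003", "P_004"))
census <- data.frame(P_001 = c(100, 0), P_002 = c(50, 0),
                     P_003 = c(10, 25), P_004 = c(5, 5))

# %in% converts a list table to character element by element, so the
# two-code "asi" entry becomes a single string that matches no column name.
sel_list <- names(census) %in% vars          # TRUE TRUE FALSE FALSE
sel_flat <- names(census) %in% unlist(vars)  # TRUE TRUE TRUE TRUE

pop_list <- rowSums(census[, sel_list])  # second tract sums to 0
pop_flat <- rowSums(census[, sel_flat])  # second tract sums to 30

# A positive count divided by the understated total gives Inf.
share <- census$P_003 / pop_list
```

Only list entries holding a single code survive the match, which lines up with the white, black, and hispanic columns being the ones found.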

csfowler commented 2 months ago

In the above comment I figured out how to get census_helper_new to produce the expected outcomes. The predict_race function is still failing for me because of this section of code in predict_race_new:

  else {
    preds <- voter.file[, grep("_last$", names(voter.file))] * voter.file[, grep("^r_", names(voter.file))]
    if (grepl("first", names.to.use)) {
      preds <- preds * voter.file[, grep("_first$", names(voter.file))]
    }
    if (grepl("middle", names.to.use)) {
      preds <- preds * voter.file[, grep("_middle$", names(voter.file))]
    }
  }

By the end of this chunk of code some of the rows in preds are all zeroes. The next chunk of code looks like:

  if (impute.missing) {
    for (i in ncol(preds)) {
      preds[, i] <- dplyr::coalesce(preds[, i], race.margin[i])
    }
  }
  preds <- preds/rowSums(preds)

impute.missing doesn't do anything because it treats the 0's as ordinary numbers, and then preds/rowSums(preds) produces some NaNs. The voter.file object that is returned from this function then has NaNs in it, which causes a failure based on NaNs in the initial race values.
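
A toy reproduction in base R, with invented probabilities for a single record. It also shows a side observation about the loop header as quoted in this thread: `for (i in ncol(preds))` iterates over one value, the column count, rather than over each column index. Whether the package source reads exactly this way is not something this sketch verifies.

```r
races <- c("whi", "bla", "his", "asi", "oth")

# Hypothetical surname and first-name terms that put mass on disjoint groups.
p_last  <- matrix(c(0.9, 0.1, 0,   0,   0), nrow = 1, dimnames = list(NULL, races))
p_first <- matrix(c(0,   0,   0.6, 0.4, 0), nrow = 1, dimnames = list(NULL, races))

preds <- p_last * p_first  # every entry is 0

anyNA(preds)  # FALSE: zeros are not missing, so NA-based imputation skips them

normalized <- preds / rowSums(preds)  # 0/0 in every column gives NaN

# The quoted loop header visits only the last column index.
visited <- integer(0)
for (i in ncol(preds)) visited <- c(visited, i)
```

Iterating over seq_len(ncol(preds)) would visit columns 1 through 5 instead.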

It seems likely to me that the first chunk above will produce all zeroes with some regularity. It is actually amazing to me that this doesn't happen more often. With some unusual last names you will get a 0% likelihood for several races coming out of the census and surname dictionaries. If the person happens to have a first name that is also unusual, but for a different racial group, then you get a row of zeroes.

A proposed fix would be to add a check prior to the for loop inside the if (impute.missing) block, such as preds[rowSums(preds)==0,] <- race.margin, which would essentially use the base race.margin if we don't have any information to go on.
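
A minimal sketch of that check on a toy matrix. The `race.margin` values are hypothetical. One caveat shapes the code: when more than one row is selected, R recycles a plain vector down the columns, so assigning `race.margin` directly would scramble the values across rows. The sketch therefore builds a row-wise matrix of the margin first.

```r
races <- c("whi", "bla", "his", "asi", "oth")
race.margin <- c(whi = 0.60, bla = 0.12, his = 0.18, asi = 0.06, oth = 0.04)  # hypothetical

# Toy predictions: rows 2 and 3 carry no information at all.
preds <- matrix(c(0.30, 0.10, 0.05, 0.02, 0.01,
                  0,    0,    0,    0,    0,
                  0,    0,    0,    0,    0),
                nrow = 3, byrow = TRUE, dimnames = list(NULL, races))

zero_rows <- rowSums(preds) == 0

# Fill each all-zero row with the margin, one full copy per row.
preds[zero_rows, ] <- matrix(race.margin, nrow = sum(zero_rows),
                             ncol = length(race.margin), byrow = TRUE)

preds <- preds / rowSums(preds)

# Number of records that received the margin.
n_imputed <- sum(zero_rows)
```

Keeping `n_imputed` around makes it easy to report how many records were filled this way.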