rjweiss / CaliforniaGreatRegister


Good Rows Being Dropped #22

Open bspahn opened 8 years ago

bspahn commented 8 years ago

alameda_cleaned_v2.txt looks to have 85% more records than the new Alameda file in working_data.
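For reference, that raw count comparison can be reproduced with something like the following (a minimal sketch; the paths are the same ones used in the code further down):

library(readr)
old <- read_csv("~/Box Sync/CaliforniaGreatRegisters/alameda_cleaned_v2.txt")
new <- read_csv("~/Box Sync/CaliforniaGreatRegisters/working_data/alameda_successes.txt")
nrow(old) / nrow(new) - 1  # fraction of extra records in the old file relative to the new one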

The rows that are getting dropped don't seem like "bad" rows either.

To get a sense of how many pages are being dropped:

require(ggplot2)
require(dplyr)
require(readr)

# Per-roll record counts in the old cleaned file
old <- read_csv("~/Box Sync/CaliforniaGreatRegisters/alameda_cleaned_v2.txt")
old.n <- old %>% group_by(rollnum) %>% summarise(n.alameda_cleaned_v2.txt = length(pid))

# Per-roll record counts in the new working_data file
new <- read_csv("~/Box Sync/CaliforniaGreatRegisters/working_data/alameda_successes.txt")
new.n <- new %>% filter(county == "alameda") %>% group_by(rollnum) %>% summarise(n.alameda_successes.txt = length(pid))

# Compare old vs. new counts per roll; points off the 45-degree line indicate dropped records
plot.df <- left_join(old.n, new.n, by = "rollnum")
ggplot(plot.df, aes(x = n.alameda_successes.txt, y = n.alameda_cleaned_v2.txt, label = rollnum)) +
  geom_text() + coord_equal() + geom_abline(intercept = 0, slope = 1) + theme_bw()
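Points above the identity line are rolls where the old file has more records than the new one. To rank the worst-affected rolls from the same plot.df, something like this should work (a rough sketch):

plot.df %>%
  mutate(n_dropped = n.alameda_cleaned_v2.txt - coalesce(n.alameda_successes.txt, 0L)) %>%  # rolls missing entirely from the new file count as 0
  arrange(desc(n_dropped)) %>%
  head(10)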

It looks like records are being dropped even within the same page. For instance, look at the Rubensteins. They appear in the old file but not in the new one, despite having a clean-looking record.

old %>% data.frame %>% filter(rollnum == 29 & pagenum == 52) %>% select(name, address, occupation, pid)
new %>% data.frame %>% filter(rollnum == 29 & pagenum == 52) %>% select(name, address, occupation, pid)
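One way to list exactly which records on that page survive in the old file but are missing from the new one is an anti_join on pid (a sketch; it assumes pid is a stable identifier shared across both files):

old %>%
  filter(rollnum == 29 & pagenum == 52) %>%
  anti_join(new, by = "pid") %>%  # rows on this page with no matching pid in the new file
  select(name, address, occupation, pid)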

bspahn commented 8 years ago

[screenshot attachment]