rjweiss / CaliforniaGreatRegister


Good Rows Being Dropped #22

Open bspahn opened 8 years ago

bspahn commented 8 years ago

alameda_cleaned_v2.txt looks to have 85% more records than the new Alameda file in working_data.
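For reference, that raw count comparison can be reproduced with something like the following (a minimal sketch; the paths are the same ones used in the code further down):

library(readr)
old <- read_csv("~/Box Sync/CaliforniaGreatRegisters/alameda_cleaned_v2.txt")
new <- read_csv("~/Box Sync/CaliforniaGreatRegisters/working_data/alameda_successes.txt")
nrow(old) / nrow(new) - 1  # fraction of extra records in the old file relative to the new one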

The rows that are getting dropped don't seem like "bad" rows either.

To get a sense of how many pages are being dropped:

require(ggplot2)
require(dplyr)
require(readr)

# Per-roll record counts in the old cleaned file
old <- read_csv("~/Box Sync/CaliforniaGreatRegisters/alameda_cleaned_v2.txt")
old.n <- old %>% group_by(rollnum) %>% summarise(n.alameda_cleaned_v2.txt = length(pid))

# Per-roll record counts in the new working_data file
new <- read_csv("~/Box Sync/CaliforniaGreatRegisters/working_data/alameda_successes.txt")
new.n <- new %>% filter(county == "alameda") %>% group_by(rollnum) %>% summarise(n.alameda_successes.txt = length(pid))

# Compare old vs. new counts per roll; points off the 45-degree line indicate dropped records
plot.df <- left_join(old.n, new.n, by = "rollnum")
ggplot(plot.df, aes(x = n.alameda_successes.txt, y = n.alameda_cleaned_v2.txt, label = rollnum)) +
  geom_text() + coord_equal() + geom_abline(intercept = 0, slope = 1) + theme_bw()
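Points above the identity line are rolls where the old file has more records than the new one. To rank the worst-affected rolls from the same plot.df, something like this should work (a rough sketch):

plot.df %>%
  mutate(n_dropped = n.alameda_cleaned_v2.txt - coalesce(n.alameda_successes.txt, 0L)) %>%  # rolls missing entirely from the new file count as 0
  arrange(desc(n_dropped)) %>%
  head(10)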

It looks like records are being dropped even within the same page. For instance, look at the Rubensteins. They appear in the old file but not in the new one, despite having a clean-looking record.

old %>% data.frame %>% filter(rollnum == 29 & pagenum == 52) %>% select(name, address, occupation, pid)
new %>% data.frame %>% filter(rollnum == 29 & pagenum == 52) %>% select(name, address, occupation, pid)
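One way to list exactly which records on that page survive in the old file but are missing from the new one is an anti_join on pid (a sketch; it assumes pid is a stable identifier shared across both files):

old %>%
  filter(rollnum == 29 & pagenum == 52) %>%
  anti_join(new, by = "pid") %>%  # rows on this page with no matching pid in the new file
  select(name, address, occupation, pid)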

bspahn commented 8 years ago

[screenshot attachment]