rjweiss / CaliforniaGreatRegister


No Names in 1912-1928 San Bernardino files #20

Open bspahn opened 8 years ago

bspahn commented 8 years ago

The name field is very rarely populated. I'm using...

~/Box Sync/CaliforniaGreatRegisters/staging_data/sanbernardino_successes.txt

rjweiss commented 8 years ago

I just reran the extraction job with the latest version of the code. Can you confirm whether you still see this problem? I overwrote that data, and there should now be a file with 148907 rows in it.
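
One way to run that confirmation, as a minimal sketch: the helper `check_extraction` and the toy data frame are hypothetical names made up for illustration, and the only assumption about the real file is that it has a `name` column. It compares the row count against the expected value and reports the share of rows with a missing name.

```r
# Hypothetical helper (not part of the repo): checks the row count
# and computes the fraction of rows whose `name` is NA.
check_extraction <- function(df, expected_rows) {
  list(
    rows_match = nrow(df) == expected_rows,
    missing_name_rate = mean(is.na(df$name))
  )
}

# Toy data standing in for the real file (values invented for illustration).
toy <- data.frame(
  rollnum = c(1, 1, 2, 2),
  name = c("A. Smith", NA, NA, "B. Jones"),
  stringsAsFactors = FALSE
)

result <- check_extraction(toy, expected_rows = 4)
print(result)
```

Against the real data this would be called with the data frame read from the successes file and `expected_rows = 148907`.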

rjweiss commented 8 years ago

From now on refer to files in ~/Box Sync/CaliforniaGreatRegisters/working_data/sanbernardino_successes.txt

rjweiss commented 8 years ago
library(readr)    # read_csv
library(plyr)     # ddply, join
library(ggplot2)  # ggplot

data = read_csv('~/Box Sync/CaliforniaGreatRegisters/staging_data/sanbernardino_successes.txt')
years = read_csv('/Users/rweiss/Documents/Stanford/ancestry/CaliforniaGreatRegister/year_dates.csv')
sbyears = years[years$county %in% 'sanbernardino',]
# Overall share of rows with a missing name
mean(is.na(data$name)) # Fix the earlier rolls
# Share of missing names per roll (ddply names the result column V1)
name_rates = ddply(data, .(rollnum), function(df) {
  mean(is.na(df$name))
})
name_rates = join(name_rates, sbyears, by='rollnum')
ggplot(name_rates, aes(x=year, y=V1)) + geom_point() + scale_y_continuous(limits=c(0,1))

[Image: sbnamefails]