Open rjweiss opened 8 years ago
These years also very rarely have the name field populated.
Also, 1912 is the only year with its own gender field.
No longer see names with row numbers at the beginning of the name field (issue discussed in person).
> data = read_csv('~/Box Sync/CaliforniaGreatRegisters/working_data/alameda_successes.txt')
> alameda_names = data$name
> alameda_names[grepl('^\\d', alameda_names)]
[1] "175 Rose 1r s Iren 1"
[2] "211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 211 212 21 3 21 245 2 6 247 248 249 250 2"
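The two hits above look like different failure modes: one is a row number followed by a garbled name, the other is nothing but row numbers. A minimal sketch for telling them apart, in base R on a toy vector. `names_raw`, `digit_share`, and the 0.9 cutoff are all assumptions for illustration and not part of the existing pipeline.

```r
# Toy stand-in for data$name; the second entry is a shortened version of hit [2]
names_raw <- c("175 Rose 1r s Iren 1",
               "211 212 213 214 215",
               "Abt Charles",
               NA)

# TRUE where the name starts with a digit; NA names count as FALSE
starts_with_digit <- !is.na(names_raw) & grepl("^[0-9]", names_raw)

# Share of non-space characters that are digits (hypothetical helper)
digit_share <- function(x) {
  chars <- gsub("[[:space:]]", "", x)
  digits <- gsub("[^0-9]", "", chars)
  ifelse(is.na(x) | nchar(chars) == 0, NA_real_, nchar(digits) / nchar(chars))
}

# Assumed cutoff: above 90% digits means the field holds only row numbers
only_row_numbers <- starts_with_digit & digit_share(names_raw) > 0.9

print(data.frame(names_raw, starts_with_digit, only_row_numbers))
```

Entries flagged by `starts_with_digit` but not by `only_row_numbers` probably still contain a recoverable name after the leading number.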
Confirmed the distribution of empty name fields. The plot below shows the number of NA names over the total number of rows, per roll number-year. The empty names appear to mostly be the result of the row number carrying over into an address field.
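library(plyr)     # ddply, .(), join
library(ggplot2)  # ggplot, geom_point
# keep only the columns needed for the check
alameda_names = dplyr::select(data, name, rollnum, pagenum)
# per roll: count of NA names, total rows, and the NA rate
namefails = ddply(alameda_names, .(rollnum), summarise,
                  na = sum(is.na(name)),
                  n = length(name),
                  rate = na/n)
# alamedayears is defined elsewhere; it needs rollnum and year columns
namefails = join(alamedayears, namefails, by='rollnum')
ggplot(namefails, aes(x=year, y=rate)) + geom_point()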
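For anyone who wants to sanity-check the per-roll summary without loading plyr, here is a small base R sketch on made-up rows. `toy` and `namefails_toy` are hypothetical names, and the values are invented purely to make the arithmetic easy to verify by hand.

```r
# Invented rows: roll 1 has 2 of 3 names missing, roll 2 has 1 of 2 missing
toy <- data.frame(
  rollnum = c(1, 1, 1, 2, 2),
  name = c("Abt Charles", NA, NA, "Arada Seymour", NA),
  stringsAsFactors = FALSE
)

# Count NA names and total rows per roll
na_count <- tapply(is.na(toy$name), toy$rollnum, sum)
n_rows <- tapply(toy$name, toy$rollnum, length)

# Same columns as the ddply summary: na, n, rate
namefails_toy <- data.frame(
  rollnum = as.numeric(names(na_count)),
  na = as.vector(na_count),
  n = as.vector(n_rows),
  rate = as.vector(na_count / n_rows)
)

print(namefails_toy)
```

If the real `namefails` disagrees with a hand count like this on a single roll, the join on `rollnum` is the first place I would look.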
Fixed the row number transcription error. The rate of empty name fields looks much better.
I think your fix pushed the problem into the address field:
dat %>% filter(yr==1916) %>% select(recordnum, pagenum, occupation, name, address) %>% head
  recordnum pagenum occupation name address
1 72921 9 housewife Ande rrm Mrs J osephine 3234 Enciinul ave
2 72922 9 bridge tender f Arada Seymour foot of Peach st
3 72923 9 housewife i 4 lierach Mrs Emma 1226 High st
4 72924 9 holusewife IR 35ehiergen rs Rehla n 3227 Madisola sat
5 72925 9 niurise l'i'Scliriiidt Jiacob foot of Poech tailor p 137 SMliolz lMrs Itertlih 1 3248 Encinnl ave
6 72926 12 retired Abt Charles 3272 Briggs ave
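One rough way to find rows like record 72925, where a second record seems to have been swallowed into the address, is to count digit runs in the address and flag anything with more than one. This is only a sketch: the toy `addresses` vector reflects my reading of where the name column ends and the address column begins in the output above, and the "more than one digit run" rule is an assumed heuristic that will also flag legitimate addresses on numbered streets.

```r
# Toy addresses, split from the head() output by eye (an assumption)
addresses <- c(
  "3234 Enciinul ave",
  "foot of Peach st",
  "1226 High st",
  "foot of Poech tailor p 137 SMliolz lMrs Itertlih 1 3248 Encinnl ave"
)

# Number of separate digit runs in each address
digit_runs <- lengths(regmatches(addresses, gregexpr("[0-9]+", addresses)))

# Assumed rule: more than one digit run suggests a merged record
suspect <- digit_runs > 1

print(data.frame(addresses, digit_runs, suspect))
```

Plotting the share of `suspect` rows per roll number-year, the same way as the empty-name rate, should show whether the problem moved rather than went away.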
Some years have an extra "gender" field, which should be counted separately.