rjweiss / CaliforniaGreatRegister


occupations that start with st have st dropped #8

Open bspahn opened 8 years ago

bspahn commented 8 years ago

   pagenum rollnum name                  address            occupation pid county  yr   mrs   miss fem  recordnum lastname firstname
10 21      52      Seward Miss Glennis E 1458 Madison st st enographer Dem alameda 1940 FALSE TRUE TRUE 54275     seward   glennis
12 762     47      Seward Miss Mary A    14 Scenic ave st   udent      Rep alameda 1940 FALSE TRUE TRUE 1137339   seward   mary

rjweiss commented 8 years ago

I think adding pagetext = re.sub(r'st\s*udent', 'student', pagetext)

to rules.extract_newlines() would fix some of this.
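The substitution above only covers "student"; the "stenographer" row in the report would still be split. A minimal sketch of a more general version follows, assuming the page text is a plain string at that point. The helper name `rejoin_split_st` and the list `ST_OCCUPATION_TAILS` are hypothetical and not part of the repository; the list would need to be extended with whatever other occupations turn up in the data.

```python
import re

# Hypothetical list: the part of each occupation that follows the leading "st".
# Only the two cases seen in this issue are included.
ST_OCCUPATION_TAILS = ["udent", "enographer"]

# Match a standalone "st", whitespace, then one of the known tails.
# The leading \b keeps words that merely end in "st" (e.g. "West") from matching.
_SPLIT_ST = re.compile(
    r"\bst\s+(" + "|".join(re.escape(t) for t in ST_OCCUPATION_TAILS) + r")\b"
)


def rejoin_split_st(pagetext: str) -> str:
    """Rejoin occupations whose leading 'st' was split off by a space."""
    return _SPLIT_ST.sub(r"st\1", pagetext)


if __name__ == "__main__":
    print(rejoin_split_st("1458 Madison st st enographer Dem"))
    print(rejoin_split_st("14 Scenic ave st udent Rep"))
```

Because the pattern requires a known tail right after the "st", a street suffix such as the first "st" in "Madison st st enographer" is left alone. Unlike `st\s*udent`, this only rewrites text that actually contains the stray whitespace.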