palewire / django-calaccess-campaign-browser

A Django app to refine, review and republish campaign finance data drawn from the California Secretary of State’s CAL-ACCESS database
http://django-calaccess-campaign-browser.californiacivicdata.org
MIT License
17 stars 12 forks source link

creating a names and identities table #219

Open rkiddy opened 9 years ago

rkiddy commented 9 years ago

Not yet ready to merge, but I thought it worth sharing. The filling of the names table actually happens in a reasonable amount of time.

I was creating a "name" column by using concat() on namt, ' ', namf, ' ', naml, ' ', nams. It was amazingly hard to figure out a quick way to get double spaces out of just under 20 million rows. So, I am not doing that. I have 4 things that could be equal to '' or not equal to ''. So, I do a filter on each of the 15 combinations of those values. This is the count of all combinations of the four, minus the one where:

namt = '' and namf = '' and naml = '' and nams = ''

I am now running a single SQL statement to fill the identities table with the distinct values of the name column from the names table. That is actually quick to do.

And now I am running a single SQL statement to join the rows in the identity table and names table that are identical. The single-statement form of that is taking forrrrrrrrrever.

coveralls commented 9 years ago

Coverage Status

Coverage remained the same at 32.14% when pulling 362700e08603a39d2064139363e9a03842d34466 on rkiddy:name-and-identities-table-support into 326053dde103058d9571a0f5f18521b05ccdc60c on california-civic-data-coalition:master.