Not yet ready to merge, but I thought it worth sharing. The filling of the names table actually happens in a reasonable amount of time.
I was creating a "name" column by using concat() on namt, ' ', namf, ' ', naml, ' ', nams. That leaves double spaces wherever one of the fields is empty, and it was amazingly hard to find a quick way to get those out of just under 20 million rows. So, I am not doing that. Instead, each of the 4 fields is either equal to '' or not equal to '', so I run one filter for each of the 15 combinations of those values. That is all 2^4 = 16 combinations of the four, minus the one where:
namt = '' and namf = '' and naml = '' and nams = ''
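To make the 15 filters concrete, here is a rough sketch of how they could be enumerated. It only builds the WHERE clause and the concat() expression for each combination; the helper name `name_filters` is made up for illustration, and the code in this branch may be organized differently.

```python
from itertools import product

FIELDS = ("namt", "namf", "naml", "nams")


def name_filters():
    """Yield (where_clause, name_expr) for each of the 15 non-empty patterns.

    Hypothetical helper for illustration only.
    """
    for present in product((False, True), repeat=len(FIELDS)):
        if not any(present):
            continue  # skip the one case where all four fields are ''
        where = " and ".join(
            f"{field} {'<>' if used else '='} ''"
            for field, used in zip(FIELDS, present)
        )
        kept = [field for field, used in zip(FIELDS, present) if used]
        if len(kept) > 1:
            # Only non-empty fields are joined, so no double spaces can appear.
            expr = "concat(" + ", ' ', ".join(kept) + ")"
        else:
            expr = kept[0]
        yield where, expr


filters = list(name_filters())
```

Each pair could then drive one insert or update, so every row is touched by exactly one of the 15 statements.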
I am now running a single SQL statement to fill the identities table with the distinct values of the name column from the names table. That is actually quick to do.
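For reference, the distinct-values fill is roughly the shape below. This is a sketch against an in-memory SQLite database so it can be run anywhere; the column layout (an `id` plus a `name` on each table) and the sample rows are assumptions, not the real schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed minimal schema; the real tables have more columns.
    create table names (id integer primary key, name text);
    create table identities (id integer primary key, name text);
""")
# Made-up sample rows with one repeated name.
conn.executemany(
    "insert into names (name) values (?)",
    [("JOHN SMITH",), ("JANE DOE",), ("JOHN SMITH",)],
)

# The single statement: one identities row per distinct name.
conn.execute("insert into identities (name) select distinct name from names")

identity_count = conn.execute("select count(*) from identities").fetchone()[0]
```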
And now I am running a single SQL statement to join the rows in the identities table and the names table that have identical names. The single-statement form of that is taking forrrrrrrrrever.
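A sketch of what that join could look like, again on in-memory SQLite with made-up rows. The `identity_id` column on names and the index on `identities.name` are both assumptions for illustration. I have not measured whether an index like that would help on the real 20 million rows, but it might be worth checking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: identity_id is a hypothetical link column.
    create table names (id integer primary key, name text, identity_id integer);
    create table identities (id integer primary key, name text);
    -- Assumption: an index on the join column, so each lookup avoids a full scan.
    create index identities_name_idx on identities (name);
""")
conn.executemany(
    "insert into names (name) values (?)",
    [("JOHN SMITH",), ("JANE DOE",), ("JOHN SMITH",)],
)
conn.execute("insert into identities (name) select distinct name from names")

# Single statement linking each names row to the identity with the same name.
conn.execute("""
    update names
    set identity_id = (
        select i.id from identities i where i.name = names.name
    )
""")

unmatched = conn.execute(
    "select count(*) from names where identity_id is null"
).fetchone()[0]
linked = conn.execute(
    "select count(distinct identity_id) from names"
).fetchone()[0]
```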
Coverage remained the same at 32.14% when pulling 362700e08603a39d2064139363e9a03842d34466 on rkiddy:name-and-identities-table-support into 326053dde103058d9571a0f5f18521b05ccdc60c on california-civic-data-coalition:master.