This PR introduces yet more tweaks to improve performance of household inference in households.py.
Summary:
Read only the PII columns that are actually required
Keep the "exploded address" columns in a separate DataFrame so we can delete the whole thing once those columns are no longer needed
Dump the household matched pairs to a file so that we can restart from there if the process runs out of memory
Add a --pairsfile argument that specifies a previously dumped pairs file to restart from
Write the household_pii and mapping files at the same time, rather than storing the household_pii rows in an array to write later
Add some additional debug statements
Keep using MultiIndexes wherever possible rather than converting to lists of tuples because the performance seems to be better all around
Add the [extras] option to the textdistance dependency because it pulls in additional libraries that speed up some algorithms, e.g. Jaro-Winkler. In my testing this gave about a 25% speedup.
Delete objects as soon as they are no longer needed and aggressively run garbage collection
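
For reviewers, here is a minimal sketch of how the pairs checkpoint and the --pairsfile restart could fit together. It is not the code in households.py: the function names (`compute_pairs`, `save_pairs`, `load_pairs`, `get_pairs`, `build_parser`), the CSV format, the `id_1`/`id_2` level names, and the toy "same zip" matching rule are all assumptions made for illustration. Only the --pairsfile flag and the use of a MultiIndex for the pairs come from this PR.

```python
import argparse
from pathlib import Path
from typing import Optional

import pandas as pd


def build_parser() -> argparse.ArgumentParser:
    """Parser with the --pairsfile flag; everything else is omitted."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pairsfile",
        type=Path,
        default=None,
        help="previously dumped pairs file to restart from",
    )
    return parser


def compute_pairs(pii: pd.DataFrame) -> pd.MultiIndex:
    """Hypothetical stand-in for the expensive matching step.

    Pairs every two rows that share a zip code (assumes at least one pair).
    """
    tuples = [
        (a, b)
        for _, group in pii.groupby("zip")
        for i, a in enumerate(group.index)
        for b in group.index[i + 1:]
    ]
    return pd.MultiIndex.from_tuples(tuples, names=["id_1", "id_2"])


def save_pairs(pairs: pd.MultiIndex, path: Path) -> None:
    """Dump the pairs as a two-column CSV (format is an assumption)."""
    pairs.to_frame(index=False).to_csv(path, index=False)


def load_pairs(path: Path) -> pd.MultiIndex:
    """Rebuild the MultiIndex from a dumped pairs file."""
    return pd.MultiIndex.from_frame(pd.read_csv(path))


def get_pairs(
    pii: pd.DataFrame, pairsfile: Optional[Path], checkpoint: Path
) -> pd.MultiIndex:
    """Reuse dumped pairs when given; otherwise compute and dump them."""
    if pairsfile is not None:
        return load_pairs(pairsfile)
    pairs = compute_pairs(pii)
    save_pairs(pairs, checkpoint)  # checkpoint before the memory-heavy steps
    return pairs
```

In this sketch, a run that dies after the dump can be restarted by passing the checkpoint path as --pairsfile, which skips the matching step entirely.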