mitre / data-owner-tools

Tools for the Childhood Obesity Data Initiative (CODI) data owners and partners to use in record linkage
Apache License 2.0

Household inference improvements by splitting data #55

dehall closed 1 year ago

dehall commented 1 year ago

Improves the performance of household inference by splitting the data into chunks. Because the inference process is O(n^2), it can try to create huge objects in memory, then run out of memory and crash. Splitting the data into chunks reduces the overall memory requirements. The number of chunks is configurable with a command-line argument and defaults to 4. (In a perfect world I would offer guidance on choosing a split factor based on the size of your data and your system configuration, but sadly I don't have time for that level of analysis. I recommend testing and increasing the split factor until the run no longer crashes with an out-of-memory error.)
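To make the memory argument concrete, here is a minimal sketch (not the code in this PR) of why chunking helps a pairwise, O(n^2) process. The function names `split_into_chunks` and `pair_counts` are hypothetical, and the sketch assumes records are split into contiguous, near-equal slices; how the PR actually partitions the data is not described here.

```python
def split_into_chunks(records, n_chunks=4):
    """Split records into at most n_chunks contiguous, near-equal slices.

    Hypothetical helper for illustration; the default of 4 mirrors the
    default split factor mentioned in the PR description.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional record each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when n_chunks > len(records)
            chunks.append(records[start:end])
        start = end
    return chunks


def pair_counts(records, n_chunks=4):
    """Return (total, peak) candidate-pair counts when chunks are processed one at a time.

    A chunk of m records yields m * (m - 1) // 2 unordered pairs, so the
    peak (largest single chunk) is what bounds memory use at any moment.
    """
    sizes = [len(chunk) for chunk in split_into_chunks(records, n_chunks)]
    per_chunk = [m * (m - 1) // 2 for m in sizes]
    return sum(per_chunk), max(per_chunk, default=0)


if __name__ == "__main__":
    records = list(range(1000))  # stand-in for 1000 patient records
    print(pair_counts(records, n_chunks=1))  # (499500, 499500)
    print(pair_counts(records, n_chunks=4))  # (124500, 31125)
```

With 1000 records, going from 1 chunk to 4 cuts the peak number of pairs held at once from 499500 to 31125, roughly a factor of 16. Note that in this toy version records in different chunks are never compared with each other, so a real partition would need to keep potential household members in the same chunk.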