This PR merges the feat/clean-data-pipeline branch into main. It improves the data pipeline, chiefly in how the custom data-bucketing functions are executed, and improves both the functionality and the performance of the main script.
Key Updates:
Custom Bucket Functions: Added and refactored key helper functions (range_match_lookup(), value_match_lookup(), join_columns(), create_race_eth_bucket()) to efficiently create custom buckets for continuous and categorical variables like age, income, race, and ethnicity.
Database Optimization: Improved handling of database operations by leveraging DuckDB’s in-memory computations.
Pipeline Performance: Enhanced performance by materializing intermediate steps in the pipeline and carefully managing joins and operations using compute(), while maintaining efficient memory use.
Improved Flexibility: The code is now better structured, supporting future enhancements and integration of additional lookup tables.
Hierarchical Categorization: Implemented hierarchical logic in the create_race_eth_bucket() function, ensuring that race and ethnicity categorization follows a specified order of precedence.
Updated Documentation: Added or updated code documentation and Roxygen comments to clarify each function's purpose and usage.
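
For reviewers unfamiliar with the bucketing approach, here is a minimal sketch of the two ideas described above: matching a continuous value against a range lookup table, and assigning a race/ethnicity bucket by order of precedence. It is not the code in this PR. The column names, bucket labels, lookup table, and precedence order are all assumptions made for illustration, and it runs on an in-memory tibble rather than through DuckDB.

```r
library(dplyr)

# Hypothetical input data; the column names are illustrative only.
people <- tibble(
  id       = 1:4,
  age      = c(17, 34, 52, 70),
  hispanic = c(TRUE, FALSE, FALSE, TRUE),
  race     = c("White", "Black", "Asian", "Black")
)

# Hypothetical lookup table: each row starts a range at `lower`,
# which runs up to (but excludes) the next row's `lower`.
age_lookup <- tibble(
  lower  = c(0, 18, 45, 65),
  bucket = c("0-17", "18-44", "45-64", "65+")
)

bucketed <- people %>%
  mutate(
    # Range match: findInterval() returns the index of the last
    # lower bound that does not exceed the age.
    age_bucket = age_lookup$bucket[findInterval(age, age_lookup$lower)],
    # Precedence: case_when() returns the first branch that matches,
    # so the order of the branches is the order of precedence.
    # This particular order is an assumption, not the PR's rule.
    race_eth_bucket = case_when(
      hispanic        ~ "Hispanic",
      race == "Black" ~ "Black, non-Hispanic",
      race == "White" ~ "White, non-Hispanic",
      TRUE            ~ "Other, non-Hispanic"
    )
  )

print(bucketed)
```

In this sketch the fourth row is bucketed as "Hispanic" even though its race is "Black", because the ethnicity branch is listed first. Reordering the branches changes the precedence without touching the rest of the pipeline.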