thinkingmachines / unicef-ai4d-poverty-mapping

UNICEF AI4D Relative Wealth Mapping Project - datasets, models, and scripts for building relative wealth estimation models across Southeast Asia (SEA)
https://thinkingmachines.github.io/unicef-ai4d-poverty-mapping
MIT License

Implementing DHS cross country data manager #109

Closed tm-jc-nacpil closed 1 year ago

tm-jc-nacpil commented 1 year ago

Hi @alronlam! I'd like to sanity-check an approach for returning the cross-country datasets. :D

Pseudocode

class DHSDataManager:
    def __init__(self):
        # Initialize storage for processed country datasets,
        # separated into household and cluster level
        self.household_data = {
            "ph": <dataframe>,
            "tl": <dataframe>,
            ...
        }
        self.cluster_data = {
            "ph": <dataframe>,
            "tl": <dataframe>,
            ...
        }

    def generate_dhs_cluster_level_data(self, <same parameters as before>, return_data=True):
        <same logic here>

        # Store the output gdf in cluster_data
        self.cluster_data[country_name] = gdf

        if return_data:
            return gdf

    def generate_dhs_household_level_data(self, <same parameters as before>, return_data=True):
        <same logic here>

        # Store the output df in household_data
        self.household_data[country_name] = df

        if return_data:
            return df

    def get_cluster_level_data_by_country(self, country_list):
        "Concatenate the cluster-level dataframes of the specified countries"
        for country in country_list:
            ...
        return concatenated_gdf

    def get_household_level_data_by_country(self, country_list):
        "Concatenate the household-level dataframes of the specified countries"
        for country in country_list:
            ...
        return concatenated_df

    def recalculate_index(self, country_list):
        "Recalculate the wealth index for the specified countries"
        return df

Ideal usage

# Get cluster-level data for four countries using the data manager,
# with the same parameters as the original generate_dhs_cluster_level_data() function.
# Each call stores its output in the data manager and optionally returns the corresponding dataframe.
dhs_data_manager = DHSDataManager()
dhs_ph_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<ph parameters>, return_data=True)
dhs_tl_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<tl parameters>, return_data=True)
dhs_mm_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<mm parameters>, return_data=True)
dhs_kh_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<kh parameters>, return_data=True)

# Once the data is loaded, the data manager can combine the datasets across countries
country_cluster_data = dhs_data_manager.get_cluster_level_data_by_country(<list of countries>)

# Similarly for household data
country_household_data = dhs_data_manager.get_household_level_data_by_country(<list of countries>)

# Run PCA and recalculate the index accordingly
index = PCA(country_household_data)
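
For reference, the store-then-concatenate pattern could look roughly like the sketch below. This is only an illustration, not the project's implementation: CountryDataStore, add_cluster_data, and the cluster_id and wealth_index columns are hypothetical names, and plain pandas dataframes stand in for the real cluster-level outputs.

```python
import pandas as pd


class CountryDataStore:
    """Hypothetical stand-in for the data manager's per-country storage."""

    def __init__(self):
        # Maps a country code such as "ph" to its processed dataframe.
        self.cluster_data = {}

    def add_cluster_data(self, country, df):
        # Tag each row with its country so rows stay traceable after concatenation.
        self.cluster_data[country] = df.assign(country=country)

    def get_cluster_level_data_by_country(self, country_list):
        # Fail early if a requested country has not been loaded yet.
        missing = [c for c in country_list if c not in self.cluster_data]
        if missing:
            raise KeyError(f"No cluster data loaded for: {missing}")
        frames = [self.cluster_data[c] for c in country_list]
        # ignore_index gives the combined frame a fresh 0..n-1 index.
        return pd.concat(frames, ignore_index=True)


# Toy data with made-up values, for illustration only.
store = CountryDataStore()
store.add_cluster_data("ph", pd.DataFrame({"cluster_id": [1, 2], "wealth_index": [0.5, -0.2]}))
store.add_cluster_data("tl", pd.DataFrame({"cluster_id": [1], "wealth_index": [1.1]}))
combined = store.get_cluster_level_data_by_country(["ph", "tl"])
print(combined)
```

Adding a country column before concatenating is a design choice in this sketch: cluster IDs are likely to repeat across countries, so the column keeps each row attributable to its source.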

alronlam commented 1 year ago

Sounds good! Just one thing: there seems to be a discrepancy in the index recalculation.

    def recalculate_index(self, country_list):
        "Recalculate the wealth index for the specified countries"
        return df

vs

# Run PCA and recalculate index accordingly
index = PCA(country_household_data)

I suggest the former (a recalculate_index function that takes a list of countries as a parameter) for convenience, since it's likely to be reused.
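
A rough sketch of what such a function might do, under several assumptions: it is written as a standalone function rather than a method, household_data is a dict of per-country dataframes, the has_tv and rooms columns are made up, and the index is taken as the first principal component of the standardized features (computed here with NumPy's SVD rather than any particular PCA class). The actual feature set and scaling used in the project may differ.

```python
import numpy as np
import pandas as pd


def recalculate_index(household_data, country_list, feature_cols):
    """Pool the listed countries and score households on the first principal component."""
    pooled = pd.concat(
        [household_data[c].assign(country=c) for c in country_list],
        ignore_index=True,
    )
    features = pooled[feature_cols].to_numpy(dtype=float)
    # Standardize each column so no single feature dominates the component.
    # Assumes every column varies; a constant column would divide by zero.
    standardized = (features - features.mean(axis=0)) / features.std(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    pooled["recalculated_index"] = standardized @ vt[0]
    return pooled


# Toy household data with made-up values, for illustration only.
household_data = {
    "ph": pd.DataFrame({"has_tv": [1, 0, 1], "rooms": [3, 1, 2]}),
    "tl": pd.DataFrame({"has_tv": [0, 1, 0], "rooms": [1, 4, 2]}),
}
result = recalculate_index(household_data, ["ph", "tl"], ["has_tv", "rooms"])
print(result)
```

Because the component is fit on the pooled rows, the resulting scores are comparable across the selected countries. The sign of a principal component is arbitrary, so a real implementation would need to orient it so that higher values mean wealthier households.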