Implementing DHS cross country data manager

Hi @alronlam ! I'd like to sanity check an approach for returning the cross-country datasets. :D

The main idea is to create a class, DHSDataManager, to manage the DHS data for multiple countries.
When we run generate_dhs_cluster_level_data / generate_dhs_household_level_data, the output is stored in the class, indexed by country name.
Stored this way, we can keep the usual dhs_data = DHSDataManager.generate_data() pattern while also keeping track of all the countries we've run it on.
Afterwards, since everything is stored in the class, we can add functions that return the concatenated cross-country datasets for a list of specified countries. The same functions can also apply any preprocessing the cross-country output needs.

Pseudocode

class DHSDataManager:
    def __init__(self):
        # Initialize storage for processed country datasets,
        # separated into household and cluster level and keyed by country
        self.household_data = {
            "ph": <dataframe>,
            "tl": <dataframe>,
            ...
        }
        self.cluster_data = {
            "ph": <dataframe>,
            "tl": <dataframe>,
            ...
        }

    # Note: `return` is a reserved word in Python, so the flag is named return_data
    def generate_dhs_cluster_level_data(self, <same parameters as before>, return_data=True):
        <same logic here>

        # Store the output gdf into cluster_data
        self.cluster_data[country_name] = gdf

        if return_data:
            return gdf

    def generate_dhs_household_level_data(self, <same parameters as before>, return_data=True):
        <same logic here>

        # Store the output df into household_data
        self.household_data[country_name] = df

        if return_data:
            return df

    def get_cluster_level_data_by_country(self, country_list):
        """Concatenate the cluster level dataframes of all specified countries"""
        for country in country_list:
            ...
        return concatenated_gdf

    def get_household_level_data_by_country(self, country_list):
        """Concatenate the household level dataframes of all specified countries"""
        for country in country_list:
            ...
        return concatenated_df

    def recalculate_index(self, country_list):
        """Recalculate the wealth index based on the specified countries"""
        ...
        return df
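
One detail the get_*_by_country functions will probably need is a check that every requested country has actually been generated before anything is concatenated. Below is a minimal sketch of that lookup step. The helper name select_country_data and the string stand-ins for the dataframes are hypothetical and only for illustration; they are not part of the existing code.

```python
from typing import Any, Dict, Iterable, List


def select_country_data(store: Dict[str, Any], country_list: Iterable[str]) -> List[Any]:
    """Return the stored datasets for country_list, in the order requested.

    Raises a KeyError that names every requested country with no stored data.
    """
    countries = list(country_list)
    missing = [country for country in countries if country not in store]
    if missing:
        raise KeyError(f"No data generated yet for: {', '.join(missing)}")
    return [store[country] for country in countries]


# Stand-in values (assumption): in the real manager these would be dataframes.
cluster_data = {"ph": "ph_gdf", "tl": "tl_gdf"}

selected = select_country_data(cluster_data, ["tl", "ph"])
```

Inside get_cluster_level_data_by_country, the returned list could then be handed to pandas.concat to build the combined dataframe. Failing early on a missing country seems safer than silently returning a partial cross-country dataset.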

Ideal usage

# Create the data manager once so that all countries are stored in the same instance
dhs_data_manager = DHSDataManager()

# Get cluster level data for four countries using the data manager
# Uses the same parameters as the original generate_dhs_cluster_level_data() function
# This stores the output in the data manager and, by default, also returns the corresponding dataframe
dhs_ph_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<ph parameters>)
dhs_tl_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<tl parameters>)
dhs_mm_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<mm parameters>)
dhs_kh_gdf = dhs_data_manager.generate_dhs_cluster_level_data(<kh parameters>)

# After loading the data into the data manager, it can take care of combining the datasets across countries
country_cluster_data = dhs_data_manager.get_cluster_level_data_by_country(<list of countries>)

# Similarly for household data
country_household_data = dhs_data_manager.get_household_level_data_by_country(<list of countries>)

# Run PCA and recalculate index accordingly
index = PCA(country_household_data)

thinkingmachines / unicef-ai4d-poverty-mapping

Implementing DHS cross country data manager #109
