
ISRM Health Calculations

A repository of scripts used for converting emissions to concentrations and health impacts using the ISRM for California.

Libby Koolik, UC Berkeley

Last modified July 11, 2023


Note: this version of the code is archival. The model has been renamed to ECHO-AIR and moved to a new home. For more information, please visit https://echo-air-model.github.io.


Table of Contents

Purpose and Goals
Methodology
Code Details
Running the Tool
Acknowledgments

Purpose and Goals

The Intervention Model for Air Pollution (InMAP) is a powerful first step towards lowering key technical barriers by making simplifying assumptions that allow for streamlined predictions of PM2.5 concentrations resulting from emissions-related policies or interventions.[*] InMAP performance has been validated against observational data and WRF-Chem, and has been used to perform source attribution and exposure disparity analyses.[*, *, *] The InMAP Source-Receptor Matrix (ISRM) was developed by running the full InMAP model tens of thousands of times to understand how a unit perturbation of emissions from each grid cell affects concentrations across the grid. However, both InMAP and the ISRM require considerable computational resources and mathematical proficiency to run, and an understanding of various atmospheric science principles to interpret. Furthermore, estimating health impacts requires additional knowledge and calculations beyond InMAP. Thus, a need arises for a standalone and user-friendly process for comparing air quality health disparities associated with various climate change policy scenarios.

The ultimate goal of this repository is to create a pipeline for estimating disparities in health impacts associated with incremental changes in emissions. Annual average PM2.5 concentrations are estimated using the InMAP Source-Receptor Matrix for California.


Methodology

The ISRM Health Calculation model runs two modules in series. First, the Concentration Module estimates the annual average change in PM2.5 concentrations. Second, the Health Module calculates the excess mortality resulting from that concentration change.

Concentration Module Methodology

The InMAP Source-Receptor Matrix (ISRM) links emissions sources to changes in receptor concentrations. There is a matrix layer for each of the five precursor species: primary PM2.5, ammonia (NH3), oxides of nitrogen (NOx), oxides of sulfur (SOx), and volatile organic compounds (VOC). By default, the tool uses the California ISRM. For each of these species, the California ISRM matrix dimensions are 3 elevations by 21,705 sources by 21,705 receptors. The three elevations of release height within the ISRM are:

The tool is capable of reading in a different ISRM, if specified by the user.

The units of each cell within the ISRM are micrograms per cubic meter per microgram per second ((µg/m³)/(µg/s)), or concentration per unit of emissions.

The concentration module has the following steps. Details about the code handling each step are described in the Code Details section below.

  1. Preprocessing: the tool will load the emissions shapefile and perform a series of formatting checks and adjustments. Any updates will be reported through the command line. Additionally, the ISRM layers will be imported as an object. The tool will also identify how many of the ISRM layers are required for concentration calculations.

For each layer triggered in the preprocessing step:

  1. Emissions Re-Allocation: the tool will re-grid emissions to the ISRM grid.
    1. The emissions shape and the ISRM shape are intersected.
    2. Emissions for the intersection object are allocated from the original emissions shape by the percent of the original emissions area that is contained within the intersection.
    3. Emissions are summed by ISRM grid cell.
    4. Note: for point source emissions, a small buffer is added to each point to allocate to ISRM grid cells.
  2. Matrix Multiplication: Once the emissions are re-gridded to the ISRM grid, they are multiplied by the ISRM grid level for the corresponding layer.

Once all layers are done:

  1. Sum all Concentrations: concentrations of PM2.5 are summed by ISRM grid cell.
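
Below is a minimal sketch of the re-allocation and matrix multiplication steps for a single layer. The column names ('EMISSIONS_UG/S', 'ISRM_ID') and the function name are hypothetical; the tool's actual implementation lives in emissions.py, isrm.py, and concentration_layer.py.

```python
import geopandas as gpd

def regrid_and_multiply(emissions_gdf, isrm_gdf, isrm_layer_matrix):
    """Area-weight emissions onto the ISRM grid, then apply the S-R matrix.

    emissions_gdf     : GeoDataFrame with an 'EMISSIONS_UG/S' column (ug/s)
    isrm_gdf          : GeoDataFrame of ISRM cells with an 'ISRM_ID' column
    isrm_layer_matrix : (sources x receptors) array in (ug/m^3)/(ug/s)
    """
    # Match coordinate systems and store each shape's original area.
    emis = emissions_gdf.to_crs(isrm_gdf.crs).copy()
    emis['orig_area'] = emis.geometry.area

    # Steps 1.1-1.2: intersect the grids and allocate emissions by the
    # fraction of each original shape's area inside the intersection.
    # (Point sources would first receive a small buffer so they intersect.)
    ix = gpd.overlay(emis, isrm_gdf, how='intersection')
    ix['EMIS_ALLOC'] = ix['EMISSIONS_UG/S'] * ix.geometry.area / ix['orig_area']

    # Step 1.3: sum allocated emissions by ISRM grid cell, in source order.
    emis_by_cell = (ix.groupby('ISRM_ID')['EMIS_ALLOC'].sum()
                      .reindex(isrm_gdf['ISRM_ID'], fill_value=0.0).values)

    # Step 2: emissions (ug/s) x S-R matrix -> ug/m^3 at each receptor.
    return emis_by_cell @ isrm_layer_matrix
```

The per-layer results from this step are what the final summation combines into total PM2.5 concentrations by grid cell.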

Health Module Methodology

The ISRM Tool's health module follows US EPA BenMAP CE methodology and CARB guidance.

Currently, the tool is only built out to use the Krewski et al. (2009) endpoint parameters and functions.(*) The Krewski function is as follows:

$$ \Delta M = \left( 1 - \frac{1}{\exp(\beta_{d} \times C_{i})} \right) \times I_{i,d,g} \times P_{i,g} $$

where $\beta$ is the endpoint parameter from Krewski et al. (2009), $d$ is the disease endpoint, $C$ is the concentration of PM2.5, $i$ is the grid cell, $I$ is the baseline incidence, $g$ is the group, and $P$ is the population estimate.
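
For illustration, consider a hypothetical grid cell with $C = 2$ µg/m³, a baseline incidence of $I = 0.008$, and a group population of $P = 10,000$. Using an all-cause $\beta$ of roughly 0.00583 (derived from the relative risk of approximately 1.06 per 10 µg/m³ reported in Krewski et al. (2009); the tool's exact parameter values may differ):

$$ \Delta M = \left( 1 - \frac{1}{\exp(0.00583 \times 2)} \right) \times 0.008 \times 10000 \approx 0.93 $$

or roughly 0.93 excess deaths for that group in that grid cell. The tool takes the following steps to estimate these health impacts: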

  1. Preprocessing: the tool will merge the population and incidence data based on geographic intersections using the health_data.py object type.

  2. Estimation by Endpoint: the tool will then calculate excess mortality by endpoint:

    1. The population-incidence data are spatially merged with the exposure concentrations estimated in the Concentration Module.
    2. For each row of the intersection, the excess mortality is estimated based on the function of choice (currently, only Krewski).
    3. Excess mortality is summed across age ranges by ISRM grid cell and racial/ethnic group.

Once all endpoints are done:

  1. Export and Visualize: excess mortality is exported as a shapefile and as a plot.

Other Features

The ISRM Tool has a command called check-setup that allows the user to confirm that all of the code and data files are properly saved and named, so that the program will run.
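
For example, assuming the tool is invoked directly through Python (the exact invocation may vary by installation):

```
python isrm_calcs.py --check-setup
```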


Code Details

Below is a brief table of contents for the Code Details section of the Readme:

Requirements
isrm_calcs.py
Supporting Code
Scripts

Requirements

The code is written in Python 3. The library requirements are included in this repository as requirements.txt. For completeness, they are reproduced here:

Python libraries can be installed by running pip install -r requirements.txt on a Linux/Mac command line.

isrm_calcs.py

The isrm_calcs.py script is the main script file that drives the tool. This script operates the command line functionality, defines the health impact calculation objects, calls each of the supporting functions, and outputs the desired files. The isrm_calcs.py script is not split into functions or objects; instead, it runs through two sections: (1) Initialization and (2) Run Program.

Initialization

In the initialization section of isrm_calcs.py, the parser object is created in order to interface with the command line. The parser object is created using the argparse library.

Currently, the only arguments accepted by the parser object are -i for input file, -h for help, and --check-setup to run a setup check.
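
A minimal sketch of how such a parser might be constructed is shown below; the long-form flag name and help text are assumptions (argparse supplies -h automatically).

```python
import argparse

# Hypothetical sketch of the parser described above; see isrm_calcs.py
# for the actual argument wiring.
parser = argparse.ArgumentParser(
    description='Estimates concentrations and health impacts from emissions using the ISRM.')
parser.add_argument('-i', '--input', help='filepath of the control file for the run')
parser.add_argument('--check-setup', action='store_true',
                    help='checks that code and data files are saved properly')
args = parser.parse_args()
```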

Once the parser is defined, the control file object is created using control_file.py class object. A number of metadata variables are defined from the control file.

Next, a number of internally saved data file paths are saved.

Finally, the output_region is defined based on the get_output_region function defined in tool_utils.py. The output region is then stored for use in later functions.

Run Program

The run program section of the code is split into two modes. If the CHECK_INPUTS flag is given, the tool will run in check mode, where it will check that each of the inputs is valid and then quit. If the CHECK_INPUTS flag is not given, the tool will run the full program.

It will start by creating a log file using the setup_logging function. Once the logging is set up, an output directory is created using the create_output_dir function from tool_utils.py. It will also create a shapefile subdirectory within the output directory using create_shape_out. The tool will also create an output_region geodataframe from user inputs for use in future steps.

Then, the tool will begin the concentration module. This starts by defining an emissions object and an isrm object using the emissions.py and isrm.py supporting class objects. The concentrations will be estimated using the concentration.py object, which relies on the concentration_layer.py object. The concentrations will then be output as a map of total exposure concentration and a shapefile with detailed exposure information.

Next, the tool will run environmental justice exposure calculations using the create_exposure_df, get_overall_disparity, and estimate_exposure_percentile functions from the environmental_justice_calcs.py file. The exposure percentiles will then be plotted and exported using the plot_percentile_exposure function. If the control file has indicated that exposure data should be output (using the 'OUTPUT_EXPOSURE' flag), a shapefile of exposure concentrations by population group will be output in the output directory.

Finally, if indicated by the user, the tool will begin the health module. It will create the health input object using the health_data.py library and then estimate the three endpoints of excess mortality using calculate_excess_mortality from the health_impact_calcs file. Each endpoint will then be mapped and exported using visualize_and_export_hia.

The tool utilizes parallel computing to increase efficiency and reduce runtime. As such, many of these steps do not happen exactly in the order presented above.

The run is complete when a box stating "Success! Run complete." appears on the screen.

Check Module

If enabled in the control file, the program will run in check mode, which will run a number of checks built into the emissions, isrm, and population objects. Once it runs all checking functions, it will quit and inform the user of the result.

Supporting Code

To streamline calculations and increase the functionality of the code, Python classes were created. These class definitions are saved in the supporting folder of the repository. The following sections outline how each of these classes works.

concentration_layer.py

The concentration_layer object runs ISRM-based calculations using a single vertical layer of the ISRM grid. The object takes as inputs an emissions object (from emissions.py), an ISRM object (from isrm.py), and the layer number corresponding to a vertical layer of the ISRM grid. It then estimates ground-level concentrations resulting from emissions released within that layer's height range.

Inputs

Attributes

Calculated Attributes

Simple Functions

concentration.py

The concentration object runs ISRM-based calculations for each of the vertical layers of the ISRM grid by processing individual concentration_layer objects. The object takes as inputs an emissions object (from emissions.py) and the ISRM object (from isrm.py). It then estimates total ground-level concentrations resulting from emissions.

Inputs

Attributes

Calculated Attributes

Internal Functions

External Functions

control_file.py

The control_file object is used to check and read the control file for a run:

Inputs

Attributes

Internal Functions

External Functions

emissions.py

The emissions object is built primarily on geopandas. It has the following attributes:

Inputs

Attributes

Calculated Attributes

Internal Functions

External Functions

health_data.py

The health_data object stores and manipulates built-in health data (population and incidence rates) from BenMAP. It takes a dictionary of filepaths and two Boolean run options (verbose and race_stratified), and returns dataframes of population, incidence, and combined population-incidence information (pop_inc).

Inputs

Calculated Attributes

Internal Functions

isrm.py

The isrm object loads, stores, and manipulates the ISRM grid data.

Inputs

Attributes

Calculated Attributes

Internal Functions

External Functions

population.py

The population object stores detailed Census tract-level population data for the environmental justice exposure calculations and the health impact calculations from an input population dataset.

Inputs

Attributes

Internal Functions

External Functions

Scripts

To streamline calculations and increase the functionality of the code, Python scripts were created for major calculations/operations. Scripts are saved in the scripts folder of the repository. The following sections outline the contents of each script file and how the functions inside them work.

environmental_justice_calcs.py

The environmental_justice_calcs script file contains a number of functions that help calculate exposure metrics for environmental justice analyses.

  1. create_exposure_df: creates a dataframe ready for exposure calculations
    1. Inputs:
      • conc: concentration object from concentration.py
      • isrm_pop_alloc: population object (from population.py) re-allocated to the ISRM grid cell geometry
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs
      • exposure_gdf: a geodataframe with the exposure concentrations and allocated population by racial group
    3. Methodology:
      1. Pulls the total concentration from the concentration object
      2. Grabs the population by racial/ethnic group from the population object
      3. Merges the concentration and population data based on the ISRM ID
      4. Adds the population-weighted mean exposure as a column of the geodataframe using add_pwm_col
  2. add_pwm_col: adds an intermediate column that multiplies population by exposure concentration
    1. Inputs:
      • exposure_gdf: a geodataframe with the exposure concentrations and allocated population by racial group
      • group: the racial/ethnic group name
    2. Outputs:
      • exposure_gdf: a geodataframe with the exposure concentrations and allocated population by racial group, now with PWM column
    3. Methodology:
      1. Creates a column called group+'_PWM'.
      2. Multiplies exposure concentration by group population
      3. Returns the new dataframe
    4. Important Notes:
      • The new column is not actually a population-weighted mean, it is just an intermediate for calculating PWM in the next step.
  3. get_pwm: estimates the population-weighted mean exposure for a given group (a sketch of this and the related exposure functions appears after this list)
    1. Inputs:
      • exposure_gdf: a geodataframe with the exposure concentrations and allocated population by racial group
      • group: the racial/ethnic group name
    2. Outputs:
      • PWM_group: the group-level population weighted mean exposure concentration (float)
    3. Methodology:
      1. Creates a variable for the group PWM column (as created in add_pwm_col).
      2. Estimates PWM by adding across the group_PWM column and dividing by the total group population
  4. get_overall_disparity: returns a table of overall disparity metrics by racial/ethnic group
    1. Inputs:
      • exposure_gdf: a geodataframe with the exposure concentrations and allocated population by racial group
    2. Outputs:
      • pwm_df: a dataframe containing the PWM, absolute disparity, and relative disparity of each group
    3. Methodology:
      1. Creates an empty dataframe with the groups as rows
      2. Estimates the group population-weighted mean using the get_pwm function
      3. Estimates the absolute disparity as Group_PWM - Total_PWM
      4. Estimates the relative disparity as the Absolute Disparity/Total_PWM
  5. estimate_exposure_percentile: creates a dataframe of exposure percentiles for plotting
    1. Inputs:
      • exposure_gdf: a geodataframe with the exposure concentrations and allocated population by racial group
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs:
      • df_pctl: a dataframe of exposure concentrations by percentile of population exposed by group
    3. Methodology:
      1. Creates a copy of the exposure_gdf dataframe to prevent writing over the original.
      2. Sorts the dataframe by PM2.5 concentration and resets the index.
      3. Iterates through each racial/ethnic group, performing the following:
        1. Creates a small slice of the dataframe that is only the exposure concentration and the group.
        2. Estimates the cumulative sum of population in the sorted dataframe.
        3. Estimates the total population of the group.
        4. Estimates percentile as the population in the grid cell divided by the total population of the group.
        5. Adds the percentile column into the main dataframe.
  6. run_exposure_calcs: calls the other exposure justice functions in order
    1. Inputs:
      • conc: concentration object from concentration.py
      • isrm_pop_alloc: population object (from population.py) re-allocated to the ISRM grid cell geometry
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs:
      • exposure_gdf: a dataframe containing the exposure concentrations and population estimates for each group
      • exposure_pctl: a dataframe of exposure concentrations by percentile of population exposed by group
      • exposure_disparity: a dataframe containing the PWM, absolute disparity, and relative disparity of each group
    3. Methodology:
      1. Calls the create_exposure_df function.
      2. Calls the get_overall_disparity function.
      3. Calls the estimate_exposure_percentile function.
  7. export_exposure_gdf: exports the exposure concentrations and population estimates as a shapefile
    1. Inputs:
      • exposure_gdf: a dataframe containing the exposure concentrations and population estimates for each group
      • shape_out: a filepath string of the location of the shapefile output directory
      • f_out: the name of the file output category (will append additional information)
    2. Outputs:
      • A shapefile will be output into the shape_out directory.
      • The function returns fname as a surrogate for completion (otherwise irrelevant)
    3. Methodology:
      1. Creates a filename and path for the export.
      2. Updates the columns slightly for shapefile naming
      3. Exports the shapefile.
  8. export_exposure_csv: exports the exposure concentrations and population estimates as a CSV file
    1. Inputs:
      • exposure_gdf: a dataframe containing the exposure concentrations and population estimates for each group
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
    2. Outputs:
      • A CSV file will be output into the output_dir.
      • The function returns fname as a surrogate for completion (otherwise irrelevant)
    3. Methodology:
      1. Creates a filename and path for the export.
      2. Updates the column names for more straightforward interpretation
      3. Exports the results as a comma-separated value (CSV) file.
  9. export_exposure_disparity: exports the exposure disparity metrics as a CSV file
    1. Inputs:
      • exposure_disparity: a dataframe containing the population-weighted mean exposure concentrations for each group
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
    2. Outputs:
      • A CSV file will be output into the output_dir.
      • The function returns fname as a surrogate for completion (otherwise irrelevant)
    3. Methodology:
      1. Creates a filename and path for the export.
      2. Updates the columns and values slightly for more straightforward interpretation
      3. Exports the results as a comma-separated value (CSV) file.
  10. plot_percentile_exposure: creates a plot of exposure concentration by percentile of each group's population
    1. Inputs:
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • exposure_pctl: a dataframe of exposure concentrations by percentile of population exposed by group
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs:
      • The function does not return anything, but a lineplot image (PNG) will be output into the output_dir.
    3. Methodology:
      1. Creates a melted (un-pivoted) version of the percentiles dataframe.
      2. Multiplies the percentile by 100 to span 0-100 instead of 0-1.
      3. Maps the racial/ethnic group names to better formatted names (e.g., "HISLA" --> "Hispanic/Latino")
      4. Draws the figure using the seaborn library's lineplot function.
      5. Saves the file as f_out + '_PM25_Exposure_Percentiles.png' into the output_dir.
  11. export_exposure: calls each of the exposure output functions in parallel
    1. Inputs:
      • exposure_gdf: a dataframe containing the exposure concentrations and population estimates for each group
      • exposure_disparity: a dataframe containing the population-weighted mean exposure concentrations for each group
      • exposure_pctl: a dataframe of exposure concentrations by percentile of population exposed by group
      • shape_out: a filepath string of the location of the shapefile output directory
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs:
      • The function does not return anything, but shapefiles, CSV files, and a percentile plot will be output into the output directories.
    3. Methodology:
      1. Calls export_exposure_gdf, export_exposure_csv, export_exposure_disparity, and plot_percentile_exposure in parallel.
  12. create_rename_dict: makes a global rename code dictionary for easier updating
    1. Inputs: None
    2. Outputs:
      • rename_dict: a dictionary that maps group codes to formatted names (e.g., 'HISLA' to 'Hispanic/Latino')
    3. Methodology:
      1. Defines a dictionary and returns it.
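
Below is a hedged sketch of the core exposure math described above (add_pwm_col/get_pwm, get_overall_disparity, estimate_exposure_percentile, and plot_percentile_exposure). Column names such as 'TOTAL_CONC_UG/M3' and 'TOTAL' are assumptions; the actual implementations live in environmental_justice_calcs.py.

```python
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

CONC = 'TOTAL_CONC_UG/M3'  # assumed name of the exposure concentration column

def get_pwm(exposure_gdf, group):
    """Population-weighted mean exposure for one group (add_pwm_col + get_pwm)."""
    pwm_col = exposure_gdf[CONC] * exposure_gdf[group]  # intermediate, not a mean yet
    return pwm_col.sum() / exposure_gdf[group].sum()

def get_overall_disparity(exposure_gdf, groups):
    """Table of PWM, absolute disparity, and relative disparity by group."""
    pwm_df = pd.DataFrame(index=groups)
    pwm_df['PWM'] = [get_pwm(exposure_gdf, g) for g in groups]
    total_pwm = get_pwm(exposure_gdf, 'TOTAL')  # assumes a total-population column
    pwm_df['Absolute Disparity'] = pwm_df['PWM'] - total_pwm
    pwm_df['Relative Disparity'] = pwm_df['Absolute Disparity'] / total_pwm
    return pwm_df

def estimate_exposure_percentile(exposure_gdf, groups):
    """Exposure concentration by percentile of each group's population exposed."""
    df = exposure_gdf.copy()                         # avoid writing over the original
    df = df.sort_values(CONC).reset_index(drop=True) # sort by PM2.5 concentration
    df_pctl = df[[CONC]].copy()
    for group in groups:
        # Cumulative population divided by total group population -> percentile.
        df_pctl[group] = df[group].cumsum() / df[group].sum()
    return df_pctl

def plot_percentile_exposure(output_dir, f_out, exposure_pctl):
    """Lineplot of exposure concentration by percentile of population exposed."""
    melted = exposure_pctl.melt(id_vars=CONC, var_name='Group',
                                value_name='Percentile')
    melted['Percentile'] *= 100                      # span 0-100 instead of 0-1
    fig, ax = plt.subplots()
    sns.lineplot(data=melted, x='Percentile', y=CONC, hue='Group', ax=ax)
    fig.savefig(os.path.join(output_dir, f_out + '_PM25_Exposure_Percentiles.png'))
```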

health_impact_calcs.py

The health_impact_calcs script file contains a number of functions that help calculate health impacts from exposure concentrations.

  1. create_hia_inputs: creates the hia_inputs object.

    1. Inputs:
      • pop: population object input
      • load_file: a Boolean telling the program whether or not to load the data from a file
      • verbose: a Boolean telling the program to return additional log statements or not
      • geodata: the geographic data from the ISRM
      • incidence_fp: a string containing the filepath where the incidence data is stored
    2. Outputs:
      • a health data object ready for health calculations
    3. Methodology
      1. Allocates population to the ISRM grid using the population object and the ISRM geodata.
      2. Initializes a health_data object from that allocated population.
  2. krewski: defines a Python function around the Krewski et al. (2009) function and endpoints

    1. Inputs:
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
      • conc: a float with the exposure concentration for a given geography
      • inc: a float with the background incidence for a given group in a given geography
      • pop: a float with the population estimate for a given group in a given geography
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
    2. Outputs
      • a float estimating the number of excess mortalities for the endpoint across the group in a given geography
    3. Methodology:
      1. Based on the endpoint, grabs a beta parameter from Krewski et al. (2009).
      2. Estimates excess mortality using the following equation, where $\beta$ is the endpoint parameter from Krewski et al. (2009), $d$ is the disease endpoint, $C$ is the concentration of PM2.5, $i$ is the grid cell, $I$ is the baseline incidence, $g$ is the group, and $P$ is the population estimate.

$$ \Delta M = \left( 1 - \frac{1}{\exp(\beta_{d} \times C_{i})} \right) \times I_{i,d,g} \times P_{i,g} $$
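
A minimal Python sketch of this function follows. The beta values shown are derived from the relative risks reported in Krewski et al. (2009) (approximately 1.06 for all cause, 1.24 for ischemic heart disease, and 1.14 for lung cancer, each per 10 µg/m³); the exact parameters used by the tool may differ.

```python
import numpy as np

# Hedged sketch of the krewski function; betas are back-calculated from
# the published relative risks and may not match the tool exactly.
BETAS = {'ALL CAUSE': np.log(1.06) / 10,
         'ISCHEMIC HEART DISEASE': np.log(1.24) / 10,
         'LUNG CANCER': np.log(1.14) / 10}

def krewski(verbose, conc, inc, pop, endpoint):
    """Excess mortality for one grid cell and group for a given endpoint."""
    beta = BETAS[endpoint]
    return (1.0 - 1.0 / np.exp(beta * conc)) * inc * pop

# calculate_excess_mortality (below) applies this row-by-row to the merged
# population-incidence-concentration dataframe, e.g. (column names assumed):
# pop_inc_conc['EXCESS_MORT'] = pop_inc_conc.apply(
#     lambda row: krewski(False, row['CONC'], row['INC'], row['POP'], endpoint),
#     axis=1)
```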

  1. create_logging_code: makes a global logging code for easier updating

    1. Inputs: None
    2. Outputs:
      • logging_code: a dictionary that maps endpoint names to log statement codes
    3. Methodology:
      1. Defines a dictionary and returns it.
  2. calculate_excess_mortality: estimates excess mortality for a given endpoint and function

    1. Inputs:
      • conc: a float with the exposure concentration for a given geography
      • health_data_obj: a health_data object as defined in the health_data.py supporting script
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
      • function: the health impact function of choice (currently only krewski is built out)
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs
      • pop_inc_conc: a dataframe containing excess mortality for the endpoint using the function provided
    3. Methodology:
      1. Creates clean, simplified copies of the detailed_conc attribute of the conc object and the pop_inc attribute of the health_data_obj.
      2. Merges these two dataframes on the ISRM_ID field.
      3. Estimates excess mortality on a row-by-row basis using the function.
      4. Pivots the dataframe to get the individual races as columns.
      5. Adds the geometry back in to make it geodata.
      6. Updates the column names such that the excess mortality columns are ENDPOINT_GROUP.
      7. Merges the population back into the dataframe.
      8. Cleans up the dataframe.
  3. plot_total_mortality: creates a map image (PNG) of the excess mortality associated with an endpoint for a given group.

    1. Inputs:
      • hia_df: a dataframe containing excess mortality for the endpoint using the function provided
      • ca_shp_fp: a filepath string of the California state boundary shapefile
      • group: the racial/ethnic group name
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs
      • fname: a string filename made by combining the f_out with the group and endpoint.
    3. Methodology:
      1. Sets a few formatting standards within seaborn and matplotlib.pyplot.
      2. Creates the output file directory and name string using f_out, group, and endpoint.
      3. Reads in the California boundary and projects the hia_df to match the coordinate reference system of the California dataset.
      4. Clips the dataframe to the California boundary.
      5. Adds area-normalized columns to the hia_df for more intuitive plotting.
      6. Grabs the minimums and sets them to $10^{-9}$ in order to avoid logarithm conversion errors.
      7. Updates the 'MORT_OVER_POP' column to avoid 100% mortality that arises from the update in step 6.
      8. Initializes the figure and plots four panes:
        1. Population density: plots the area-normalized population estimates for the group on a log-normal scale.
        2. PM2.5 exposure concentrations: plots the exposure concentration on a log-normal scale.
        3. Excess mortality per area: plots the excess mortality per unit area on a log-normal scale.
        4. Excess mortality per population: plots the excess mortality per population for the group on a log-normal scale.
      9. Performs a bit of clean-up and formatting before exporting.
  4. export_health_impacts: exports mortality as a shapefile

    1. Inputs:
      • hia_df: a dataframe containing excess mortality for the endpoint using the function provided
      • group: the racial/ethnic group name
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs
      • fname: a string filename made by combining the f_out with the group and endpoint.
    3. Methodology:
      1. Creates the output file path (fname) using inputs.
      2. Creates endpoint short labels and updates column names since shapefiles can only have ten characters in column names.
      3. Exports the geodataframe to shapefile.
  5. export_health_impacts_csv: exports mortality as a CSV file

    1. Inputs:
      • hia_df: a dataframe containing excess mortality for the endpoint using the function provided
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs
      • fname: a string filename made by combining the f_out with the group and endpoint.
    3. Methodology:
      1. Creates the output file path (fname) using inputs.
      2. Revises column names for clarity
      3. Exports the geodataframe to CSV.
  6. create_summary_hia: creates a summary table of health impacts by racial/ethnic group

    1. Inputs:
      • hia_df: a dataframe containing excess mortality for the endpoint using the function provided
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
      • l: an intermediate string that has the endpoint label string (e.g., ACM_)
      • endpoint_nice: an intermediate string that has a nicely formatted version of the endpoint (e.g., All Cause)
    2. Outputs
      • hia_summary: a summary dataframe containing population, excess mortality, and excess mortality rate per demographic group
    3. Methodology:
      1. Cleans up the hia_df by changing column names and splitting population and mortality
      2. Gets total population and mortality by group
      3. Combines into one dataframe and cleans it up for export
  7. visualize_and_export_hia: calls plot_total_mortality and export_health_impacts in one clean function call.

    1. Inputs:
      • hia_df: a dataframe containing excess mortality for the endpoint using the function provided
      • ca_shp_fp: a filepath string of the California state boundary shapefile
      • group: the racial/ethnic group name
      • endpoint: a string containing either 'ALL CAUSE', 'ISCHEMIC HEART DISEASE', or 'LUNG CANCER'
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • shape_out: a filepath string for shapefiles
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs
      • hia_summary: a summary dataframe containing population, excess mortality, and excess mortality rate per demographic group
    3. Methodology:
      1. Calls plot_total_mortality.
      2. Calls export_health_impacts.
  8. combine_hia_summaries: combines the three endpoint summary tables into one export file

    1. Inputs:
      • acm_summary: a summary dataframe containing population, excess all-cause mortality, and all-cause mortality rates
      • ihd_summary: a summary dataframe containing population, excess IHD mortality, and IHD mortality rates
      • lcm_summary: a summary dataframe containing population, excess lung cancer mortality, and lung cancer mortality rates
      • output_dir: a filepath string of the location of the output directory
      • f_out: the name of the file output category (will append additional information)
      • verbose: a Boolean indicating whether or not detailed logging statements should be printed
    2. Outputs: None
    3. Methodology:
      1. Merges the summary dataframes together
      2. Removes excess columns
      3. Saves as CSV file
  9. create_rename_dict: makes a global rename code dictionary for easier updating

    1. Inputs: None
    2. Outputs:
      • rename_dict: a dictionary that maps endpoint names to shortened column labels for export
    3. Methodology:
      1. Defines a dictionary and returns it.

tool_utils.py

The tool_utils library contains a handful of scripts that are useful for code execution.

  1. check_setup: checks that the isrm_health_calculations local clone is set up properly

    1. Inputs: None
    2. Outputs:
      • valid_setup: a Boolean indicating if the setup is valid or not
    3. Methodology:
      1. Gets the program's current working directory.
      2. Checks that all the script and supporting files exist where they are supposed to.
      3. Checks that all key data files are saved where they should be (not including the ISRM).
      4. Checks that the CA_ISRM is located in the data folder with all necessary objects, but does not treat a missing CA_ISRM as an improper setup, since the user may have their own ISRM.
      5. Reports any missing files or directories.
  2. setup_logging: sets up the log file capability using the logging library

    1. Inputs:
      • debug_mode: a Boolean indicating if log statements should be returned in debug mode or not
    2. Outputs:
      • tmp_logger: a filepath string associated with a temporary log file that will be moved as soon as the output directory is created
    3. Methodology:
      1. Defines useful variables for the logging library.
      2. Creates a temporary log file path (tmp_logger) that allows the file to be created before the output directory.
      3. Suppresses all other library warnings and information.
      4. Sets the formatting system for log statements.
  3. verboseprint: sets up the verbose printing mechanism for global usage

    1. Inputs:
      • verbose: a Boolean indicating if it is in verbose mode or not
      • text: a string to be returned if the program is in verbose mode
    2. Outputs: None
    3. Methodology:
      1. Checks if verbose is True.
      2. If True, creates a log statement.
      3. If False, does nothing.
  4. report_version: reports the current working version of the tool

    1. Inputs: None
    2. Outputs: None
    3. Methodology: adds statements to the log file about the tool version
  5. create_output_dir: creates the output directory for saving files (see the sketch after this list)

    1. Inputs:
      • batch: the batch name
      • name: the run name
    2. Outputs:
      • output_dir: a filepath string for the output directory
      • f_out: a string containing the filename pattern to be used in output files
    3. Methodology:
      1. Grabs the current working directory of the tool and defines 'outputs' as the sub-directory to use.
      2. Checks to see if the directory already exists. If it does exist, automatically increments by 1 to create a unique directory.
      3. Creates f_out by removing the 'out' prefix from the output_dir name.
      4. Creates the output directory.
  6. create_shape_out: creates the output directory for saving shapefiles

    1. Inputs:
      • output_dir: a filepath string for the output directory
    2. Outputs:
      • shape_out: a filepath string for the shapefile output directory
    3. Methodology:
      1. Creates a directory within the output_dir called 'shapes'.
      2. Stores this name as shape_out.
  7. get_output_region: creates the output region geodataframe

    1. Inputs:
      • region_of_interest: the name of the region to be contained in the output_region
      • region_category: a string containing the region category for the output region, which must be one of 'AB', 'AD', or 'C' for Air Basins, Air Districts, and Counties
      • output_geometry_fps: a dictionary containing a mapping between region_category and the filepaths
      • ca_fps: a filepath string containing the link to the California border shapefile
    2. Outputs
      • output_region: a geodataframe containing only the region of interest
    3. Methodology:
      1. Checks if the region_of_interest is California, in which case, it just reads in the California shapefile.
      2. If California is not the region_of_interest:
        1. Gets the filepath of the output region based on the region_category from the output_geometry_fps dictionary.
        2. Reads in the file as a geodataframe.
        3. Clips the geodataframe to the region_of_interest.
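
Below is a hedged sketch of three of these utilities (verboseprint, create_output_dir, and get_output_region), under assumed naming conventions; the actual implementations live in tool_utils.py.

```python
import logging
import os
import geopandas as gpd

def verboseprint(verbose, text):
    """Logs the statement only when the tool is running in verbose mode."""
    if verbose:
        logging.info(text)

def create_output_dir(batch, name):
    """Creates a unique output directory; the naming pattern is an assumption."""
    base = os.path.join(os.getcwd(), 'outputs')
    n = 0
    output_dir = os.path.join(base, 'out_{}_{}_{}'.format(batch, name, n))
    while os.path.exists(output_dir):        # increment until the name is unique
        n += 1
        output_dir = os.path.join(base, 'out_{}_{}_{}'.format(batch, name, n))
    os.makedirs(output_dir)
    f_out = os.path.basename(output_dir).replace('out_', '', 1)  # strip the 'out'
    return output_dir, f_out

def get_output_region(region_of_interest, region_category,
                      output_geometry_fps, ca_fps):
    """Returns a geodataframe containing only the region of interest."""
    if region_of_interest == 'California':
        return gpd.read_file(ca_fps)
    region_gdf = gpd.read_file(output_geometry_fps[region_category])
    # 'NAME' is an assumed column; the tool clips to the region of interest.
    return region_gdf[region_gdf['NAME'] == region_of_interest]
```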

Running the Tool

The tool is configured to be run on a Mac or via a Linux terminal (including Windows Subsystem for Linux), either locally or on Google Cloud. Instructions for each of these are available in the ECHO-AIR documentation (https://echo-air-model.github.io).


Acknowledgments

In alphabetical order, the following people are acknowledged for their support and contributions: