usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Load RSEI industrial pollution data to make available in the tool database #1070

Closed lucasmbrown-usds closed 2 years ago

lucasmbrown-usds commented 2 years ago

Description: RSEI's data is available crosswalked to census tracts:

RSEI Geographic Microdata provides results at the grid cell level. Crosswalks to translate data from the RSEI grid cell system to U.S. census block geographies are provided so that users can combine RSEI results with other data and aggregate the results to any census geography, like block, block group, or tract.

https://www.epa.gov/rsei/rsei-grid-and-locational-information

This could be a valuable resource for information on pollution sources. It is updated on a yearly basis with a deep historical record, and because the data collection is legislatively mandated, it's very comprehensive.
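The crosswalk-based aggregation described above can be sketched with pandas. The frame layouts and column names below (cell IDs, block FIPS codes, apportionment weights) are illustrative assumptions, not the actual EPA file schema:

```python
import pandas as pd

# Hypothetical crosswalk: what fraction of each RSEI grid cell falls
# in each census block (columns are illustrative, not the EPA schema).
crosswalk = pd.DataFrame({
    "cell_id": [1, 1, 2],
    "block_fips": ["360470232001000", "360470232002000", "360470232001000"],
    "weight": [0.6, 0.4, 1.0],
})

# Hypothetical grid-cell results, e.g. a toxicity-weighted concentration.
cells = pd.DataFrame({"cell_id": [1, 2], "toxconc": [10.0, 5.0]})

# Apportion cell values to blocks, then roll blocks up to tracts
# (a tract ID is the first 11 digits of the 15-digit block FIPS).
merged = crosswalk.merge(cells, on="cell_id")
merged["weighted"] = merged["weight"] * merged["toxconc"]
merged["tract_fips"] = merged["block_fips"].str[:11]
tract_totals = merged.groupby("tract_fips")["weighted"].sum()
```

This is the general pattern for aggregating RSEI grid results to any census geography: merge on cell ID, multiply by the crosswalk weight, then group by the target geography's FIPS prefix.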

saran-ahluwalia commented 2 years ago
  1. Data dictionary
  2. Methodology documentation
  3. Selecting data sources

Sample instructions from the documentation:

For example, if you filter your data by selecting "325 Chemicals" and "326 Plastics and Rubber" from the Industry Sector field, the subset of data will be any records where "325 Chemicals" or "326 Plastics and Rubber" is the value in the Industry Sector field - "325 Chemicals" and "326 Plastics and Rubber" are both in the selected state. All other values in the Industry Sector field are alternative states. Within "325 Chemicals" and "326 Plastics and Rubber" are many subsectors that are denoted as possible states because, though not selected, they are possible given the current selection state.

This raises several questions for further investigation and clarification:

  1. What question are we trying to answer with this dataset? For example, which sectors and/or facilities contribute the most of "X" type of particulate or carcinogen?
  2. It's not immediately clear how to access the datasets programmatically without going through the reporting tool (see item 3 above).
  3. Do we want modeled data? This includes identifiers, locational data, water modeling data, air modeling data, and other information about the facility.
lucasmbrown-usds commented 2 years ago

Looks like there's a single aggregate score they produce: https://www.epa.gov/rsei/understanding-rsei-results#what.

saran-ahluwalia commented 2 years ago

Hi Lucas,

As we discussed, here are the three questions I raised as they relate to the RSEI Score, plus some additional notes that address item (2) above:

  1. RSEI Scores do not describe a level of risk (such as the number of excess lung cancer cases). RSEI is not a risk assessment; rather, it indicates areas for further analysis, which is why I raised question (1) above.

  2. From the documents: RSEI Scores are designed to be compared to each other. A RSEI Score 10 times higher than another RSEI Score suggests that the potential for risk is 10 times higher. Relatively small chemical releases may lead to high RSEI Scores if the toxicity weight is particularly high or if the modeled exposed population is large. Conversely, large chemical releases may lead to low RSEI Scores if the toxicity weight is low or if the modeled exposed population is small. High RSEI Scores identify areas that may warrant further investigation, and do not conclusively demonstrate sources of risk. A facility with a high RSEI Score may not be the primary source of chronic human health risk in an area. RSEI Scores point to chemical releases that may warrant further investigation.

  3. There seems to be an FTP server at "newftp.epa.gov" that requires authentication for the census tract microdata.

saran-ahluwalia commented 2 years ago

This provides details on how RSEI should be used: https://www.epa.gov/rsei/how-rsei-should-be-used.

saran-ahluwalia commented 2 years ago

Below is a summary of basic exploration of reporting year 2019. According to the associate I was in contact with, they have not produced census tract files for any year but 2019.

According to the associate, in the future they will have 2020 data, which will include all years of data by census tract up to 2020. They will host this data on another production server - the dataset is too large for the FTP host. Generally, the way to access the data is to contact them, and they will provide download links.

import pandas as pd
import os
import ftplib

ftp = ftplib.FTP('newftp.epa.gov')
ftp.login()
'230 User logged in, proceed.'

Summary of findings from searching FTP server

In order to ensure alignment with our unit of measurement, we seek two distinct sources:

Across many of the tables it appears the foreign key is the DCN (the unique identifier assigned by TRI to each facility submission - otherwise known as the document control number).

The source documentation with the associated tables is found here, page 20.

def ftp_as_zip(file_name, file_path, host, data_location='.'):
    '''Generic function for accessing the FTP server and saving a designated
    file from the host server to the local machine.
    '''
    print("Connecting to: ", host)
    ftp = ftplib.FTP(host)
    ftp.login()
    ftp.cwd(file_path)
    print("Data will be persisted to: " + data_location)
    if not os.path.exists(data_location):
        os.makedirs(data_location)
    print("writing file: " + file_name)
    with open(os.path.join(data_location, file_name), "wb") as f_handler:
        # the second argument to retrbinary is a callback invoked with
        # each block of received data - here, the file handle's write method
        ftp.retrbinary('RETR %s' % file_name, f_handler.write)
    ftp.quit()
# Example Usage - Note this is just an example and does 
# not represent the current files we seek
file_name = 'poly_gc14_conus_810m_top.dbf' #downloading from EPA public FTP site
file_path='RSEI/Shapefiles/810m_Standard_Grid_Shapefiles/' # path on host server
host = 'newftp.epa.gov' # host
ftp_as_zip(file_name, file_path, host)
Connecting to:  newftp.epa.gov
Data will be persisted to: .
writing file: poly_gc14_conus_810m_top.dbf
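For robustness, the same download can be wrapped so that the connection is released even when the transfer fails. `ftplib.FTP` supports the context-manager protocol in Python 3; the function below is a sketch along those lines, using the same hypothetical file/path arguments as the example above:

```python
import ftplib
import os

def fetch_ftp_file(file_name, file_path, host, data_location="."):
    """Download one file over anonymous FTP; the with-blocks close the
    connection and the local file even if the transfer raises (sketch)."""
    os.makedirs(data_location, exist_ok=True)
    target = os.path.join(data_location, file_name)
    # ftplib.FTP.__exit__ sends QUIT (or closes the socket) on the way out
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login, as in the session above
        ftp.cwd(file_path)
        with open(target, "wb") as f_handler:
            # retrbinary invokes the callback once per received block
            ftp.retrbinary("RETR " + file_name, f_handler.write)
    return target
```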
We would want the 2019 version of the release, correct? This appears to be the latest version of the release:
 ftp.cwd("RSEI/Version239_RY2019/")
'250 Directory changed to /RSEI/Version239_RY2019'

4 directories

files = ftp.nlst()
len(files)
4

Each is a directory; let's take a look at each.

files
['Aggregated_Grid_Cell_Data',
 'Disaggregated_Microdata',
 'Public_Release_Data',
 'RSEI_Queries']

Below is an example of what is available

Example shapefiles

RSEI crosswalks and shapefiles: RSEI crosswalks allow you to transpose data from the RSEI grid to Census blocks, and shapefiles provide grid geometry for mapping microdata.

import geopandas
import pandas as pd
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/geopandas/_compat.py:110: UserWarning: The Shapely GEOS version (3.9.1-CAPI-1.14.2) is incompatible with the GEOS version PyGEOS was compiled with (3.10.1-CAPI-1.16.0). Conversions between both will be slow.
  shapely_geos_version, geos_capi_version_string
gdf = geopandas.read_file("poly_gc54_guam_810m.shp")  
gdf.head()
CELLX CELLY CLAT CLONG CX CY geometry
0 -1400 1814 13.181812 144.532682 -1134000 1469340 POLYGON ((144.52879 13.18531, 144.53626 13.185...
1 -1399 1814 13.182124 144.540150 -1133190 1469340 POLYGON ((144.53626 13.18562, 144.54372 13.185...
2 -1398 1814 13.182436 144.547618 -1132380 1469340 POLYGON ((144.54372 13.18594, 144.55119 13.186...
3 -1397 1814 13.182747 144.555085 -1131570 1469340 POLYGON ((144.55119 13.18625, 144.55866 13.186...
4 -1396 1814 13.183058 144.562553 -1130760 1469340 POLYGON ((144.55866 13.18656, 144.56613 13.186...
DBF details (source)

A DBF file is a standard database file used by dBASE, a database management system application. It organizes data into multiple records with fields stored in an array data type. DBF files are also compatible with other "xBase" database programs, which arose because of the file format's popularity.

import dbf

table = dbf.Table('poly_gc14_conus_810m_top.dbf')
# these do not reflect the level of geographic measurement
table.field_names
FieldnameList(['CELLX', 'CELLY', 'CLAT', 'CLONG', 'CX', 'CY'])

Aggregate - this is what I thought we sought (page 20 of this reference).

However, it is in fact this:

import pandas as pd
# hmmmm, okay, so no header - 
# no problem, let's see if we just use the file handler from gzip (further down)
data = pd.read_csv('/Users/sarahluw/Downloads/aggmicro2019_2019_gc14.csv.gz', 
                   nrows=10, compression='gzip', header=1,
                   error_bad_lines=False)
data.head()
   1745  311  1   7  4  6.03996E-003  0.00000E+000  0.00000E+000.1  \
0  1746  311  1   7  4      0.006368           0.0             0.0   
1  1747  311  1   7  4      0.006731           0.0             0.0   
2  1748  311  1   7  4      0.007100           0.0             0.0   
3  1749  311  1   7  4      0.007324           0.0             0.0   
4  1750  311  1   7  4      0.007367           0.0             0.0   
5  1751  311  1   7  4      0.007427           0.0             0.0   
6  1752  311  1   7  4      0.007650           0.0             0.0   
7  1753  311  1   7  4      0.007895           0.0             0.0   
8  1754  311  2  11  6      0.020016           0.0             0.0   
9  1755  311  2  11  6      0.021184           0.0             0.0   

   0.00000E+000.2  0.00000E+000.3  
0             0.0             0.0  
1             0.0             0.0  
2             0.0             0.0  
3             0.0             0.0  
4             0.0             0.0  
5             0.0             0.0  
6             0.0             0.0  
7             0.0             0.0  
8             0.0             0.0  
9             0.0             0.0  
import gzip

with gzip.open('/Users/sarahluw/Downloads/aggmicro2019_2019_gc14.csv.gz', 'rb') as f:
    data = pd.read_csv(f)
# easier to read
data.head()
1744 311 1 7 4 5.71468E-003 0.00000E+000 0.00000E+000.1 0.00000E+000.2 0.00000E+000.3
0 1745 311 1 7 4 0.006040 0.0 0.0 0.0 0.0
1 1746 311 1 7 4 0.006368 0.0 0.0 0.0 0.0
2 1747 311 1 7 4 0.006731 0.0 0.0 0.0 0.0
3 1748 311 1 7 4 0.007100 0.0 0.0 0.0 0.0
4 1749 311 1 7 4 0.007324 0.0 0.0 0.0 0.0
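Note that the gzip re-read above still promotes the first data row to the header. Passing `header=None` keeps every row; the names supplied below are placeholders, not the documented RSEI column names. A toy sketch:

```python
import gzip
import io
import pandas as pd

# Toy stand-in for aggmicro2019_2019_gc14.csv.gz: headerless rows.
raw = b"1744,311,1,7,4,5.71468E-003\n1745,311,1,7,4,6.03996E-003\n"
buf = io.BytesIO(gzip.compress(raw))

# header=None keeps row 0 as data; the names are placeholders only.
data = pd.read_csv(buf, compression="gzip", header=None,
                   names=["x", "y", "a", "b", "c", "toxconc"])
```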

These tables are available in RSEI_Queries. Here is an example; the references are here.

Foreign keys that relate Releases with Submissions

  1. ReleaseNumber - Unique internal identifier.
  2. SubmissionNumber - Unique internal identifier.
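Assuming those keys, joining release records to their submissions can be sketched with toy frames (columns trimmed to a subset of the real tables; the values are illustrative):

```python
import pandas as pd

# Toy subsets of the Releases and Submissions tables, keyed on
# SubmissionNumber (real tables carry many more columns).
releases = pd.DataFrame({
    "ReleaseNumber": [101, 102, 103],
    "SubmissionNumber": [1, 1, 2],
    "PoundsReleased": [50.0, 25.0, 10.0],
})
submissions = pd.DataFrame({
    "SubmissionNumber": [1, 2],
    "FacilityNumber": [8527, 45099],
    "SubmissionYear": [2000, 2000],
})

# Attach facility context to each release record.
joined = releases.merge(submissions, on="SubmissionNumber", how="left")
```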
import jaydebeapi
import pandas as pd
import pandas_access as mdb

db_path = "rsei_data3_v239.accdb"

# Listing the tables from Microsoft Access tables
for tbl in mdb.list_tables(db_path):
    print(tbl)
Releases
Submissions
# Read a huge table
accumulator = []
for chunk in mdb.read_table(db_path, "Submissions", chunksize=10000):
    accumulator.append(chunk)
accumulator[0].head()
DCN SubmissionNumber FacilityID FacilityNumber CAS ChemicalNumber MPOUse MaxOnsite Entire_Fac One_Time_Release_Qty One_Time_Release_Qty_NA Partial_Fac Trade_Secret_Ind SubmissionYear
0 1300140000011 1 15902CCKRN75BRI 8527 N982 595 111 04 1 0 0 0 0 2000
1 1300140000023 2 15902CCKRN75BRI 8527 007439921 346 111 02 1 0 0 0 0 2000
2 1300140000035 3 70363GLFSL583TH 45099 000108883 549 011 03 1 NaN 1 0 0 2000
3 1300140000047 4 70363GLFSL583TH 45099 001330207 592 011 03 1 NaN 1 0 0 2000
4 1300140000050 5 70363GLFSL583TH 45099 007440666 594 010 03 1 NaN 1 0 0 2000
db_path_2 = "rsei_data1_v239.accdb"

# First Table
for tbl in mdb.list_tables(db_path_2):
    print(tbl)
Category
NAICSCode
Sictable
Chemical
Media
Facility
Offsite
db_path_3 = "rsei_data2_v239.accdb"

# Second Table
for tbl in mdb.list_tables(db_path_3):
    print(tbl)
Elements

Public Release data (RSEIv239_Public_Release_Data directory)

from glob import glob
list_of_files = list(glob(str("RSEIv239_Public_Release_Data/") + "/*.csv"))

dfs_list = [
    pd.read_csv(f)
    for f in list_of_files
]
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (3,24,39,43) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (19,30,31) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (8) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
# none of these tables contain the unit of measurement we are after,
# although they do offer ample and comprehensive results

[df.columns for df in dfs_list]
[Index(['CASNumber', 'CASStandard', 'ChemicalNumber', 'Chemical', 'MCL1988',
        'MCL1989', 'MCL1990', 'MCL1991', 'MCL1992', 'MCL1993', 'MCL1994',
        'MCL1995', 'MCL1996', 'MCL1997', 'MCL1998', 'MCL1999', 'MCL2000',
        'MCL2001', 'MCL2002', 'MCL2003', 'MCL2004', 'MCL2005', 'MCL2006',
        'MCL2007', 'MCL2008', 'MCL2009', 'MCL2010', 'MCL2011', 'MCL2012',
        'MCL2013', 'MCL2014', 'MCL2015', 'MCL2016', 'MCL2017', 'MCL2018',
        'MCL2019'],
       dtype='object'),
 Index(['OffsiteID', 'FacilityNumber', 'TRIFID', 'DropIncinerator',
        'POTW_Incin', 'Name', 'Street', 'City', 'State', 'ZIPCode', 'ZIP9',
        'Latitude', 'Longitude', 'GridCode', 'Country', 'X', 'Y',
        'RadialDistance', 'StackHeight', 'StackVelocity', 'StackDiameter',
        'StackTemperature', 'StackParameterSource', 'NAICS', 'FRSID', 'FRSYear',
        'HEM3ID', 'DistanceToHEM3', 'WBANID', 'DistanceToWBAN', 'WaterReleases',
        'OutfallLongitude', 'OutfallLatitude', 'NearReach', 'NearComID',
        'DistanceToReach', 'AssignedReach', 'AssignedComID', 'ReachSource',
        'ReachNotes', 'LatLongSource', 'LatLongYear', 'LockLL',
        'CentroidAdjustment', 'Notes_on_coordinates',
        'AdditionalSourcesForLocation', 'LocationConfidence', 'Foreign'],
       dtype='object'),
 Index(['FacilityID', 'FacilityNumber', 'FRSID', 'Latitude', 'Longitude',
        'LLOverridden', 'GridCode', 'X', 'Y', 'LatLongSource', 'LLYear',
        'LLNotes', 'RadialDistance', 'FacilityName', 'Street', 'City', 'County',
        'State', 'ZIPCode', 'ZIP9', 'FIPS', 'STFIPS', 'DUNS', 'ParentDUNS',
        'ParentName', 'StandardizedParentCompany', 'Region',
        'FederalFacilityFlag', 'FederalAgencyName', 'PublicContactName',
        'PublicContactPhone', 'PublicContactPhoneExtension',
        'PublicContactEmail', 'StackHeight', 'StackVelocity', 'StackDiameter',
        'StackTemperature', 'StackHeightSource', 'StackVelocitySource',
        'StackDiameterSource', 'StackTemperatureSource', 'StackHeightNEIYear',
        'StackVelocityNEIYear', 'StackDiameterNEIYear',
        'StackTemperatureNEIYear', 'ChromHexPercent', 'ChromSource',
        'NewIndustryFlag', 'ModeledNAICS', 'NAICSCode3Digit', 'NAICSCode4Digit',
        'NAICSCode5Digit', 'NAICS1', 'NAICS2', 'NAICS3', 'NAICS4', 'NAICS5',
        'NAICS6', 'SIC1', 'FinalReach', 'FinalCOMID', 'ReachSource',
        'DistanceToReach', 'HEM3ID', 'DistanceToHEM3', 'LLConfirmed',
        'WaterReleases', 'ModeledReleases', 'ModChromReleases'],
       dtype='object'),
 Index(['DCN', 'SubmissionNumber', 'FacilityID', 'FacilityNumber', 'CAS',
        'ChemicalNumber', 'SubmissionYear', 'MPOUse', 'MaxOnsite', 'EntireFac',
        'OneTimeReleaseQty', 'OneTimeReleaseQtyNA', 'PartialFac',
        'TradeSecretInd'],
       dtype='object'),
 Index(['2017NAICSCode', 'LongName', 'Changed2017', 'TRIIndustrySector',
        'IndustrySubsector', '4DigitNAICS', 'NewIndustry'],
       dtype='object'),
 Index(['ReleaseNumber', 'SubmissionNumber', 'Media', 'PoundsReleased',
        'OffsiteNumber', 'TEF', 'TEFSource', 'DCN', 'TransferLocNum',
        'OffsiteAmountSequence', 'WaterSequenceNum'],
       dtype='object'),
 Index(['ElementNumber', 'ReleaseNumber', 'PoundsPT', 'ScoreCategory', 'Score',
        'Population', 'ScoreA', 'PopA', 'ScoreB', 'PopB', 'ScoreC', 'PopC',
        'ScoreD', 'PopD', 'ScoreE', 'PopE', 'NCScore', 'CScore', 'Hazard',
        'HazardC', 'HazardNC'],
       dtype='object'),
 Index(['Media', 'MediaText', 'TRIEnvironmentalMedium', 'Itw', 'Otw', 'Mtw',
        'MediaCode', 'TRICode', 'TRICategory', 'LongDescription',
        'AggDescription'],
       dtype='object'),
 Index(['ScoreCategory', 'Category', 'Model', 'InhaleTox'], dtype='object'),
 Index(['CASNumber', 'CASStandard', 'ChemicalNumber', 'TRIChemID',
        'MetalCombinedChemNum', 'Category', 'SortCAS', 'SortName',
        'FullChemicalName', 'Chemical', 'Added', 'ToxicitySource', 'RfCInhale',
        'RfCUF', 'RfCMF', 'RfCConf', 'RfCSource', 'RfCListingDate',
        'RfCToxWeight', 'RFDOral', 'RfDUF', 'RfDMF', 'RfDConf',
        'RfDListingDate', 'RfDSource', 'RfDToxWeight', 'UnitRiskInhale',
        'QSTAROral', 'WOE', 'UnitRiskListingDate', 'UnitRiskSource',
        'IURToxWeight', 'QStarListingDate', 'QStarSource', 'OSFToxWeight',
        'WOEListingDate', 'WOESource', 'ITW', 'OTW', 'ToxicityClassOral',
        'ToxicityClassInhale', 'ToxicityCategory', 'AirDecay', 'Koc',
        'H2ODecay', 'LOGKow', 'Kd', 'WaterSolubility', 'POTWPartitionRemoval',
        'POTWPartitionSludge', 'POTWPartitionVolat', 'POTWPartitionBiod',
        'IncineratorDRE', 'BCF', 'Henrys', 'MCL', 'MolecularWeight', 'HAPFlag',
        'CAAFlag', 'PriorityPollutantFlag', 'SDWAFlag', 'CERCLAFlag',
        'OSHACarcinogens', 'ExpansionFlag', 'Core88ChemicalFlag',
        'Core95ChemicalFlag', 'Core98ChemicalFlag', 'Core00ChemicalFlag',
        'Core01ChemicalFlag', 'HPVFlag', 'HPVChallenge', 'PBTFlag', 'Metal',
        'HasTox', 'MaxTW', 'MaxNC', 'MaxC', 'Notes'],
       dtype='object')]

Disaggregated Microdata

The values represented by each column are documented here.

# Lazily reading files into single Dask DataFrame (~21 GB)
import dask.dataframe as dd

ddf_direct = dd.read_csv(
    "/Users/sarahluw/Downloads/micro2019_2019.csv.gz",
    blocksize=None, 
    compression='gzip',
    sample=100
)
# first one is the GridCode
# second X-coordinate of grid cell
# third Y-Coordinate of grid cell
ddf_direct.columns
Index(['44', '-58', '-4', '6457356', '519', '16', '1', '2.17067E-05',
       '7.59735E-05', '0.00000E+00', '0.00000E+00.1', '0.00000E+00.2',
       '0.00000E+00.3'],
      dtype='object')

CSV Aggregate for Tract-Level Geographic Unit - after an exchange with a contact at EPA

import pandas as pd

url = "http://abt-rsei.s3.amazonaws.com/microdata2019/census_agg/CensusMicroTracts2019_2019_aggregated.zip"
df = pd.read_csv(url, header=0)
columns = [
        "GEOID10", 
        "NUMFACS",
        "NUMRELEASES",
        "NUMCHEMS",
        "TOXCONC",
        "SCORE",
        "POP",
        "CSCORE",
        "NCSCORE"
]
# the columns are actually a census tract's data at this point.
# The subsequent steps make the current column labels the first row
# of the dataframe.
df.columns
Index(['36047023200', '193', '904', '96', '4.85177E+002', '7.59714E+002',
       '6.19922E+003', '6.10315E+002', '1.54848E+002'],
      dtype='object')
first_row = dict(zip(columns, df.columns))
# add first row to dataframe
row_df = pd.DataFrame(first_row, index=[0])
df.columns = columns
# simply concatenate both dataframes
df = pd.concat([row_df, df]).reset_index(drop = True)
df.head(5)
GEOID10 NUMFACS NUMRELEASES NUMCHEMS TOXCONC SCORE POP CSCORE NCSCORE
0 36047023200 193 904 96 4.85177E+002 7.59714E+002 6.19922E+003 6.10315E+002 1.54848E+002
1 36029007202 84 376 81 227.366 79.5029 1601.87 63.0466 14.8128
2 42003429201 144 1624 251 2705.16 2553.95 4527.6 2447.72 113.751
3 17201004100 80 365 57 3626.62 3984.97 5766.72 3969.13 27.0562
4 47065011203 79 358 79 1133.4 1885.51 7820.66 1410.31 447.775
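An alternative to splicing the header row back in is to re-read with `header=None` and `names=columns`; reading `GEOID10` as a string also preserves the leading zeros that an integer parse would drop for low-numbered state FIPS codes. A sketch on toy data (the second row's GEOID is invented to show the leading-zero case):

```python
import io
import pandas as pd

columns = ["GEOID10", "NUMFACS", "NUMRELEASES", "NUMCHEMS",
           "TOXCONC", "SCORE", "POP", "CSCORE", "NCSCORE"]

# Toy stand-in for the headerless tract aggregate (row 0 is data).
raw = ("36047023200,193,904,96,4.85177E+002,7.59714E+002,"
       "6.19922E+003,6.10315E+002,1.54848E+002\n"
       "01073000100,84,376,81,227.366,79.5029,1601.87,63.0466,14.8128\n")

# header=None keeps row 0 as data; dtype=str on GEOID10 preserves the
# leading zero that an integer parse would drop (e.g. Alabama, FIPS 01).
df = pd.read_csv(io.StringIO(raw), header=None, names=columns,
                 dtype={"GEOID10": str})
```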