usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

Load RSEI industrial pollution data to make available in the tool database #1070

Closed lucasmbrown-usds closed 2 years ago

lucasmbrown-usds commented 2 years ago

Description: RSEI's data is available crosswalked to census tracts:

RSEI Geographic Microdata provides results at the grid cell level. Crosswalks to translate data from the RSEI grid cell system to U.S. census block geographies are provided so that users can combine RSEI results with other data and aggregate the results to any census geography, like block, block group, or tract.

https://www.epa.gov/rsei/rsei-grid-and-locational-information

This could be a valuable resource for information on pollution sources. It is updated on a yearly basis with a deep historical record, and because the data collection is legislatively mandated, it's very comprehensive.
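The crosswalk-based aggregation described above can be sketched with pandas. The frame layouts and column names below (cell IDs, block FIPS codes, apportionment weights) are illustrative assumptions, not the actual EPA file schema:

```python
import pandas as pd

# Hypothetical crosswalk: what fraction of each RSEI grid cell falls
# in each census block (columns are illustrative, not the EPA schema).
crosswalk = pd.DataFrame({
    "cell_id": [1, 1, 2],
    "block_fips": ["360470232001000", "360470232002000", "360470232001000"],
    "weight": [0.6, 0.4, 1.0],
})

# Hypothetical grid-cell results, e.g. a toxicity-weighted concentration.
cells = pd.DataFrame({"cell_id": [1, 2], "toxconc": [10.0, 5.0]})

# Apportion cell values to blocks, then roll blocks up to tracts
# (a tract ID is the first 11 digits of the 15-digit block FIPS).
merged = crosswalk.merge(cells, on="cell_id")
merged["weighted"] = merged["weight"] * merged["toxconc"]
merged["tract_fips"] = merged["block_fips"].str[:11]
tract_totals = merged.groupby("tract_fips")["weighted"].sum()
```

This is the general pattern for aggregating RSEI grid results to any census geography: merge on cell ID, multiply by the crosswalk weight, then group by the target geography's FIPS prefix.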

saran-ahluwalia commented 2 years ago
  1. Data dictionary
  2. Methodology documentation
  3. Selecting data sources

Sample instructions from the documentation:

For example, if you filter your data by selecting "325 Chemicals" and "326 Plastics and Rubber" from the Industry Sector field, the subset of data will be any records where "325 Chemicals" or "326 Plastics and Rubber" is the value in the Industry Sector field - "325 Chemicals" and "326 Plastics and Rubber" are both in the selected state. All other values in the Industry Sector field are alternative states. Within "325 Chemicals" and "326 Plastics and Rubber" are many subsectors that are denoted as possible states because, though not selected, they are possible given the current selection state.

This raises several questions for further investigation and clarification:

  1. What question are we trying to answer with this dataset? For example, which sectors and/or facilities contribute the most of "X" type of particulate or carcinogen?
  2. It's not immediately clear how to access the datasets programmatically without going through the reporting tool (see item 3 above).
  3. Do we want modeled data? This includes identifiers, locational data, water modeling data, air modeling data, and other information about the facility.
lucasmbrown-usds commented 2 years ago

Looks like there's a single aggregate score they produce: https://www.epa.gov/rsei/understanding-rsei-results#what.

saran-ahluwalia commented 2 years ago

Hi Lucas,

As we discussed, here are the three questions I raised as they relate to the RSEI Score, plus some additional notes that address item (2) above:

  1. RSEI Scores do not describe a level of risk (such as the number of excess lung cancer cases). RSEI is not a risk assessment; rather, it indicates areas for further analysis, which is why I raised question (1) above.

  2. From the documents: RSEI Scores are designed to be compared to each other. A RSEI Score 10 times higher than another RSEI Score suggests that the potential for risk is 10 times higher. Relatively small chemical releases may lead to high RSEI Scores if the toxicity weight is particularly high or if the modeled exposed population is large. Conversely, large chemical releases may lead to low RSEI Scores if the toxicity weight is low or if the modeled exposed population is small. High RSEI Scores identify areas that may warrant further investigation, and do not conclusively demonstrate sources of risk. A facility with a high RSEI Score may not be the primary source of chronic human health risk in an area. RSEI Scores point to chemical releases that may warrant further investigation.

  3. There seems to be an FTP server at "newftp.epa.gov" that requires authentication for the census tract microdata.

saran-ahluwalia commented 2 years ago

This provides details on how RSEI should be used: https://www.epa.gov/rsei/how-rsei-should-be-used.

saran-ahluwalia commented 2 years ago

Below is a summary of basic exploration of reporting year 2019. According to the associate I was in contact with, they have not produced census tract files for any year but 2019.

According to the associate, in the future they will have 2020 data, which will include all years of data by census tract up to 2020. They will host this data on another production server - the dataset is too large for the FTP host. Generally, the way to access the data is to contact them, and they will provide download links.

import pandas as pd
import os
import ftplib

ftp = ftplib.FTP('newftp.epa.gov')
ftp.login()
'230 User logged in, proceed.'

Summary of findings from searching FTP server

In order to ensure alignment with our unit of measurement, we seek two distinct sources:

Across many of the tables it appears the foreign key is the DCN (the unique identifier assigned by TRI to each facility submission - otherwise known as the document control number).

The source documentation with the associated tables is found here, page 20.

def ftp_as_zip(file_name, file_path, host, data_location='.'):
    '''Generic function for accessing the FTP server and saving a designated
    file from the host server to the local machine.
    '''
    print("Connecting to: ", host)
    ftp = ftplib.FTP(host)
    ftp.login()
    ftp.cwd(file_path)
    print("Data will be persisted to: " + data_location)
    if not os.path.exists(data_location):
        os.makedirs(data_location)
    print("writing file: " + file_name)
    with open(os.path.join(data_location, file_name), "wb") as f_handler:
        # the second argument to retrbinary is a callback invoked with
        # each block of received data - here, the file handle's write method
        ftp.retrbinary('RETR %s' % file_name, f_handler.write)
    ftp.quit()
# Example Usage - Note this is just an example and does 
# not represent the current files we seek
file_name = 'poly_gc14_conus_810m_top.dbf' #downloading from EPA public FTP site
file_path='RSEI/Shapefiles/810m_Standard_Grid_Shapefiles/' # path on host server
host = 'newftp.epa.gov' # host
ftp_as_zip(file_name, file_path, host)
Connecting to:  newftp.epa.gov
Data will be persisted to: .
writing file: poly_gc14_conus_810m_top.dbf
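For robustness, the same download can be wrapped so that the connection is released even when the transfer fails. `ftplib.FTP` supports the context-manager protocol in Python 3; the function below is a sketch along those lines, using the same hypothetical file/path arguments as the example above:

```python
import ftplib
import os

def fetch_ftp_file(file_name, file_path, host, data_location="."):
    """Download one file over anonymous FTP; the with-blocks close the
    connection and the local file even if the transfer raises (sketch)."""
    os.makedirs(data_location, exist_ok=True)
    target = os.path.join(data_location, file_name)
    # ftplib.FTP.__exit__ sends QUIT (or closes the socket) on the way out
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login, as in the session above
        ftp.cwd(file_path)
        with open(target, "wb") as f_handler:
            # retrbinary invokes the callback once per received block
            ftp.retrbinary("RETR " + file_name, f_handler.write)
    return target
```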
We would want the 2019 version of the release, correct? This appears to be the latest version of the release:
 ftp.cwd("RSEI/Version239_RY2019/")
'250 Directory changed to /RSEI/Version239_RY2019'

4 directories

files = ftp.nlst()
len(files)
4

Each is a directory; let's take a look at each.

files
['Aggregated_Grid_Cell_Data',
 'Disaggregated_Microdata',
 'Public_Release_Data',
 'RSEI_Queries']

Below is an example of what is available

Example shapefiles

RSEI crosswalks and shapefiles: RSEI crosswalks allow you to transpose data from the RSEI grid to Census blocks, and shapefiles provide grid geometry for mapping microdata.

import geopandas
import pandas as pd
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/geopandas/_compat.py:110: UserWarning: The Shapely GEOS version (3.9.1-CAPI-1.14.2) is incompatible with the GEOS version PyGEOS was compiled with (3.10.1-CAPI-1.16.0). Conversions between both will be slow.
  shapely_geos_version, geos_capi_version_string
gdf = geopandas.read_file("poly_gc54_guam_810m.shp")  
gdf.head()
CELLX CELLY CLAT CLONG CX CY geometry
0 -1400 1814 13.181812 144.532682 -1134000 1469340 POLYGON ((144.52879 13.18531, 144.53626 13.185...
1 -1399 1814 13.182124 144.540150 -1133190 1469340 POLYGON ((144.53626 13.18562, 144.54372 13.185...
2 -1398 1814 13.182436 144.547618 -1132380 1469340 POLYGON ((144.54372 13.18594, 144.55119 13.186...
3 -1397 1814 13.182747 144.555085 -1131570 1469340 POLYGON ((144.55119 13.18625, 144.55866 13.186...
4 -1396 1814 13.183058 144.562553 -1130760 1469340 POLYGON ((144.55866 13.18656, 144.56613 13.186...
DBF details (source)

A DBF file is a standard database file used by dBASE, a database management system application. It organizes data into multiple records with fields stored in an array data type. DBF files are also compatible with other "xBase" database programs, which arose because of the file format's popularity.

import dbf

table = dbf.Table('poly_gc14_conus_810m_top.dbf')
# these do not reflect the level of geographic measurement
table.field_names
FieldnameList(['CELLX', 'CELLY', 'CLAT', 'CLONG', 'CX', 'CY'])

Aggregate - this is what I thought we sought (page 20 of this reference).

However, it is in fact this:

import pandas as pd
# hmmmm, okay, so no header - 
# no problem, let's see if we just use the file handler from gzip (further down)
data = pd.read_csv('/Users/sarahluw/Downloads/aggmicro2019_2019_gc14.csv.gz', 
                   nrows=10, compression='gzip', header=1,
                   error_bad_lines=False)
data.head()
   1745  311  1   7  4  6.03996E-003  0.00000E+000  0.00000E+000.1  \
0  1746  311  1   7  4      0.006368           0.0             0.0   
1  1747  311  1   7  4      0.006731           0.0             0.0   
2  1748  311  1   7  4      0.007100           0.0             0.0   
3  1749  311  1   7  4      0.007324           0.0             0.0   
4  1750  311  1   7  4      0.007367           0.0             0.0   
5  1751  311  1   7  4      0.007427           0.0             0.0   
6  1752  311  1   7  4      0.007650           0.0             0.0   
7  1753  311  1   7  4      0.007895           0.0             0.0   
8  1754  311  2  11  6      0.020016           0.0             0.0   
9  1755  311  2  11  6      0.021184           0.0             0.0   

   0.00000E+000.2  0.00000E+000.3  
0             0.0             0.0  
1             0.0             0.0  
2             0.0             0.0  
3             0.0             0.0  
4             0.0             0.0  
5             0.0             0.0  
6             0.0             0.0  
7             0.0             0.0  
8             0.0             0.0  
9             0.0             0.0  
import gzip

with gzip.open('/Users/sarahluw/Downloads/aggmicro2019_2019_gc14.csv.gz', 'rb') as f:
    data = pd.read_csv(f)
# easier to read
data.head()
1744 311 1 7 4 5.71468E-003 0.00000E+000 0.00000E+000.1 0.00000E+000.2 0.00000E+000.3
0 1745 311 1 7 4 0.006040 0.0 0.0 0.0 0.0
1 1746 311 1 7 4 0.006368 0.0 0.0 0.0 0.0
2 1747 311 1 7 4 0.006731 0.0 0.0 0.0 0.0
3 1748 311 1 7 4 0.007100 0.0 0.0 0.0 0.0
4 1749 311 1 7 4 0.007324 0.0 0.0 0.0 0.0
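Note that the gzip re-read above still promotes the first data row to the header. Passing `header=None` keeps every row; the names supplied below are placeholders, not the documented RSEI column names. A toy sketch:

```python
import gzip
import io
import pandas as pd

# Toy stand-in for aggmicro2019_2019_gc14.csv.gz: headerless rows.
raw = b"1744,311,1,7,4,5.71468E-003\n1745,311,1,7,4,6.03996E-003\n"
buf = io.BytesIO(gzip.compress(raw))

# header=None keeps row 0 as data; the names are placeholders only.
data = pd.read_csv(buf, compression="gzip", header=None,
                   names=["x", "y", "a", "b", "c", "toxconc"])
```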

These tables are available in RSEI_Queries. Here is an example; the references are here.

Foreign keys that relate Releases with Submissions

  1. ReleaseNumber - Unique internal identifier.
  2. SubmissionNumber - Unique internal identifier.
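Assuming those keys, joining release records to their submissions can be sketched with toy frames (columns trimmed to a subset of the real tables; the values are illustrative):

```python
import pandas as pd

# Toy subsets of the Releases and Submissions tables, keyed on
# SubmissionNumber (real tables carry many more columns).
releases = pd.DataFrame({
    "ReleaseNumber": [101, 102, 103],
    "SubmissionNumber": [1, 1, 2],
    "PoundsReleased": [50.0, 25.0, 10.0],
})
submissions = pd.DataFrame({
    "SubmissionNumber": [1, 2],
    "FacilityNumber": [8527, 45099],
    "SubmissionYear": [2000, 2000],
})

# Attach facility context to each release record.
joined = releases.merge(submissions, on="SubmissionNumber", how="left")
```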
import jaydebeapi
import pandas as pd
import pandas_access as mdb

db_path = "rsei_data3_v239.accdb"

# Listing the tables from Microsoft Access tables
for tbl in mdb.list_tables(db_path):
    print(tbl)
Releases
Submissions
# Read a huge table
accumulator = []
for chunk in mdb.read_table(db_path, "Submissions", chunksize=10000):
    accumulator.append(chunk)
accumulator[0].head()
DCN SubmissionNumber FacilityID FacilityNumber CAS ChemicalNumber MPOUse MaxOnsite Entire_Fac One_Time_Release_Qty One_Time_Release_Qty_NA Partial_Fac Trade_Secret_Ind SubmissionYear
0 1300140000011 1 15902CCKRN75BRI 8527 N982 595 111 04 1 0 0 0 0 2000
1 1300140000023 2 15902CCKRN75BRI 8527 007439921 346 111 02 1 0 0 0 0 2000
2 1300140000035 3 70363GLFSL583TH 45099 000108883 549 011 03 1 NaN 1 0 0 2000
3 1300140000047 4 70363GLFSL583TH 45099 001330207 592 011 03 1 NaN 1 0 0 2000
4 1300140000050 5 70363GLFSL583TH 45099 007440666 594 010 03 1 NaN 1 0 0 2000
db_path_2 = "rsei_data1_v239.accdb"

# First Table
for tbl in mdb.list_tables(db_path_2):
    print(tbl)
Category
NAICSCode
Sictable
Chemical
Media
Facility
Offsite
db_path_3 = "rsei_data2_v239.accdb"

# Second Table
for tbl in mdb.list_tables(db_path_3):
    print(tbl)
Elements

Public Release data (RSEIv239_Public_Release_Data directory)

from glob import glob
list_of_files = list(glob(str("RSEIv239_Public_Release_Data/") + "/*.csv"))

dfs_list = [
    pd.read_csv(f)
    for f in list_of_files
]
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (3,24,39,43) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (19,30,31) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (8) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
# none of these tables contain the unit of measurement we are after,
# although they do offer ample and comprehensive results

[df.columns for df in dfs_list]
[Index(['CASNumber', 'CASStandard', 'ChemicalNumber', 'Chemical', 'MCL1988',
        'MCL1989', 'MCL1990', 'MCL1991', 'MCL1992', 'MCL1993', 'MCL1994',
        'MCL1995', 'MCL1996', 'MCL1997', 'MCL1998', 'MCL1999', 'MCL2000',
        'MCL2001', 'MCL2002', 'MCL2003', 'MCL2004', 'MCL2005', 'MCL2006',
        'MCL2007', 'MCL2008', 'MCL2009', 'MCL2010', 'MCL2011', 'MCL2012',
        'MCL2013', 'MCL2014', 'MCL2015', 'MCL2016', 'MCL2017', 'MCL2018',
        'MCL2019'],
       dtype='object'),
 Index(['OffsiteID', 'FacilityNumber', 'TRIFID', 'DropIncinerator',
        'POTW_Incin', 'Name', 'Street', 'City', 'State', 'ZIPCode', 'ZIP9',
        'Latitude', 'Longitude', 'GridCode', 'Country', 'X', 'Y',
        'RadialDistance', 'StackHeight', 'StackVelocity', 'StackDiameter',
        'StackTemperature', 'StackParameterSource', 'NAICS', 'FRSID', 'FRSYear',
        'HEM3ID', 'DistanceToHEM3', 'WBANID', 'DistanceToWBAN', 'WaterReleases',
        'OutfallLongitude', 'OutfallLatitude', 'NearReach', 'NearComID',
        'DistanceToReach', 'AssignedReach', 'AssignedComID', 'ReachSource',
        'ReachNotes', 'LatLongSource', 'LatLongYear', 'LockLL',
        'CentroidAdjustment', 'Notes_on_coordinates',
        'AdditionalSourcesForLocation', 'LocationConfidence', 'Foreign'],
       dtype='object'),
 Index(['FacilityID', 'FacilityNumber', 'FRSID', 'Latitude', 'Longitude',
        'LLOverridden', 'GridCode', 'X', 'Y', 'LatLongSource', 'LLYear',
        'LLNotes', 'RadialDistance', 'FacilityName', 'Street', 'City', 'County',
        'State', 'ZIPCode', 'ZIP9', 'FIPS', 'STFIPS', 'DUNS', 'ParentDUNS',
        'ParentName', 'StandardizedParentCompany', 'Region',
        'FederalFacilityFlag', 'FederalAgencyName', 'PublicContactName',
        'PublicContactPhone', 'PublicContactPhoneExtension',
        'PublicContactEmail', 'StackHeight', 'StackVelocity', 'StackDiameter',
        'StackTemperature', 'StackHeightSource', 'StackVelocitySource',
        'StackDiameterSource', 'StackTemperatureSource', 'StackHeightNEIYear',
        'StackVelocityNEIYear', 'StackDiameterNEIYear',
        'StackTemperatureNEIYear', 'ChromHexPercent', 'ChromSource',
        'NewIndustryFlag', 'ModeledNAICS', 'NAICSCode3Digit', 'NAICSCode4Digit',
        'NAICSCode5Digit', 'NAICS1', 'NAICS2', 'NAICS3', 'NAICS4', 'NAICS5',
        'NAICS6', 'SIC1', 'FinalReach', 'FinalCOMID', 'ReachSource',
        'DistanceToReach', 'HEM3ID', 'DistanceToHEM3', 'LLConfirmed',
        'WaterReleases', 'ModeledReleases', 'ModChromReleases'],
       dtype='object'),
 Index(['DCN', 'SubmissionNumber', 'FacilityID', 'FacilityNumber', 'CAS',
        'ChemicalNumber', 'SubmissionYear', 'MPOUse', 'MaxOnsite', 'EntireFac',
        'OneTimeReleaseQty', 'OneTimeReleaseQtyNA', 'PartialFac',
        'TradeSecretInd'],
       dtype='object'),
 Index(['2017NAICSCode', 'LongName', 'Changed2017', 'TRIIndustrySector',
        'IndustrySubsector', '4DigitNAICS', 'NewIndustry'],
       dtype='object'),
 Index(['ReleaseNumber', 'SubmissionNumber', 'Media', 'PoundsReleased',
        'OffsiteNumber', 'TEF', 'TEFSource', 'DCN', 'TransferLocNum',
        'OffsiteAmountSequence', 'WaterSequenceNum'],
       dtype='object'),
 Index(['ElementNumber', 'ReleaseNumber', 'PoundsPT', 'ScoreCategory', 'Score',
        'Population', 'ScoreA', 'PopA', 'ScoreB', 'PopB', 'ScoreC', 'PopC',
        'ScoreD', 'PopD', 'ScoreE', 'PopE', 'NCScore', 'CScore', 'Hazard',
        'HazardC', 'HazardNC'],
       dtype='object'),
 Index(['Media', 'MediaText', 'TRIEnvironmentalMedium', 'Itw', 'Otw', 'Mtw',
        'MediaCode', 'TRICode', 'TRICategory', 'LongDescription',
        'AggDescription'],
       dtype='object'),
 Index(['ScoreCategory', 'Category', 'Model', 'InhaleTox'], dtype='object'),
 Index(['CASNumber', 'CASStandard', 'ChemicalNumber', 'TRIChemID',
        'MetalCombinedChemNum', 'Category', 'SortCAS', 'SortName',
        'FullChemicalName', 'Chemical', 'Added', 'ToxicitySource', 'RfCInhale',
        'RfCUF', 'RfCMF', 'RfCConf', 'RfCSource', 'RfCListingDate',
        'RfCToxWeight', 'RFDOral', 'RfDUF', 'RfDMF', 'RfDConf',
        'RfDListingDate', 'RfDSource', 'RfDToxWeight', 'UnitRiskInhale',
        'QSTAROral', 'WOE', 'UnitRiskListingDate', 'UnitRiskSource',
        'IURToxWeight', 'QStarListingDate', 'QStarSource', 'OSFToxWeight',
        'WOEListingDate', 'WOESource', 'ITW', 'OTW', 'ToxicityClassOral',
        'ToxicityClassInhale', 'ToxicityCategory', 'AirDecay', 'Koc',
        'H2ODecay', 'LOGKow', 'Kd', 'WaterSolubility', 'POTWPartitionRemoval',
        'POTWPartitionSludge', 'POTWPartitionVolat', 'POTWPartitionBiod',
        'IncineratorDRE', 'BCF', 'Henrys', 'MCL', 'MolecularWeight', 'HAPFlag',
        'CAAFlag', 'PriorityPollutantFlag', 'SDWAFlag', 'CERCLAFlag',
        'OSHACarcinogens', 'ExpansionFlag', 'Core88ChemicalFlag',
        'Core95ChemicalFlag', 'Core98ChemicalFlag', 'Core00ChemicalFlag',
        'Core01ChemicalFlag', 'HPVFlag', 'HPVChallenge', 'PBTFlag', 'Metal',
        'HasTox', 'MaxTW', 'MaxNC', 'MaxC', 'Notes'],
       dtype='object')]

Disaggregated Microdata

The values represented by each column are documented here.

# Lazily reading files into single Dask DataFrame (~21 GB)
import dask.dataframe as dd

ddf_direct = dd.read_csv(
    "/Users/sarahluw/Downloads/micro2019_2019.csv.gz",
    blocksize=None, 
    compression='gzip',
    sample=100
)
# first one is the GridCode
# second X-coordinate of grid cell
# third Y-Coordinate of grid cell
ddf_direct.columns
Index(['44', '-58', '-4', '6457356', '519', '16', '1', '2.17067E-05',
       '7.59735E-05', '0.00000E+00', '0.00000E+00.1', '0.00000E+00.2',
       '0.00000E+00.3'],
      dtype='object')

CSV Aggregate for Tract-Level Geographic Unit - after an exchange with a contact at EPA

import pandas as pd

url = "http://abt-rsei.s3.amazonaws.com/microdata2019/census_agg/CensusMicroTracts2019_2019_aggregated.zip"
df = pd.read_csv(url, header=0)
columns = [
        "GEOID10", 
        "NUMFACS",
        "NUMRELEASES",
        "NUMCHEMS",
        "TOXCONC",
        "SCORE",
        "POP",
        "CSCORE",
        "NCSCORE"
]
# the columns are actually a census tract's data at this point.
# The subsequent steps make the current column labels the first row
# of the dataframe.
df.columns
Index(['36047023200', '193', '904', '96', '4.85177E+002', '7.59714E+002',
       '6.19922E+003', '6.10315E+002', '1.54848E+002'],
      dtype='object')
first_row = dict(zip(columns, df.columns))
# add first row to dataframe
row_df = pd.DataFrame(first_row, index=[0])
df.columns = columns
# simply concatenate both dataframes
df = pd.concat([row_df, df]).reset_index(drop = True)
df.head(5)
GEOID10 NUMFACS NUMRELEASES NUMCHEMS TOXCONC SCORE POP CSCORE NCSCORE
0 36047023200 193 904 96 4.85177E+002 7.59714E+002 6.19922E+003 6.10315E+002 1.54848E+002
1 36029007202 84 376 81 227.366 79.5029 1601.87 63.0466 14.8128
2 42003429201 144 1624 251 2705.16 2553.95 4527.6 2447.72 113.751
3 17201004100 80 365 57 3626.62 3984.97 5766.72 3969.13 27.0562
4 47065011203 79 358 79 1133.4 1885.51 7820.66 1410.31 447.775
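An alternative to splicing the header row back in is to re-read with `header=None` and `names=columns`; reading `GEOID10` as a string also preserves the leading zeros that an integer parse would drop for low-numbered state FIPS codes. A sketch on toy data (the second row's GEOID is invented to show the leading-zero case):

```python
import io
import pandas as pd

columns = ["GEOID10", "NUMFACS", "NUMRELEASES", "NUMCHEMS",
           "TOXCONC", "SCORE", "POP", "CSCORE", "NCSCORE"]

# Toy stand-in for the headerless tract aggregate (row 0 is data).
raw = ("36047023200,193,904,96,4.85177E+002,7.59714E+002,"
       "6.19922E+003,6.10315E+002,1.54848E+002\n"
       "01073000100,84,376,81,227.366,79.5029,1601.87,63.0466,14.8128\n")

# header=None keeps row 0 as data; dtype=str on GEOID10 preserves the
# leading zero that an integer parse would drop (e.g. Alabama, FIPS 01).
df = pd.read_csv(io.StringIO(raw), header=None, names=columns,
                 dtype={"GEOID10": str})
```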