Closed lucasmbrown-usds closed 2 years ago
Sample instructions from the documentation:
For example, if you filter your data by selecting "325 Chemicals" and "326 Plastics and Rubber" from the Industry Sector field, the subset of data will be any records where "325 Chemicals" or "326 Plastics and Rubber" is the value in the Industry Sector field - "325 Chemicals" and "326 Plastics and Rubber" are both in the selected state. All other values in the Industry Sector field are alternative states. Within "325 Chemicals" and "326 Plastics and Rubber" are many subsectors that are denoted as possible states because, though not selected, they are possible given the current selection state.
This raises several questions for further investigation and clarification:
Looks like there's a single aggregate score they produce: https://www.epa.gov/rsei/understanding-rsei-results#what.
Hi Lucas,
As we discussed, here are the three questions I raised - as they relate to the RSEI Score - along with some additional notes that address (2) above:
RSEI Scores do not describe a level of risk (such as the number of excess lung cancer cases). RSEI is not a risk assessment; rather, it indicates areas for further analysis - which is why I raised (1), above.
From the documents: RSEI Scores are designed to be compared to each other. A RSEI Score 10 times higher than another RSEI Score suggests that the potential for risk is 10 times higher. Relatively small chemical releases may lead to high RSEI Scores if the toxicity weight is particularly high or if the modeled exposed population is large. Conversely, large chemical releases may lead to low RSEI Scores if the toxicity weight is low or if the modeled exposed population is small. High RSEI Scores identify areas that may warrant further investigation, and do not conclusively demonstrate sources of risk. A facility with a high RSEI Score may not be the primary source of chronic human health risk in an area. RSEI Scores point to chemical releases that may warrant further investigation.
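The relative nature of the score is worth emphasizing: only the ratio between two scores carries meaning. A trivial illustration (the numbers below are made up):

```python
# Hypothetical RSEI Scores for two facilities (illustrative values only)
score_a = 2500.0
score_b = 250.0

# Per the documentation, only the ratio is meaningful: facility A's
# potential for risk is 10x facility B's - not an absolute risk level.
relative_potential = score_a / score_b
print(relative_potential)  # 10.0
```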
There seems to be an FTP server that requires authentication for the census tract microdata at "newftp.epa.gov"
This provides details on how RSEI should be used: https://www.epa.gov/rsei/how-rsei-should-be-used.
Below is a summary of a basic exploration of reporting year 2019. According to the associate I was in contact with, they have not produced census tract files for any year but 2019.
According to the associate, they will have 2020 data in the future, which will include all years of data by census tract up to 2020. They will host this data on another production server, since the dataset is too large for the FTP host. Generally, the way to access the data is to contact them, and they will provide download links.
import pandas as pd
import os
import ftplib
ftp = ftplib.FTP('newftp.epa.gov')
ftp.login()
'230 User logged in, proceed.'
Across many of the tables, it appears the foreign key is the DCN (the document control number - a unique identifier assigned by TRI to each facility submission).
The source documentation with the associated tables is found here, page 20.
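To illustrate the DCN relationship, here is a minimal sketch of joining release records back to their facility submissions with pandas. The tables below are toy stand-ins: the column names follow the data dictionary, but the values are made up.

```python
import pandas as pd

# Toy stand-ins for the RSEI Submissions and Releases tables
submissions = pd.DataFrame({
    "DCN": ["1300140000011", "1300140000023"],
    "FacilityID": ["15902CCKRN75BRI", "15902CCKRN75BRI"],
    "SubmissionYear": [2000, 2000],
})
releases = pd.DataFrame({
    "DCN": ["1300140000011", "1300140000011", "1300140000023"],
    "Media": [1, 2, 1],
    "PoundsReleased": [100.0, 25.0, 10.0],
})

# Join each release record to its facility submission via the DCN key
merged = releases.merge(submissions, on="DCN", how="left")
print(merged.shape)  # (3, 5)
```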
def ftp_as_zip(file_name, file_path, host, data_location='.'):
    '''Generic function for connecting to the FTP server and saving a
    designated file from the host to the local machine.
    '''
    print("Connecting to: ", host)
    ftp = ftplib.FTP(host)
    ftp.login()
    ftp.cwd(file_path)
    print("Data will be persisted to: " + data_location)
    if not os.path.exists(data_location):
        os.makedirs(data_location)
    print("writing file: " + file_name)
    with open(os.path.join(data_location, file_name), "wb") as f_handler:
        # the second argument to retrbinary is a callback invoked for each
        # block of data received - here, the file handle's write method
        ftp.retrbinary('RETR %s' % file_name, f_handler.write)
    ftp.quit()
# Example Usage - Note this is just an example and does
# not represent the current files we seek
file_name = 'poly_gc14_conus_810m_top.dbf' #downloading from EPA public FTP site
file_path='RSEI/Shapefiles/810m_Standard_Grid_Shapefiles/' # path on host server
host = 'newftp.epa.gov' # host
ftp_as_zip(file_name, file_path, host)
Connecting to: newftp.epa.gov
Data will be persisted to: .
writing file: poly_gc14_conus_810m_top.dbf
ftp.cwd("RSEI/Version239_RY2019/")
'250 Directory changed to /RSEI/Version239_RY2019'
4 directories
files = ftp.nlst()
len(files)
4
files
['Aggregated_Grid_Cell_Data',
'Disaggregated_Microdata',
'Public_Release_Data',
'RSEI_Queries']
RSEI crosswalks and shapefiles. RSEI crosswalks allow you to transpose data from the RSEI grid to Census blocks, and shapefiles provide grid geometry for mapping microdata.
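As a sketch of how such a crosswalk might be applied: join grid-cell values to the crosswalk, apportion each cell's score by the share of the cell falling in each block, and sum by block. The column names `BlockGEOID` and `Pct` are assumptions here, not the actual RSEI crosswalk schema - check the data dictionary before use.

```python
import pandas as pd

# Toy grid-cell scores keyed on CELLX/CELLY (values made up)
grid_scores = pd.DataFrame({
    "CELLX": [-1400, -1399],
    "CELLY": [1814, 1814],
    "SCORE": [12.0, 8.0],
})
# Hypothetical crosswalk: share (Pct) of each cell's area in each block
crosswalk = pd.DataFrame({
    "CELLX": [-1400, -1400, -1399],
    "CELLY": [1814, 1814, 1814],
    "BlockGEOID": ["660100001001000", "660100001001001", "660100001001000"],
    "Pct": [0.75, 0.25, 1.0],
})

# Apportion each cell's score to blocks, then aggregate by block
joined = grid_scores.merge(crosswalk, on=["CELLX", "CELLY"])
joined["block_score"] = joined["SCORE"] * joined["Pct"]
by_block = joined.groupby("BlockGEOID")["block_score"].sum()
print(by_block)
```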
import geopandas
import pandas as pd
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/geopandas/_compat.py:110: UserWarning: The Shapely GEOS version (3.9.1-CAPI-1.14.2) is incompatible with the GEOS version PyGEOS was compiled with (3.10.1-CAPI-1.16.0). Conversions between both will be slow.
shapely_geos_version, geos_capi_version_string
gdf = geopandas.read_file("poly_gc54_guam_810m.shp")
gdf.head()
CELLX | CELLY | CLAT | CLONG | CX | CY | geometry | |
---|---|---|---|---|---|---|---|
0 | -1400 | 1814 | 13.181812 | 144.532682 | -1134000 | 1469340 | POLYGON ((144.52879 13.18531, 144.53626 13.185... |
1 | -1399 | 1814 | 13.182124 | 144.540150 | -1133190 | 1469340 | POLYGON ((144.53626 13.18562, 144.54372 13.185... |
2 | -1398 | 1814 | 13.182436 | 144.547618 | -1132380 | 1469340 | POLYGON ((144.54372 13.18594, 144.55119 13.186... |
3 | -1397 | 1814 | 13.182747 | 144.555085 | -1131570 | 1469340 | POLYGON ((144.55119 13.18625, 144.55866 13.186... |
4 | -1396 | 1814 | 13.183058 | 144.562553 | -1130760 | 1469340 | POLYGON ((144.55866 13.18656, 144.56613 13.186... |
A DBF file is a standard database file used by dBASE, a database management system application. It organizes data into multiple records with fields stored in an array data type. DBF files are also compatible with other "xBase" database programs, which arose because of the file format's popularity.
import dbf
table = dbf.Table('poly_gc14_conus_810m_top.dbf')
# the field names do not reflect the level of geographic measurement
table.field_names
FieldnameList(['CELLX', 'CELLY', 'CLAT', 'CLONG', 'CX', 'CY'])
However, the records are in fact at the 810m grid cell level - each row is one cell.
import pandas as pd
# hmmmm, okay, so no header -
# no problem, let's see if we just use the file handler from gzip (further down)
data = pd.read_csv('/Users/sarahluw/Downloads/aggmicro2019_2019_gc14.csv.gz',
nrows=10, compression='gzip', header=1,
error_bad_lines=False)
data.head()
1745 311 1 7 4 6.03996E-003 0.00000E+000 0.00000E+000.1 \
0 1746 311 1 7 4 0.006368 0.0 0.0
1 1747 311 1 7 4 0.006731 0.0 0.0
2 1748 311 1 7 4 0.007100 0.0 0.0
3 1749 311 1 7 4 0.007324 0.0 0.0
4 1750 311 1 7 4 0.007367 0.0 0.0
5 1751 311 1 7 4 0.007427 0.0 0.0
6 1752 311 1 7 4 0.007650 0.0 0.0
7 1753 311 1 7 4 0.007895 0.0 0.0
8 1754 311 2 11 6 0.020016 0.0 0.0
9 1755 311 2 11 6 0.021184 0.0 0.0
0.00000E+000.2 0.00000E+000.3
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
5 0.0 0.0
6 0.0 0.0
7 0.0 0.0
8 0.0 0.0
9 0.0 0.0
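A note on the `header=1` call above: since the file has no header row, `header=1` actually treats the second line as the header and silently drops the first data row. A toy example of the difference, using `header=None` instead (the names below are placeholders, not the documented schema):

```python
import io
import pandas as pd

# Toy CSV with no header row, mimicking the aggregated microdata layout
raw = "1744,311,1,7,4\n1745,311,1,7,4\n1746,311,1,7,4\n"

# header=1 treats the second row as the header - data rows are lost
lossy = pd.read_csv(io.StringIO(raw), header=1)

# header=None keeps every row and lets you supply names explicitly
clean = pd.read_csv(io.StringIO(raw), header=None,
                    names=["cell", "code", "a", "b", "c"])
print(len(lossy), len(clean))  # 1 3
```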
import gzip
with gzip.open('/Users/sarahluw/Downloads/aggmicro2019_2019_gc14.csv.gz', 'rb') as f:
data = pd.read_csv(f)
# easier to read
data.head()
1744 | 311 | 1 | 7 | 4 | 5.71468E-003 | 0.00000E+000 | 0.00000E+000.1 | 0.00000E+000.2 | 0.00000E+000.3 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1745 | 311 | 1 | 7 | 4 | 0.006040 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 1746 | 311 | 1 | 7 | 4 | 0.006368 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 1747 | 311 | 1 | 7 | 4 | 0.006731 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1748 | 311 | 1 | 7 | 4 | 0.007100 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 1749 | 311 | 1 | 7 | 4 | 0.007324 | 0.0 | 0.0 | 0.0 | 0.0 |
import jaydebeapi
import pandas as pd
import pandas_access as mdb
db_path = "rsei_data3_v239.accdb"
# Listing the tables in the Microsoft Access database
for tbl in mdb.list_tables(db_path):
print(tbl)
Releases
Submissions
# Read a huge table
accumulator = []
for chunk in mdb.read_table(db_path, "Submissions", chunksize=10000):
accumulator.append(chunk)
accumulator[0].head()
DCN | SubmissionNumber | FacilityID | FacilityNumber | CAS | ChemicalNumber | MPOUse | MaxOnsite | Entire_Fac | One_Time_Release_Qty | One_Time_Release_Qty_NA | Partial_Fac | Trade_Secret_Ind | SubmissionYear | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1300140000011 | 1 | 15902CCKRN75BRI | 8527 | N982 | 595 | 111 | 04 | 1 | 0 | 0 | 0 | 0 | 2000 |
1 | 1300140000023 | 2 | 15902CCKRN75BRI | 8527 | 007439921 | 346 | 111 | 02 | 1 | 0 | 0 | 0 | 0 | 2000 |
2 | 1300140000035 | 3 | 70363GLFSL583TH | 45099 | 000108883 | 549 | 011 | 03 | 1 | NaN | 1 | 0 | 0 | 2000 |
3 | 1300140000047 | 4 | 70363GLFSL583TH | 45099 | 001330207 | 592 | 011 | 03 | 1 | NaN | 1 | 0 | 0 | 2000 |
4 | 1300140000050 | 5 | 70363GLFSL583TH | 45099 | 007440666 | 594 | 010 | 03 | 1 | NaN | 1 | 0 | 0 | 2000 |
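One thing the chunked loop above stops short of is combining the chunks into a single DataFrame; a minimal sketch, with toy chunks standing in for `mdb.read_table`'s output:

```python
import pandas as pd

# Toy chunks standing in for the chunked reads of the Submissions table
chunks = [
    pd.DataFrame({"DCN": ["1300140000011"], "SubmissionYear": [2000]}),
    pd.DataFrame({"DCN": ["1300140000023"], "SubmissionYear": [2000]}),
]

# Concatenate all chunks, discarding the per-chunk row indices
submissions = pd.concat(chunks, ignore_index=True)
print(len(submissions))  # 2
```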
db_path_2 = "rsei_data1_v239.accdb"
# Tables in the first database
for tbl in mdb.list_tables(db_path_2):
print(tbl)
Category
NAICSCode
Sictable
Chemical
Media
Facility
Offsite
db_path_3 = "rsei_data2_v239.accdb"
# Tables in the second database
for tbl in mdb.list_tables(db_path_3):
print(tbl)
Elements
from glob import glob
list_of_files = glob("RSEIv239_Public_Release_Data/*.csv")
dfs_list = [
pd.read_csv(f)
for f in list_of_files
]
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (3,24,39,43) have mixed types.Specify dtype option on import or set low_memory=False.
if (await self.run_code(code, result, async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (19,30,31) have mixed types.Specify dtype option on import or set low_memory=False.
if (await self.run_code(code, result, async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (8) have mixed types.Specify dtype option on import or set low_memory=False.
if (await self.run_code(code, result, async_=asy)):
/Users/sarahluw/.pyenv/versions/3.6.2/envs/my-virtual-env-3.6.2/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3263: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
if (await self.run_code(code, result, async_=asy)):
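The DtypeWarnings come from pandas inferring column types chunk by chunk; they can be silenced by making inference consistent, e.g. with `low_memory=False` or an explicit `dtype`. A toy illustration (the column names are made up):

```python
import io
import pandas as pd

# Toy CSV with a mixed-type column ("code" holds both numbers and text)
raw = "id,code\n1,00123\n2,abc\n"

# low_memory=False reads the file in one pass; an explicit dtype also
# preserves values like leading-zero codes that numeric inference mangles
df = pd.read_csv(io.StringIO(raw), low_memory=False, dtype={"code": str})
print(df["code"].tolist())  # ['00123', 'abc']
```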
# none of these tables contain the unit of measurement we are after,
# although they do offer ample and comprehensive results
[df.columns for df in dfs_list]
[Index(['CASNumber', 'CASStandard', 'ChemicalNumber', 'Chemical', 'MCL1988',
'MCL1989', 'MCL1990', 'MCL1991', 'MCL1992', 'MCL1993', 'MCL1994',
'MCL1995', 'MCL1996', 'MCL1997', 'MCL1998', 'MCL1999', 'MCL2000',
'MCL2001', 'MCL2002', 'MCL2003', 'MCL2004', 'MCL2005', 'MCL2006',
'MCL2007', 'MCL2008', 'MCL2009', 'MCL2010', 'MCL2011', 'MCL2012',
'MCL2013', 'MCL2014', 'MCL2015', 'MCL2016', 'MCL2017', 'MCL2018',
'MCL2019'],
dtype='object'),
Index(['OffsiteID', 'FacilityNumber', 'TRIFID', 'DropIncinerator',
'POTW_Incin', 'Name', 'Street', 'City', 'State', 'ZIPCode', 'ZIP9',
'Latitude', 'Longitude', 'GridCode', 'Country', 'X', 'Y',
'RadialDistance', 'StackHeight', 'StackVelocity', 'StackDiameter',
'StackTemperature', 'StackParameterSource', 'NAICS', 'FRSID', 'FRSYear',
'HEM3ID', 'DistanceToHEM3', 'WBANID', 'DistanceToWBAN', 'WaterReleases',
'OutfallLongitude', 'OutfallLatitude', 'NearReach', 'NearComID',
'DistanceToReach', 'AssignedReach', 'AssignedComID', 'ReachSource',
'ReachNotes', 'LatLongSource', 'LatLongYear', 'LockLL',
'CentroidAdjustment', 'Notes_on_coordinates',
'AdditionalSourcesForLocation', 'LocationConfidence', 'Foreign'],
dtype='object'),
Index(['FacilityID', 'FacilityNumber', 'FRSID', 'Latitude', 'Longitude',
'LLOverridden', 'GridCode', 'X', 'Y', 'LatLongSource', 'LLYear',
'LLNotes', 'RadialDistance', 'FacilityName', 'Street', 'City', 'County',
'State', 'ZIPCode', 'ZIP9', 'FIPS', 'STFIPS', 'DUNS', 'ParentDUNS',
'ParentName', 'StandardizedParentCompany', 'Region',
'FederalFacilityFlag', 'FederalAgencyName', 'PublicContactName',
'PublicContactPhone', 'PublicContactPhoneExtension',
'PublicContactEmail', 'StackHeight', 'StackVelocity', 'StackDiameter',
'StackTemperature', 'StackHeightSource', 'StackVelocitySource',
'StackDiameterSource', 'StackTemperatureSource', 'StackHeightNEIYear',
'StackVelocityNEIYear', 'StackDiameterNEIYear',
'StackTemperatureNEIYear', 'ChromHexPercent', 'ChromSource',
'NewIndustryFlag', 'ModeledNAICS', 'NAICSCode3Digit', 'NAICSCode4Digit',
'NAICSCode5Digit', 'NAICS1', 'NAICS2', 'NAICS3', 'NAICS4', 'NAICS5',
'NAICS6', 'SIC1', 'FinalReach', 'FinalCOMID', 'ReachSource',
'DistanceToReach', 'HEM3ID', 'DistanceToHEM3', 'LLConfirmed',
'WaterReleases', 'ModeledReleases', 'ModChromReleases'],
dtype='object'),
Index(['DCN', 'SubmissionNumber', 'FacilityID', 'FacilityNumber', 'CAS',
'ChemicalNumber', 'SubmissionYear', 'MPOUse', 'MaxOnsite', 'EntireFac',
'OneTimeReleaseQty', 'OneTimeReleaseQtyNA', 'PartialFac',
'TradeSecretInd'],
dtype='object'),
Index(['2017NAICSCode', 'LongName', 'Changed2017', 'TRIIndustrySector',
'IndustrySubsector', '4DigitNAICS', 'NewIndustry'],
dtype='object'),
Index(['ReleaseNumber', 'SubmissionNumber', 'Media', 'PoundsReleased',
'OffsiteNumber', 'TEF', 'TEFSource', 'DCN', 'TransferLocNum',
'OffsiteAmountSequence', 'WaterSequenceNum'],
dtype='object'),
Index(['ElementNumber', 'ReleaseNumber', 'PoundsPT', 'ScoreCategory', 'Score',
'Population', 'ScoreA', 'PopA', 'ScoreB', 'PopB', 'ScoreC', 'PopC',
'ScoreD', 'PopD', 'ScoreE', 'PopE', 'NCScore', 'CScore', 'Hazard',
'HazardC', 'HazardNC'],
dtype='object'),
Index(['Media', 'MediaText', 'TRIEnvironmentalMedium', 'Itw', 'Otw', 'Mtw',
'MediaCode', 'TRICode', 'TRICategory', 'LongDescription',
'AggDescription'],
dtype='object'),
Index(['ScoreCategory', 'Category', 'Model', 'InhaleTox'], dtype='object'),
Index(['CASNumber', 'CASStandard', 'ChemicalNumber', 'TRIChemID',
'MetalCombinedChemNum', 'Category', 'SortCAS', 'SortName',
'FullChemicalName', 'Chemical', 'Added', 'ToxicitySource', 'RfCInhale',
'RfCUF', 'RfCMF', 'RfCConf', 'RfCSource', 'RfCListingDate',
'RfCToxWeight', 'RFDOral', 'RfDUF', 'RfDMF', 'RfDConf',
'RfDListingDate', 'RfDSource', 'RfDToxWeight', 'UnitRiskInhale',
'QSTAROral', 'WOE', 'UnitRiskListingDate', 'UnitRiskSource',
'IURToxWeight', 'QStarListingDate', 'QStarSource', 'OSFToxWeight',
'WOEListingDate', 'WOESource', 'ITW', 'OTW', 'ToxicityClassOral',
'ToxicityClassInhale', 'ToxicityCategory', 'AirDecay', 'Koc',
'H2ODecay', 'LOGKow', 'Kd', 'WaterSolubility', 'POTWPartitionRemoval',
'POTWPartitionSludge', 'POTWPartitionVolat', 'POTWPartitionBiod',
'IncineratorDRE', 'BCF', 'Henrys', 'MCL', 'MolecularWeight', 'HAPFlag',
'CAAFlag', 'PriorityPollutantFlag', 'SDWAFlag', 'CERCLAFlag',
'OSHACarcinogens', 'ExpansionFlag', 'Core88ChemicalFlag',
'Core95ChemicalFlag', 'Core98ChemicalFlag', 'Core00ChemicalFlag',
'Core01ChemicalFlag', 'HPVFlag', 'HPVChallenge', 'PBTFlag', 'Metal',
'HasTox', 'MaxTW', 'MaxNC', 'MaxC', 'Notes'],
dtype='object')]
Note that data values have been read in as column names below - the microdata files have no header row.
# Lazily reading files into single Dask DataFrame (~21 GB)
import dask.dataframe as dd
ddf_direct = dd.read_csv(
"/Users/sarahluw/Downloads/micro2019_2019.csv.gz",
blocksize=None,
compression='gzip',
sample=100
)
# first one is the GridCode
# second X-coordinate of grid cell
# third Y-Coordinate of grid cell
ddf_direct.columns
Index(['44', '-58', '-4', '6457356', '519', '16', '1', '2.17067E-05',
'7.59735E-05', '0.00000E+00', '0.00000E+00.1', '0.00000E+00.2',
'0.00000E+00.3'],
dtype='object')
import pandas as pd
url = "http://abt-rsei.s3.amazonaws.com/microdata2019/census_agg/CensusMicroTracts2019_2019_aggregated.zip"
df = pd.read_csv(url, header=0)
columns = [
"GEOID10",
"NUMFACS",
"NUMRELEASES",
"NUMCHEMS",
"TOXCONC",
"SCORE",
"POP",
"CSCORE",
"NCSCORE"
]
# the column names are actually the first census tract's record at this point;
# the subsequent steps turn the current column names back into the first row
# of the DataFrame
df.columns
Index(['36047023200', '193', '904', '96', '4.85177E+002', '7.59714E+002',
'6.19922E+003', '6.10315E+002', '1.54848E+002'],
dtype='object')
first_row = dict(zip(columns, df.columns))
# add first row to dataframe
row_df = pd.DataFrame(first_row, index=[0])
df.columns = columns
# simply concatenate both dataframes
df = pd.concat([row_df, df]).reset_index(drop = True)
df.head(5)
GEOID10 | NUMFACS | NUMRELEASES | NUMCHEMS | TOXCONC | SCORE | POP | CSCORE | NCSCORE | |
---|---|---|---|---|---|---|---|---|---|
0 | 36047023200 | 193 | 904 | 96 | 4.85177E+002 | 7.59714E+002 | 6.19922E+003 | 6.10315E+002 | 1.54848E+002 |
1 | 36029007202 | 84 | 376 | 81 | 227.366 | 79.5029 | 1601.87 | 63.0466 | 14.8128 |
2 | 42003429201 | 144 | 1624 | 251 | 2705.16 | 2553.95 | 4527.6 | 2447.72 | 113.751 |
3 | 17201004100 | 80 | 365 | 57 | 3626.62 | 3984.97 | 5766.72 | 3969.13 | 27.0562 |
4 | 47065011203 | 79 | 358 | 79 | 1133.4 | 1885.51 | 7820.66 | 1410.31 | 447.775 |
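For reference, the concat workaround above can be avoided by passing `header=None` and `names=` when reading, so no tract is consumed as a header. A sketch on toy data mimicking the file's layout:

```python
import io
import pandas as pd

columns = ["GEOID10", "NUMFACS", "NUMRELEASES", "NUMCHEMS", "TOXCONC",
           "SCORE", "POP", "CSCORE", "NCSCORE"]

# Two toy records in the aggregated census-tract file's layout
raw = ("36047023200,193,904,96,4.85177E+002,7.59714E+002,"
       "6.19922E+003,6.10315E+002,1.54848E+002\n"
       "36029007202,84,376,81,227.366,79.5029,1601.87,63.0466,14.8128\n")

# header=None keeps the first record as data; names= supplies the schema
df = pd.read_csv(io.StringIO(raw), header=None, names=columns)
print(len(df))  # 2
```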
Description: RSEI's data is available crosswalked to census tracts:
https://www.epa.gov/rsei/rsei-grid-and-locational-information
This could be a valuable resource for information on pollution sources. It is updated on a yearly basis with a deep historical record, and because the data collection is legislatively mandated, it's very comprehensive.