Closed lucasmbrown-usds closed 2 years ago
I decided to delete this code for now because it's very simple and not run anywhere. We can easily add it back when we're ready to integrate it.
The code right before deletion was:
# TODO: Decide what to do with this code.
# See https://github.com/usds/justice40-tool/issues/1186.
def validate_output(self) -> None:
    """Checks that the output of the ETL process adheres to the contract
    expected by the score module

    Contract conditions:
    - Output is saved as usa.csv at the path specified by self.OUTPUT_PATH
    - The output csv has a column named GEOID10 which stores each of the
      Census block group FIPS codes in data/census/csv/usa.csv
    - The output csv has a column named GEOID10_TRACT which stores each of
      Census tract FIPS codes associated with each Census block group
    - The output csv has each of the columns expected by the score and the
      name and dtype of those columns match the format expected by score
    """
    # read in output file
    # and check that GEOID cols are present
    output_file_path = self._get_output_file_path()
    if not output_file_path.exists():
        raise ValueError(f"No file found at {output_file_path}")
    df_output = pd.read_csv(
        output_file_path,
        dtype={
            # Not all outputs will have both a Census Block Group ID and a
            # Tract ID, but these will be ignored if they're not present.
            self.GEOID_FIELD_NAME: "string",
            self.GEOID_TRACT_FIELD_NAME: "string",
        },
    )

    # check that the score columns are in the output
    for col in self.COLUMNS_TO_KEEP:
        assert col in df_output.columns, f"{col} is missing from output"
Completed as part of #1075.
Description
There's a method validate_output on the base ETL class (in data/data-pipeline/data_pipeline/etl/base.py) that checks that the output CSV exists, loads it as a data frame, and checks that all columns in self.COLUMNS_TO_KEEP are present.
This code is a bit of an artifact from an earlier time, and it is currently not called by any code anywhere.
Should we run this every time an ETL is run? Or is the time cost of loading the large file back into memory not worth it?
We should decide what to do with this code. If we don't want to use it, we can delete it.
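On the time-cost question, one possible middle ground is to check only the CSV header instead of loading the whole file back into memory. The sketch below is an illustration, not code from the repository: the function name missing_output_columns and its parameters are hypothetical, and it assumes the output is a UTF-8 CSV whose first line holds the column names. It covers the "file exists" and "columns are present" parts of the contract, but not the dtype condition.

```python
import csv
from pathlib import Path
from typing import Iterable, List


def missing_output_columns(
    output_file_path: Path, expected_columns: Iterable[str]
) -> List[str]:
    """Return the expected columns that are absent from the CSV header.

    Hypothetical helper: only the first line of the file is read, so the
    cost does not grow with the number of data rows.
    """
    if not output_file_path.exists():
        raise ValueError(f"No file found at {output_file_path}")

    with output_file_path.open(newline="", encoding="utf-8") as csv_file:
        # An empty file yields an empty header rather than raising.
        header = next(csv.reader(csv_file), [])

    return [col for col in expected_columns if col not in header]
```

An ETL class could call this at the end of its run with its own output path and self.COLUMNS_TO_KEEP, and fail if the returned list is non-empty. Whether a header-only check is a strong enough guarantee for the score module is a separate decision.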