usds / justice40-tool

A tool to identify disadvantaged communities due to environmental, socioeconomic and health burdens
https://screeningtool.geoplatform.gov/
Creative Commons Zero v1.0 Universal

As a data developer, I want to refactor the method `validate_output` on the base ETL class #1186

Closed: lucasmbrown-usds closed this issue 2 years ago

lucasmbrown-usds commented 2 years ago

Description

There's a method `validate_output` on the base ETL class (in `data/data-pipeline/data_pipeline/etl/base.py`) that verifies the output CSV exists, loads it as a data frame, and checks that all columns in `self.COLUMNS_TO_KEEP` are present.

This code is an artifact from an earlier iteration of the pipeline, and it is currently not called anywhere in the codebase.

Should we run this every time an ETL is run? Or is the time cost of loading the large file back into memory not worth it?

We should decide what to do with this code. If we don't want to use it, we can delete it.
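
For what it's worth, the column check itself doesn't have to load the whole file back into memory. A minimal sketch (standalone illustration, not existing pipeline code) that uses pandas to parse only the CSV header row:

from pathlib import Path
from typing import List

import pandas as pd


def validate_output_columns(output_file_path: Path, columns_to_keep: List[str]) -> None:
    """Check the column contract by reading only the CSV header.

    `nrows=0` makes pandas parse just the header row, so even a very
    large usa.csv costs almost nothing to validate.
    """
    if not output_file_path.exists():
        raise ValueError(f"No file found at {output_file_path}")

    header = pd.read_csv(output_file_path, nrows=0)
    missing = [col for col in columns_to_keep if col not in header.columns]
    if missing:
        raise ValueError(f"Columns missing from output: {missing}")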

lucasmbrown-usds commented 2 years ago

I decided to delete this code for now because it's very simple and not run anywhere. We can easily add it back when we're ready to integrate it.

The code right before deletion was:


    # TODO: Decide what to do with this code.
    #  See https://github.com/usds/justice40-tool/issues/1186.
    def validate_output(self) -> None:
        """Checks that the output of the ETL process adheres to the contract
        expected by the score module

        Contract conditions:
        - Output is saved as usa.csv at the path specified by self.OUTPUT_PATH
        - The output csv has a column named GEOID10 which stores each of the
          Census block group FIPS codes in data/census/csv/usa.csv
        - The output csv has a column named GEOID10_TRACT which stores each of
          the Census tract FIPS codes associated with each Census block group
        - The output csv has each of the columns expected by the score and the
          name and dtype of those columns match the format expected by the score
        """
        # read in output file
        # and check that GEOID cols are present
        output_file_path = self._get_output_file_path()
        if not output_file_path.exists():
            raise ValueError(f"No file found at {output_file_path}")

        df_output = pd.read_csv(
            output_file_path,
            dtype={
                # Not all outputs will have both a Census Block Group ID and a
                # Tract ID, but these will be ignored if they're not present.
                self.GEOID_FIELD_NAME: "string",
                self.GEOID_TRACT_FIELD_NAME: "string",
            },
        )

        # check that the score columns are in the output
        for col in self.COLUMNS_TO_KEEP:
            assert col in df_output.columns, f"{col} is missing from output"
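
If we later want to add it back, one option is an opt-in flag at the runner level, so the default path stays fast. A hypothetical sketch (the runner function and the extract/transform/load call sequence here are assumptions, not the actual pipeline code):

# Hypothetical wiring sketch: the extract/transform/load method names
# mirror the base ETL class, but this runner function is an assumption,
# not code that exists in the repo.
def run_etl(etl_instance, validate: bool = False) -> None:
    etl_instance.extract()
    etl_instance.transform()
    etl_instance.load()
    # Opt-in validation keeps the default run fast while letting CI
    # (or a developer debugging an ETL) enforce the output contract.
    if validate:
        etl_instance.validate_output()
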
lucasmbrown-usds commented 2 years ago

Completed as part of #1075.