
Roles

Philadelphia Computer-Assisted Mass Appraisal (CAMA) Project

The following is a description of the various infrastructural components of the project. The names and schemas used here are important, as each team will be working on different components of the project. The names and schemas are designed to be consistent across teams, so that the various components can be integrated together. If you have any questions about the names or schemas, please ask!

The overview slides for the project are available here.

Cloud Storage Buckets

Since bucket names must be globally unique, each name contains team<N>, which should be replaced with the actual team number (e.g., team1 for the orange team and team2 for the purple team). The buckets are organized as follows:

| Bucket name | Description of contents |
| --- | --- |
| musa5090s24_team<N>_raw_data | Raw data from the sources. |
| musa5090s24_team<N>_prepared_data | Data prepared for external tables in BigQuery. |
| musa5090s24_team<N>_temp_data | Temporary data used during processing. Files stored here will be deleted after a few days. |
| musa5090s24_team<N>_public | Files that are accessible to the public. These files are primarily intended to be consumed by the assessment review and property information apps. |
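For reference, here is a minimal sketch (assuming Python and the google-cloud-storage client, with a placeholder team number) of how a pipeline task might resolve and check these bucket names:

```python
from google.cloud import storage  # pip install google-cloud-storage

TEAM = 3  # placeholder: replace with your actual team number

# Bucket names follow the musa5090s24_team<N>_* convention described above.
BUCKETS = {
    "raw": f"musa5090s24_team{TEAM}_raw_data",
    "prepared": f"musa5090s24_team{TEAM}_prepared_data",
    "temp": f"musa5090s24_team{TEAM}_temp_data",
    "public": f"musa5090s24_team{TEAM}_public",
}

client = storage.Client()
for role, name in BUCKETS.items():
    bucket = client.bucket(name)
    print(f"{role}: {name} exists -> {bucket.exists()}")
```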

Raw Data

In the musa5090s24_team<N>_raw_data bucket, there are three folders. Each folder may contain one or more files, depending on the type of data loaded (for example, a shapefile may be stored as multiple files). The folders are as follows:

| Folder | Contents |
| --- | --- |
| /opa_properties/ | Contains the "properties" data downloaded from the Philadelphia Properties and Assessment History dataset on OpenDataPhilly. |
| /opa_assessments/ | Contains the "property assessment history" data downloaded from the Philadelphia Properties and Assessment History dataset on OpenDataPhilly. |
| /pwd_parcels/ | Contains the "PWD parcels" data downloaded from the PWD Stormwater Billing Parcels dataset on OpenDataPhilly. |

Prepared Data

In the musa5090s24_team<N>_prepared_data bucket, there are three folders. Each should contain a single file named data.jsonl. The folders are as follows:

| Folder | Contents |
| --- | --- |
| /opa_properties/ | Contains the "properties" data prepared for external tables in BigQuery. |
| /opa_assessments/ | Contains the "property assessment history" data prepared for external tables in BigQuery. |
| /pwd_parcels/ | Contains the "PWD parcels" data prepared for external tables in BigQuery. |

Public Files

The musa5090s24_team<N>_public bucket contains files that are formatted to be used in the assessment review dashboard. These files are accessible via public URLs. Files used in this way (as artifacts for a web application) are often called "assets".

| Folder | Contents |
| --- | --- |
| /tiles/ | Map tiles for the assessment reviewer. |
| /configs/ | Configuration files for the charts and other parts of the assessment reviewer interface. |

BigQuery

Datasets

| Dataset | Description |
| --- | --- |
| source | External tables backed by prepared source data in Cloud Storage. |
| core | Data that is ready to be used for analysis. For the most part, the tables here are just copies of the external tables. |
| derived | Data that has been derived from core data. Outputs from analyses or models go here. |
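As an illustration, a minimal sketch (assuming Python and the google-cloud-bigquery client; the project ID and location are placeholders) of creating the three datasets:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

PROJECT = "musa5090s24-team3"  # placeholder: replace with your GCP project ID

client = bigquery.Client(project=PROJECT)

# One dataset each for external source tables, analysis-ready copies, and derived outputs.
for name in ("source", "core", "derived"):
    dataset = bigquery.Dataset(f"{PROJECT}.{name}")
    dataset.location = "US"  # assumption: keep all datasets in one location
    client.create_dataset(dataset, exists_ok=True)
    print(f"Ensured dataset {PROJECT}.{name}")
```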

Tables

Source Tables

The tables in source are external tables. The data is stored in JSONL (newline-delimited JSON) files in the musa5090s24_team<N>_prepared_data Cloud Storage bucket.
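A minimal sketch of how one of these external tables could be defined with the BigQuery Python client (the project ID and team number are placeholders; schema autodetection is an assumption):

```python
from google.cloud import bigquery

PROJECT = "musa5090s24-team3"  # placeholder GCP project ID
TEAM = 3                       # placeholder team number

client = bigquery.Client(project=PROJECT)

# External table over the prepared JSONL file for OPA properties.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = [
    f"gs://musa5090s24_team{TEAM}_prepared_data/opa_properties/data.jsonl"
]
external_config.autodetect = True  # assumption: let BigQuery infer the schema

table = bigquery.Table(f"{PROJECT}.source.opa_properties")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```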

Core Tables

There's a corresponding table in core for each table in source. Even though external tables are convenient for getting data into BigQuery, they're not the most efficient to query. So, we copy the data into a table in core and query from there.

In addition to the fields from the raw tables, each of the core tables will have a property_id field (derived from the OPA or BRT number) that can be used as the unique identifier for a property across the tables.
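For example, here is a hedged sketch of copying a source table into core while adding the property_id field (the parcel_number column name and the zero-padding are assumptions about the OPA data; adjust to the actual field):

```python
from google.cloud import bigquery

client = bigquery.Client(project="musa5090s24-team3")  # placeholder project ID

# Rebuild core.opa_properties from the external table, adding a property_id
# derived from the OPA number. The source column name is an assumption.
query = """
CREATE OR REPLACE TABLE core.opa_properties AS
SELECT
  LPAD(CAST(parcel_number AS STRING), 9, '0') AS property_id,  -- assumed OPA number field
  *
FROM source.opa_properties
"""
client.query(query).result()
```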

Derived Tables

The derived schema contains all-new tables with data based on analyses and predictions. Below, each table is listed with the minimal fields to include. The tables may have more fields than these (e.g., for current_assessments we may choose to store confidence interval data), but they will definitely have at least the listed fields.

Cloud Functions

Extracting

Each of these tasks downloads data from some source and stores it in musa5090s24_team<N>_raw_data.
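As an illustration only, a sketch of what one of these extraction tasks might look like as a Python Cloud Function (the download URL, function name, and file name are hypothetical, not the team's actual implementation):

```python
import requests
import functions_framework
from google.cloud import storage

# Hypothetical source URL; the real OpenDataPhilly export URL may differ.
OPA_PROPERTIES_URL = "https://opendata-downloads.s3.amazonaws.com/opa_properties_public.csv"
RAW_BUCKET = "musa5090s24_team3_raw_data"  # placeholder team number


@functions_framework.http
def extract_opa_properties(request):
    """Download the OPA properties export and store it in the raw data bucket."""
    response = requests.get(OPA_PROPERTIES_URL, timeout=300)
    response.raise_for_status()

    blob = storage.Client().bucket(RAW_BUCKET).blob("opa_properties/opa_properties.csv")
    blob.upload_from_string(response.content, content_type="text/csv")
    return "Stored gs://" + RAW_BUCKET + "/opa_properties/opa_properties.csv"
```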

Preparing (i.e. little-t transforming)

Each of these tasks reads raw data stored in Cloud Storage and converts it into a clean data.jsonl file in musa5090s24_team<N>_prepared_data.
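A hedged sketch of one such preparation step, assuming the raw file is a CSV and using pandas (file names and the column cleanup are illustrative):

```python
import pandas as pd
from google.cloud import storage

TEAM = 3  # placeholder team number
RAW_URI = f"gs://musa5090s24_team{TEAM}_raw_data/opa_properties/opa_properties.csv"
PREPARED_BUCKET = f"musa5090s24_team{TEAM}_prepared_data"


def prepare_opa_properties():
    """Read the raw CSV and rewrite it as newline-delimited JSON for BigQuery."""
    # pandas can read gs:// paths directly when gcsfs is installed.
    df = pd.read_csv(RAW_URI, low_memory=False)
    df.columns = [c.strip().lower() for c in df.columns]  # illustrative cleanup

    jsonl = df.to_json(orient="records", lines=True)
    blob = storage.Client().bucket(PREPARED_BUCKET).blob("opa_properties/data.jsonl")
    blob.upload_from_string(jsonl, content_type="application/json")
```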

Loading

Each of these creates (or replaces) an external table in the source dataset in BigQuery, and creates or replaces a copy of that data in the core dataset.

Transforming (big-T)

Each of these tasks reads data from the core dataset, performs some analysis or aggregation, and writes the results to the derived dataset.
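For instance, a sketch of a big-T transform that aggregates core data into the derived dataset (the table and column names here, such as derived.assessment_inputs and total_livable_area, are assumptions for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client(project="musa5090s24-team3")  # placeholder project ID

# Hypothetical aggregation: build a per-property model-input table from core data.
query = """
CREATE OR REPLACE TABLE derived.assessment_inputs AS
SELECT
  p.property_id,
  p.total_livable_area,        -- assumed column names from the OPA data
  p.year_built,
  AVG(a.market_value) AS avg_historical_assessment
FROM core.opa_properties AS p
LEFT JOIN core.opa_assessments AS a USING (property_id)
GROUP BY p.property_id, p.total_livable_area, p.year_built
"""
client.query(query).result()
```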

Generating assets

Cloud Run jobs

To run the assessment prediction across all the properties, we use a Cloud Run job named predict-current-assessments. The script reads the derived.assessment_inputs table and writes the results to the derived.current_assessments table.
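A rough sketch of what such a prediction script might do (the "model" below is a stand-in; the actual model, feature set, and output columns are defined by the team):

```python
from google.cloud import bigquery

client = bigquery.Client(project="musa5090s24-team3")  # placeholder project ID

# Load model inputs; column names are assumptions for illustration.
inputs = client.query("SELECT * FROM derived.assessment_inputs").to_dataframe()

# Stand-in "model": the real job would load and apply a trained model instead.
inputs["predicted_value"] = inputs["avg_historical_assessment"].fillna(0) * 1.03

results = inputs[["property_id", "predicted_value"]]
client.load_table_from_dataframe(
    results,
    "derived.current_assessments",
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
).result()
```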

To create the vector tiles for the assessment review dashboard, we use a Cloud Run job named generate-property-map-tiles. The script downloads the property_tile_info.geojson file from the musa5090s24_team<N>_temp_data bucket, generates vector tiles, and writes the results to the /tiles/ folder in the musa5090s24_team<N>_public bucket.
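A sketch of that tile-generation step, assuming tippecanoe is available in the job's container image (the zoom range and flags are illustrative):

```python
import subprocess
from pathlib import Path
from google.cloud import storage

TEAM = 3  # placeholder team number
client = storage.Client()

# Download the prepared GeoJSON from the temp bucket.
src = Path("/tmp/property_tile_info.geojson")
client.bucket(f"musa5090s24_team{TEAM}_temp_data").blob(
    "property_tile_info.geojson"
).download_to_filename(str(src))

# Generate a directory of vector tiles; zoom range is an assumption.
tile_dir = Path("/tmp/tiles")
subprocess.run(
    ["tippecanoe", "--output-to-directory", str(tile_dir), "--no-tile-compression",
     "--minimum-zoom", "10", "--maximum-zoom", "16", str(src)],
    check=True,
)

# Upload every generated tile under the public bucket's /tiles/ prefix.
public = client.bucket(f"musa5090s24_team{TEAM}_public")
for path in tile_dir.rglob("*"):
    if path.is_file():
        public.blob(f"tiles/{path.relative_to(tile_dir)}").upload_from_filename(str(path))
```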

Workflows

For extracting, loading, transforming, and predicting on data, there is a single workflow named data-pipeline. This workflow is triggered by a Cloud Scheduler job that runs once per week.
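For ad-hoc runs outside the weekly schedule, a sketch (assuming the google-cloud-workflows client; the project ID and region are placeholders) of triggering the workflow manually:

```python
from google.cloud.workflows import executions_v1

# Placeholder project ID and region.
PARENT = "projects/musa5090s24-team3/locations/us-east4/workflows/data-pipeline"

client = executions_v1.ExecutionsClient()
execution = client.create_execution(
    request={"parent": PARENT, "execution": {"argument": "{}"}}
)
print("Started execution:", execution.name)
```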

Service Accounts

In each project, a service account named data-pipeline-user was created to provide necessary access to different GCP services. The following roles are assigned to the service account: