musa-5090-spring-2024 / musa5090s24-team3-cama

Create a pipeline to extract and load the OPA Properties dataset into BigQuery #1

Open · mjumbewu opened this issue 3 months ago

mjumbewu commented 3 months ago

All ingest processes in this project will follow this general pattern:

  1. Fetch data and store it in a folder in the musa509s24_${team}_raw_data bucket
  2. Convert the data to a BigQuery-friendly external table format (e.g. JSON-L) and store it in a folder in the musa509s24_${team}_prepared_data bucket (a sketch of steps 1 and 2 follows this list)
  3. Create (or replace) a BigQuery external table in the source dataset based on the data in gs://musa509s24_${team}_prepared_data (see the DDL sketch after this list)
  4. Create (or replace) a regular BigQuery table in the core dataset that adds at least one column named property_id. E.g.:
    CREATE OR REPLACE TABLE core.opa_properties
    AS (
        SELECT
            parcel_number AS property_id,
            *
        FROM source.opa_properties
    )
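
A minimal sketch of steps 1 and 2, assuming the OPA Properties data is fetched as GeoJSON from the City's Carto API; the source URL, object paths, and team name below are illustrative assumptions, not requirements from this issue:

    import json

    import requests
    from google.cloud import storage

    # Assumed source URL; point this at wherever your team pulls OPA Properties from.
    OPA_URL = ('https://phl.carto.com/api/v2/sql'
               '?q=SELECT+*+FROM+opa_properties_public&format=GeoJSON')

    RAW_BUCKET = 'musa509s24_team3_raw_data'            # step 1 destination
    PREPARED_BUCKET = 'musa509s24_team3_prepared_data'  # step 2 destination

    def extract_and_prepare():
        client = storage.Client()

        # Step 1: fetch the data and store it, unmodified, in the raw bucket.
        response = requests.get(OPA_URL)
        response.raise_for_status()
        raw_blob = client.bucket(RAW_BUCKET).blob('opa_properties/opa_properties.geojson')
        raw_blob.upload_from_string(response.content, content_type='application/json')

        # Step 2: flatten each GeoJSON feature into one JSON object per line
        # (JSON-L), which BigQuery reads as NEWLINE_DELIMITED_JSON.
        lines = []
        for feature in response.json()['features']:
            row = dict(feature['properties'])
            row['geometry'] = json.dumps(feature['geometry'])  # keep geometry as a JSON string
            lines.append(json.dumps(row))

        prepared_blob = client.bucket(PREPARED_BUCKET).blob('opa_properties/opa_properties.jsonl')
        prepared_blob.upload_from_string('\n'.join(lines), content_type='application/json')

The full OPA dataset is large, so a production version would likely stream to a temporary file rather than hold everything in memory; the in-memory approach here just keeps the two steps easy to see.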
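
Step 3 can then be a single DDL statement, e.g. in source_phl_opa_properties.sql (the dataset, table, and GCS path below follow the naming in the steps above but are assumptions):

    CREATE OR REPLACE EXTERNAL TABLE source.opa_properties
    OPTIONS (
        format = 'NEWLINE_DELIMITED_JSON',
        uris = ['gs://musa509s24_team3_prepared_data/opa_properties/*.jsonl']
    )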

Each of your SQL commands should be stored in its own file (e.g. source_phl_opa_properties.sql and core_phl_opa_properties.sql), but they should be run from a Cloud Function as part of your pipeline. For an example, see the run_sql Cloud Function code at https://github.com/musa-5090-spring-2024/course-info/tree/main/week08/explore_phila_data/run_sql
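
A sketch of that pattern, loosely modeled on the linked run_sql function (the file names and layout here are assumptions; see the course repo for the authoritative version):

    import pathlib

    import functions_framework
    from google.cloud import bigquery

    # Run the source script before the core script, since the core table
    # selects from the external table that the source script creates.
    SQL_FILES = ['source_phl_opa_properties.sql', 'core_phl_opa_properties.sql']

    @functions_framework.http
    def run_sql(request):
        """Run each SQL script bundled with this function against BigQuery."""
        client = bigquery.Client()
        sql_dir = pathlib.Path(__file__).parent / 'sql'

        for name in SQL_FILES:
            query = (sql_dir / name).read_text()
            client.query(query).result()  # .result() blocks until the job finishes

        return 'SQL scripts ran successfully.'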

Acceptance Criteria:

YinanLi-15 commented 3 months ago

Finished