joons5834 / weather-forecast-accuracy


data pipeline design: EL, ELT or ETL? #24

Open joons5834 opened 3 years ago

joons5834 commented 3 years ago

Overview of EL, ELT, and ETL:

https://coursera.org/share/98cbe98dfaef4d453c501ef5604855b9

EL

EL is extract and load. It applies when data can be imported as-is into the target system, for example importing from a database where the source and the target share the same schema. Only use EL if the data are already clean and correct.
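A minimal sketch of the EL case with the BigQuery Python client, assuming the source is already clean newline-delimited JSON sitting in GCS; the bucket, dataset, and table names below are placeholders, not this project's actual resources.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical names; replace with the project's actual resources.
table_id = "my-project.weather.raw_forecasts"
uri = "gs://my-bucket/forecasts/*.json"

# No transformation step: the files are loaded as-is into the target table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # schema inferred, since source and target match
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
```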

ELT

ELT (extract, load, and transform) loads raw data directly into the target and transforms it wherever it is needed. For example, you might expose the raw data through a view that lets a user choose between seeing all transactions or only reconciled ones. ELT is a common choice when you don't yet know what transformations are needed to make the data usable. It works as long as the required transformation can be expressed in SQL.
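A sketch of that view-on-raw-data idea, again via the BigQuery client; the `transactions` tables and the `reconciled` column are the example's hypothetical names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The transformation lives in SQL and is applied at query time over the raw
# table. All table and column names below are illustrative only.
sql = """
CREATE OR REPLACE VIEW `my-project.finance.reconciled_transactions` AS
SELECT *
FROM `my-project.finance.raw_transactions`
WHERE reconciled = TRUE
"""
client.query(sql).result()
```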

ETL

When the amount of transformation you need is significant, you might want to bring in heavier machinery. That's ETL. Extract, transform, and load is a data integration process in which transformation takes place in an intermediate service before the data is loaded into the target. For example, the data might be transformed in Cloud Dataflow before being loaded into BigQuery. If the transformations cannot be expressed in SQL, or are too complex to do in SQL, you might want to transform the data before loading it into BigQuery.
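A sketch of that Dataflow-style ETL path with Apache Beam's Python SDK; the GCS path, the schema, and the `clean` transform are assumptions for illustration, not this project's actual pipeline.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean(record: dict) -> dict:
    """Hypothetical transformation that would be awkward to express in SQL."""
    return {
        "station": record["station"],
        "temperature_c": (record["temperature_f"] - 32) * 5 / 9,
    }


# Placeholder project, bucket, and table names.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/forecasts/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Transform" >> beam.Map(clean)  # happens before the load step
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:weather.forecasts",
            schema="station:STRING,temperature_c:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```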

joons5834 commented 3 years ago

Possible solutions

  1. ELT

    • load the raw data into a BigQuery table with little or no transformation (ELT)
    • build intermediate views/materialized views/tables with SQL
    • create a summary table with third-party tools (scikit-learn, Spark) to compute complicated metrics that SQL barely supports.
  2. ETL

    • the raw data stays as-is in GCS
    • build intermediate views/materialized views/tables through an External Table (see the sketch after this list)
    • create a summary table with third-party tools (scikit-learn, Spark) to compute complicated metrics.
    • the performance of an External Table may be inferior to querying a native BigQuery table
    • the JSON source data must be newline-delimited for an External Table to read it.
  3. ETL

    • read and process the raw data in GCS to build an intermediate table in BigQuery
    • create a summary table with third-party tools (scikit-learn, Spark) to compute complicated metrics that SQL barely supports.
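For option 2, a sketch of defining an External Table over the GCS files with the BigQuery Python client; the URIs and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# External table: BigQuery queries the files in place; nothing is loaded.
# Names and URIs below are placeholders.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://my-bucket/forecasts/*.json"]
external_config.autodetect = True

table = bigquery.Table("my-project.weather.raw_forecasts_ext")
table.external_data_configuration = external_config
client.create_table(table)
```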
joons5834 commented 3 years ago

Testing procedure

  1. create a table (using a Python client script)
  2. load test data
  3. see if it's possible to populate the intermediate table with SQL (sketched below)
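A sketch of that procedure end to end; the dataset, schema, and test rows are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1. create a table (names and schema are illustrative)
table = bigquery.Table(
    "my-project.weather.raw_test",
    schema=[
        bigquery.SchemaField("station", "STRING"),
        bigquery.SchemaField("temperature_f", "FLOAT"),
    ],
)
client.create_table(table, exists_ok=True)

# 2. load test data
rows = [{"station": "KSEA", "temperature_f": 59.0}]
errors = client.insert_rows_json(table, rows)
assert not errors, errors

# 3. check whether SQL alone can populate the intermediate table
client.query("""
CREATE OR REPLACE TABLE `my-project.weather.intermediate` AS
SELECT station, (temperature_f - 32) * 5 / 9 AS temperature_c
FROM `my-project.weather.raw_test`
""").result()
```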
joons5834 commented 3 years ago

The project does not need streaming. The source data needs a simple transformation (it is not newline-delimited JSON).
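Given that conclusion, the needed preprocessing might look like the sketch below: turning a single JSON array file into newline-delimited JSON before loading. The file names are placeholders.

```python
import json

# Hypothetical file names. The source is assumed to be one JSON array,
# which BigQuery's JSON loader does not accept; each element must become
# its own line (newline-delimited JSON).
with open("raw.json") as src, open("raw.ndjson", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")
```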