nimbly-dev / nyctripdata_project

Project to learn Data Engineering from: https://github.com/DataTalksClub/data-engineering-zoomcamp

DATAENG-7: Reduce number of temp directories (spark_psql_stage_to_production) #7

Open nimbly-dev opened 1 week ago

nimbly-dev commented 1 week ago

Reduce the number of temp directories used by the spark_psql_stage_to_production pipeline.

Currently:

  1. pre_lakehouse_to_psql_production
  2. pre_stage_to_prod_psql

Combine #1 and #2:

  1. pre_combined_data_production
  2. pre_combined_clean_data_production

Suggestion:

Remove the pre_combined_clean_data_production directory and persist the combined data to pre_combined_data_production instead. When new changes come in, overwrite it.
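
A minimal sketch of the suggestion, not the pipeline's actual code. It assumes the combined data is a PySpark DataFrame and is stored as Parquet. The base path `TEMP_BASE` and the helper names `combined_path` and `persist_combined` are hypothetical; only the directory name `pre_combined_data_production` comes from this issue.

```python
import os

# Hypothetical base location; the issue does not say where temp directories live.
TEMP_BASE = "/tmp/spark_psql_stage_to_production"

# The single directory that would remain after removing
# pre_combined_clean_data_production.
COMBINED_DIR = "pre_combined_data_production"


def combined_path(base: str = TEMP_BASE) -> str:
    """Return the path of the one remaining combined temp directory."""
    return os.path.join(base, COMBINED_DIR)


def persist_combined(clean_df, base: str = TEMP_BASE) -> str:
    """Write the combined (already cleaned) DataFrame to the combined directory.

    `clean_df` is expected to be a pyspark.sql.DataFrame. Parquet is an
    assumed storage format. mode("overwrite") replaces whatever an earlier
    run left in the directory, so no second "clean" directory is needed.
    """
    path = combined_path(base)
    clean_df.write.mode("overwrite").parquet(path)
    return path
```

One thing to watch: Spark generally refuses to overwrite a file-based path that the same job is still reading from, so the cleaning step would need to run before the first write to pre_combined_data_production rather than read from that directory and write back to it.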