nimbly-dev / nyctripdata_project

Project to learn Data Engineering from: https://github.com/DataTalksClub/data-engineering-zoomcamp

DATAENG-6: Optimize spark_get_tripdata_from_psql codeblock #6

Open nimbly-dev opened 1 week ago

nimbly-dev commented 1 week ago

The spark_get_tripdata_from_psql code block is currently inefficient and takes too long to run. Suggestions:

  1. Instead of selecting from the whole partitioned table, use FROM {partition_name} to query the target partition directly. This takes advantage of the partitions that already exist on the table.
  2. Set the partition count dynamically, based on the number of workers and the cores each one has.
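
The two suggestions above could look roughly like this in PySpark. This is a minimal sketch, not the project's actual code: the `<base>_<YYYY>_<MM>` partition naming, the helper names, the `tasks_per_core` knob, and the integer `partition_column` are all assumptions made for illustration. Only `spark.read.jdbc` and its parameters come from PySpark itself.

```python
import re


def partition_table_name(base_table: str, year: int, month: int) -> str:
    """Build a partition table name.

    ASSUMPTION: partitions are named <base>_<YYYY>_<MM>; adjust to the
    naming scheme actually used in the database.
    """
    # Reject anything that is not a plain (optionally schema-qualified) identifier,
    # since the name is interpolated into SQL below.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.]*", base_table):
        raise ValueError(f"unexpected table name: {base_table!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{base_table}_{year:04d}_{month:02d}"


def partition_query(partition_name: str) -> str:
    """Subquery that reads one partition directly instead of the parent table."""
    return f"(SELECT * FROM {partition_name}) AS tripdata"


def num_partitions(workers: int, cores_per_worker: int, tasks_per_core: int = 1) -> int:
    """Derive the Spark partition count from the cluster size (never below 1).

    ASSUMPTION: tasks_per_core is a hypothetical tuning knob, not a Spark setting.
    """
    return max(1, workers * cores_per_worker * tasks_per_core)


def read_partition(spark, jdbc_url, partition_name, properties,
                   workers, cores_per_worker,
                   partition_column, lower_bound, upper_bound):
    """Read one partition over JDBC, split into parallel range scans.

    `spark` is an existing SparkSession. `partition_column` is assumed to be
    an integer column so that the bounds can be passed as ints.
    """
    return spark.read.jdbc(
        url=jdbc_url,
        table=partition_query(partition_name),
        column=partition_column,
        lowerBound=lower_bound,
        upperBound=upper_bound,
        numPartitions=num_partitions(workers, cores_per_worker),
        properties=properties,
    )
```

Each Spark partition opens its own JDBC connection, so the computed count may need a cap that stays below the PostgreSQL connection limit.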

Current runtime: 798.432s or 13.3072 minutes

Once done, post a screenshot here showing that it works, and include the new runtime if there is any improvement.
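
One way to capture the new runtime and compare it with the 798.432 s baseline reported above is sketched here. The `timed` and `speedup` helpers are hypothetical names, not part of the project. Because Spark evaluates lazily, the timed block should include an action (for example a write or a count), otherwise only the query planning is measured.

```python
import time
from contextlib import contextmanager

BASELINE_SECONDS = 798.432  # runtime reported in this issue


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(baseline_seconds, new_seconds):
    """Return how many times faster the new run is than the baseline."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    results = {}
    with timed("spark_get_tripdata_from_psql", results):
        # Placeholder for the real work; call the code block and trigger
        # a Spark action here so the read is actually executed.
        time.sleep(0.01)
    elapsed = results["spark_get_tripdata_from_psql"]
    print(f"runtime: {elapsed:.3f}s, speedup vs baseline: "
          f"{speedup(BASELINE_SECONDS, elapsed):.1f}x")
```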