opensource-observer / oso

Measuring the impact of open source software
https://opensource.observer
Apache License 2.0
64 stars 14 forks

Partition the goldsky dagster assets #1668

Open ravenac95 opened 2 months ago

ravenac95 commented 2 months ago

What is it?

Goldsky assets are currently loaded monolithically from raw files into temporary tables and then merged into the final table (see below).

If we partition hourly or daily so that each set of files is tracked as its own asset, we can retry asset materializations more reliably and track which files to clean (once we determine a partition is safe to clean). From there we can repeatedly attempt to remerge the asset files into the fully merged form if there are failures. This can be done either by making the last two boxes into their own assets or as a single asset. Once the merged asset has completed, the files from the first asset can be safely deleted from our GCS bucket, and we can clean the partitions in Dagster as well.
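To make the bookkeeping concrete, here is a rough sketch in plain Python (not the Dagster API) of what tracking files per hourly partition could look like. The names `PartitionTracker`, `partition_key`, the key format, and the file names are all hypothetical and only illustrate the idea; the real version would presumably use Dagster's own partition definitions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical key format for an hourly partition (an assumption, not a Dagster constant).
PARTITION_FMT = "%Y-%m-%d-%H:00"


def partition_key(ts: datetime) -> str:
    """Map a file's (timezone-aware) timestamp to its hourly partition key in UTC."""
    return ts.astimezone(timezone.utc).strftime(PARTITION_FMT)


@dataclass
class PartitionTracker:
    # partition key -> raw files that landed in that hour
    files: dict = field(default_factory=lambda: defaultdict(list))
    # partitions whose files have been merged into the final table
    merged: set = field(default_factory=set)

    def add_file(self, path: str, ts: datetime) -> str:
        """Record a raw file under its hourly partition and return the key."""
        key = partition_key(ts)
        self.files[key].append(path)
        return key

    def mark_merged(self, key: str) -> None:
        """Mark a partition as successfully merged."""
        self.merged.add(key)

    def pending(self) -> list:
        """Partitions that still need a merge (or remerge) attempt."""
        return sorted(k for k in self.files if k not in self.merged)

    def files_safe_to_delete(self) -> list:
        """Raw files whose partition has already been merged."""
        return sorted(p for k in self.merged for p in self.files.get(k, []))


tracker = PartitionTracker()
utc = timezone.utc
tracker.add_file("a.parquet", datetime(2024, 6, 1, 10, 15, tzinfo=utc))
tracker.add_file("b.parquet", datetime(2024, 6, 1, 10, 45, tzinfo=utc))
tracker.add_file("c.parquet", datetime(2024, 6, 1, 11, 5, tzinfo=utc))
tracker.mark_merged("2024-06-01-10:00")
```

After the 10:00 partition is marked merged, only the 11:00 partition remains pending, and only the two files from the 10:00 hour are reported as safe to delete. A failed merge would simply leave its partition in `pending()` so it can be retried.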

ryscheng commented 2 months ago

According to @ravenac95, this is a nice-to-have and is not necessary until we start experiencing bugs in how we do pointer management in BigQuery. It would add partition keys (one for every hour) that Dagster could understand.

The conditions where this could be useful:

  1. If we find a lot of missing data or accidentally delete data, we would have exact tracking of which files to reload