Choosing partitioning details

https://cloud.google.com/bigquery/docs/partitioned-tables

daily/hourly/monthly?

https://cloud.google.com/bigquery/docs/partitioned-tables#choosing_daily_hourly_monthly_or_yearly_partitioning

Daily

Is the data spread out over a wide range of dates?
- Not yet*. The project aims to cover a wide range of forecast history, which will spread out evenly after long periods of ingestion.
Is data continuously added over time?
- Yes. The project ingests hourly forecasts every day.

hourly

Do the tables have a high volume of data that spans a short date range — typically less than six months of timestamp values?
- No. The project spans a wide date range, definitely more than six months of timestamp values.

monthly/yearly

Do the tables have a relatively small amount of data for each day, but span a wide date range?
- Yes. A large volume of data is expected to build up from stacking relatively small amounts of data over extended periods of time.
Does your workflow require frequently updating or adding rows that span a wide date range (for example, more than 500 dates)?
- No. Update/insert workflows span a short date range (1-2 dates).

early conclusion

Daily or monthly partitioning seems reasonable.

joons5834 commented 2 years ago

Partitioning versus clustering

https://cloud.google.com/bigquery/docs/partitioned-tables#partitioning_versus_clustering

clustering

You do not need strict cost guarantees before running the query?
- True.
Do you need more granularity than partitioning alone allows? To get clustering benefits in addition to partitioning benefits, you can use the same column for both partitioning and clustering.
- Yes. Other than time, the project needs granularity by weather service, region, forecast time etc.
Do your queries commonly use filters or aggregation against multiple particular columns?
- Yes.
Is the [cardinality](https://en.wikipedia.org/wiki/Cardinality_(SQL_statements)) of the number of values in a column or group of columns large?
- No. Since extreme weather is rare, numeric values have normal cardinality, and categorical values have low cardinality.

partitioning

Do you want to know query costs before a query runs?
- Yes.
Do you need partition-level management? For example, you want to set a partition expiration time, load data to a specific partition, or delete partitions.
- Yes. I want to load data to a specific partition.
Do you want to specify how the data will be partitioned and what data is in each partition? For example, you want to define time granularity or define the ranges used to partition the table for integer range partitioning.
- Yes. I'm looking for ways to construct yearly/quarterly/monthly reports for my application. Time-partitioned tables will support such use cases.

Prefer clustering over partitioning

under the following circumstances:

Partitioning results in a small amount of data per partition (approximately less than 1 GB).
- True for monthly partitioning
- hourly forecast data of KMA per location: 8 KB per hour
- 8 KB * 24 h * 30 d = 5.76 MB per location per month for KMA
- assuming 5 weather services * 20 locations => approx. 580 MB per month without compression
Partitioning results in a large number of partitions beyond the limits on partitioned tables.
- False. Given a maximum of 4000 partitions per table, even daily partitions will last for about 11 years (4000 / 365), weekly about 77 years (4000 / 52), and monthly about 333 years (4000 / 12), which is more than enough for this project!
Partitioning results in your mutation operations modifying the majority of partitions in the table frequently (for example, every few minutes).
- False. Once loaded, the majority of partitions are hardly modified.
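
The size and partition-count estimates above can be rechecked with a few lines of arithmetic. This is only a sketch: every number is the one quoted in the list (8 KB per hour, 5 services, 20 locations, 4000 partitions), 1 MB is taken as 1000 KB, and the daily figure is derived here rather than stated in the discussion.

```python
# Back-of-the-envelope sizing using the figures quoted in the list above.
KB_PER_HOUR_PER_LOCATION = 8   # hourly KMA forecast for one location
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30
WEATHER_SERVICES = 5           # assumption taken from the estimate above
LOCATIONS = 20                 # assumption taken from the estimate above
MAX_PARTITIONS = 4000          # per-table limit quoted above


def monthly_mb_per_location() -> float:
    """Uncompressed MB per location per month (1 MB = 1000 KB)."""
    return KB_PER_HOUR_PER_LOCATION * HOURS_PER_DAY * DAYS_PER_MONTH / 1000


def monthly_partition_mb() -> float:
    """Uncompressed MB in one monthly partition across all services and locations."""
    total_kb = (KB_PER_HOUR_PER_LOCATION * HOURS_PER_DAY * DAYS_PER_MONTH
                * WEATHER_SERVICES * LOCATIONS)
    return total_kb / 1000


def daily_partition_mb() -> float:
    """Uncompressed MB in one daily partition (derived, not from the discussion)."""
    return monthly_partition_mb() / DAYS_PER_MONTH


def years_until_limit(partitions_per_year: float) -> float:
    """Years before a table reaches MAX_PARTITIONS at the given rate."""
    return MAX_PARTITIONS / partitions_per_year


if __name__ == "__main__":
    print(f"per location per month: {monthly_mb_per_location():.2f} MB")
    print(f"monthly partition:      {monthly_partition_mb():.0f} MB")
    print(f"daily partition:        {daily_partition_mb():.1f} MB")
    print(f"daily partitions last:   {years_until_limit(365):.1f} years")
    print(f"monthly partitions last: {years_until_limit(12):.1f} years")
```

Under these assumptions both daily and monthly partitions stay below the 1 GB guideline, so the partition limit is not the constraint here; partition size is.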

joons5834 commented 2 years ago

In case things can go wrong and the decision has to change

How to migrate a non-partitioned table into a partitioned one

CREATE TABLE
  mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
  transaction_date
AS SELECT transaction_id, transaction_date FROM mydataset.mytable

Creating a clustered table from a query result

bq query --use_legacy_sql=false \
'CREATE TABLE
   mydataset.myclusteredtable
 PARTITION BY
   DATE(timestamp)
 CLUSTER BY
   customer_id AS
 SELECT
   *
 FROM
   `mydataset.mytable`'

Modifying clustering specification

You can change the clustering specification in the following ways:

Call the tables.update or tables.patch API method.

Call the bq command-line tool's bq update command with the --clustering_fields flag.

joons5834 commented 2 years ago

Possible solutions

Monthly partition without clustering
- each partition may be less than 1 GB
Monthly partition with clustering field (weather service)
- a good option if the number of weather services and locations is expected to grow beyond the current assumption.
Yearly partition with clustering field (month, weather service, region)
- must populate the derived month value during load/insert for a performance boost.
Yearly partition with clustering field (timestamp, weather service, region, # of minutes before forecast)
- To get clustering benefits in addition to partitioning benefits, you can use the same column for both partitioning and clustering.
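
To compare the options above side by side, a small helper can render the corresponding `CREATE TABLE ... PARTITION BY ... CLUSTER BY` statement for each one. This is a rough sketch, not the project's actual schema: the table name `mydataset.forecasts`, the column names, and the helper `create_table_ddl` are all hypothetical, and the generated SQL has not been run against the real data.

```python
from typing import Sequence

# Hypothetical names: the discussion does not give a table or column schema.
SOURCE_TABLE = "mydataset.forecasts"
TARGET_TABLE = "mydataset.forecasts_partitioned"
TS_COLUMN = "forecast_timestamp"

# Granularities accepted by TIMESTAMP_TRUNC in a PARTITION BY clause.
GRANULARITIES = {"HOUR", "DAY", "MONTH", "YEAR"}
MAX_CLUSTER_COLUMNS = 4  # BigQuery allows at most four clustering columns


def create_table_ddl(granularity: str, cluster_by: Sequence[str] = ()) -> str:
    """Render a CREATE TABLE ... AS SELECT statement for one candidate layout."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unsupported granularity: {granularity}")
    if len(cluster_by) > MAX_CLUSTER_COLUMNS:
        raise ValueError("too many clustering columns")
    lines = [
        f"CREATE TABLE {TARGET_TABLE}",
        f"PARTITION BY TIMESTAMP_TRUNC({TS_COLUMN}, {granularity})",
    ]
    if cluster_by:
        lines.append("CLUSTER BY " + ", ".join(cluster_by))
    lines.append(f"AS SELECT * FROM {SOURCE_TABLE}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Option 1: monthly partition without clustering.
    print(create_table_ddl("MONTH"), end="\n\n")
    # Option 2: monthly partition clustered by weather service.
    print(create_table_ddl("MONTH", ["weather_service"]), end="\n\n")
    # Option 4: yearly partition clustered by timestamp and three more columns.
    print(create_table_ddl(
        "YEAR",
        [TS_COLUMN, "weather_service", "region", "minutes_before_forecast"],
    ))
```

Option 4 uses all four clustering columns that BigQuery allows, so it leaves no room to add another clustering field later.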

joons5834 / weather-forecast-accuracy

partitioned/clustered table #25
