gluent / goe

GOE: a simple and flexible way to copy data from an Oracle Database to Google BigQuery.
Apache License 2.0

Improve backend numeric partitioning on BigQuery #167

Open nj1973 opened 5 months ago

nj1973 commented 5 months ago

Take this example Oracle partition scheme:

PARTITION BY RANGE ("DT_KEY")
…
 (PARTITION "P2005"  VALUES LESS THAN (20060101),
 PARTITION "P2006"  VALUES LESS THAN (20070101)
…
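
For reference, a minimal standalone version of such a table might look like the sketch below. DT_KEY is assumed to be a NUMBER holding YYYYMMDD date keys (as the partition bounds imply), and any other columns are illustrative only.

-- Illustrative sketch only: columns other than DT_KEY are assumptions.
CREATE TABLE OWNER.FACT
( DT_KEY  NUMBER(8) NOT NULL
, AMOUNT  NUMBER
)
PARTITION BY RANGE (DT_KEY)
( PARTITION P2005 VALUES LESS THAN (20060101)
, PARTITION P2006 VALUES LESS THAN (20070101)
, PARTITION P2007 VALUES LESS THAN (20080101)
);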

Currently, in addition to lower/upper bounds, we need to pass --partition-granularity to Offload to define the partition step, for example:

./offload -t OWNER.FACT -x \
--partition-granularity=10000 \
--partition-lower-value=20000000 \
--partition-upper-value=30000000

This adds a synthetic partition key:

PARTITION BY RANGE_BUCKET(`GOE_PART_10000_DT_KEY`, GENERATE_ARRAY(20000000, 30000000, 10000))

And truncates the synthetic values to multiples of 10,000. This is a hangover from Hadoop partitioning.
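
To make that concrete, the synthetic value is presumably DT_KEY rounded down to the nearest multiple of the granularity, along these lines (the exact expression GOE generates is not shown here, so treat this as a sketch):

-- Sketch only: assumes the synthetic column is DT_KEY rounded down to the
-- nearest multiple of the 10,000 granularity.
SELECT DT_KEY
,      CAST(FLOOR(DT_KEY / 10000) * 10000 AS INT64) AS GOE_PART_10000_DT_KEY
FROM   OWNER.FACT;
-- e.g. DT_KEY = 20060315 maps to GOE_PART_10000_DT_KEY = 20060000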

Instead we should be able to use the real column for partitioning and add a new partition step option, independent of --partition-granularity. Then we could partition by:

PARTITION BY RANGE_BUCKET(`DT_KEY`, GENERATE_ARRAY(20000000, 30000000, 10000))
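
For comparison, a full backend table definition under this proposal could look like the sketch below. The project/dataset names and non-partition columns are assumptions; a row with DT_KEY = 20060315 would land in the range starting at 20060000.

-- Sketch of the proposed approach: partition directly on the real column.
-- Project/dataset names and non-partition columns are illustrative only.
CREATE TABLE `my_project.my_dataset.FACT`
( DT_KEY  INT64 NOT NULL
, AMOUNT  NUMERIC
)
PARTITION BY RANGE_BUCKET(DT_KEY, GENERATE_ARRAY(20000000, 30000000, 10000));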