runt18 / google-bigquery

Automatically exported from code.google.com/p/google-bigquery
0 stars 0 forks source link

Add alter table functionality #466

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Is there any plans to add in alter table operations, such as rename column, add 
column?

Our use case is that we would like to rename a column in our bigquery tables 
which are partitioned by date.
eg. a table named fact would have many partitions such as 
fact_20160301
fact_20160302
fact_20160303... etc

Currently, we have a library of python wrappers written to perform these 
operations.

eg. rename column generates individual queries that do aliasing and then 
perform a write_truncate load onto the parent tables. We do these operations 
with batch executions, however, it's hard to keep track job history. 

We are also limited by the fact that GBQ only supports 50 concurrent jobs. This 
means that for each batch operation we perform, we have do max a month of 
partitions at a time, so as to leave a buffer to other ETL jobs that are 
running. Currently we use the BQ command line tool to check whether or not the 
batch request jobs have succeeded, and it's only until they do can we proceed 
to perform the batch operations on the next set of partitions.

We would like these table features to be added, because right now, there isn't 
a good way to keep track of these batch jobs. 

If there is a way that we can get individual job_id's from a batch request, so 
that we can do some sort of poll_job operation on the batch request, then that 
would also be very helpful. Having built in functionality that supports 
operations such as add column or rename column would be great to have. Are 
these features in the product road map at all?

Original issue reported on code.google.com by yuchen....@tapjoy.com on 15 Mar 2016 at 5:29

GoogleCodeExporter commented 8 years ago
Add column is supported today, have a look at:
https://cloud.google.com/bigquery/docs/tables#updateschema

We don't currently plan to support renaming columns, but this could change in 
the future. I'll pass the feature request along to the team.

If you are using queries to batch transform your data, you should be able to 
issue them in --batch mode to avoid having to do your own queuing (the 50 
concurrent jobs limit applies only to interactive mode for query jobs).

If you are concerned about blocking higher priority jobs with your batch 
operations, you can insert those jobs at the default (interactive) priority and 
they will be executed first.

Trivial example:
# Run a big batch of queries.
for i in `seq 1 50`; do (bq --nosync query --batch --nouse_cache "SELECT 42 as 
batch_test" &); done

# Run an interactive query, it should complete quickly.
bq query "select 99"

Original comment by ada...@google.com on 15 Mar 2016 at 8:39

GoogleCodeExporter commented 8 years ago
Regarding: "If there is a way that we can get individual job_id's from a batch 
request [...]"

You can specify your own job IDs. For example:
bq query --job_id="rename_fact_col_20140102" "SELECT 777"

This makes it easy to wait on job completion (without parsing the response):
bq wait --fail_on_error rename_fact_col_20140102

So your script could queue up all of the rename operations in parallel and then 
wait on each of them to complete.

Original comment by ada...@google.com on 15 Mar 2016 at 8:48