spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0
155 stars 52 forks

Partition table by date #13

Open samelamin opened 8 years ago

samelamin commented 8 years ago

Hi guys

Really awesome support on this connector, I appreciate it

I was wondering if there is a flag to enable partitioning by date. I see Google Analytics uses that.

Since BigQuery charges by the amount of data each query scans, you ideally want each query to touch as little data as possible to keep your costs low.

It would be very useful if we could partition tables by date. The Google docs here have more details.

Thoughts?

nevillelyh commented 8 years ago

IIRC there's no special treatment of partitioned tables in the BQ API. Instead, you should just use the special query syntax described here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables?
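For illustration, here is a minimal sketch of what such a query might look like. It assumes an ingestion-time partitioned table, which exposes the `_PARTITIONTIME` pseudo-column, and standard SQL syntax. The table name and the helper `partition_filter_query` are hypothetical and not part of this connector.

```python
from datetime import date


def partition_filter_query(table: str, start: date, end: date) -> str:
    """Build a query that only reads partitions in the range [start, end)."""
    # _PARTITIONTIME is the pseudo-column on ingestion-time partitioned tables;
    # filtering on it restricts the scan to the matching daily partitions.
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start.isoformat()}') "
        f"AND _PARTITIONTIME < TIMESTAMP('{end.isoformat()}')"
    )


# Hypothetical table name, used only for illustration.
query = partition_filter_query(
    "my_project.my_dataset.events", date(2016, 10, 1), date(2016, 10, 2)
)
print(query)
```

The resulting string could then be passed to whatever query entry point you already use.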

samelamin commented 8 years ago

Sorry, maybe I phrased the question wrong. When saving a DataFrame to a BQ table, is it possible to flag the table as partitioned by date?

I can create the table manually, but ideally I want the schema to be inferred from the DataFrame, the way it works now.

nevillelyh commented 8 years ago

We dump the DataFrame into Avro files and load them into BQ via a load job. I don't see any reference to partitioning in its documentation: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load. But you're welcome to play around with it and submit a PR if you get it to work.

samelamin commented 8 years ago

Yeah, it seems you need to mark the table as partitioned at creation time:

Creating a partitioned table

To create a partitioned table, you must declare the table as partitioned at creation time. You do not need to specify a schema, as the schema can be specified when data is subsequently loaded or copied into the table.
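As a rough sketch of that creation step, the request body for the REST `tables.insert` call might look like the following. This assumes the `timePartitioning` field of the BigQuery v2 table resource with type `DAY`. The project, dataset, and table names, and the helper `partitioned_table_body`, are hypothetical.

```python
import json


def partitioned_table_body(project_id: str, dataset_id: str, table_id: str) -> dict:
    """Build a tables.insert request body for a day-partitioned table."""
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        # Partitioning is declared at creation time. No "schema" key is set,
        # since the schema can be supplied by a later load or copy.
        "timePartitioning": {"type": "DAY"},
    }


# Hypothetical identifiers, used only for illustration.
body = partitioned_table_body("my-project", "my_dataset", "events")
print(json.dumps(body, indent=2))
```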

While we do not need to specify a schema at creation (it can be supplied at load time), there is a subtle difference in the way you load the data:

To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
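A minimal sketch of how a load into a single partition could be expressed, assuming the `$YYYYMMDD` decorator suffix that BigQuery documents for day-partitioned tables and the `configuration.load` fields of the v2 jobs API. The helpers `with_partition_decorator` and `avro_load_config`, the bucket path, and all identifiers are hypothetical. This is not how the connector currently builds its load job.

```python
from datetime import date


def with_partition_decorator(table_id: str, day: date) -> str:
    """Append a partition decorator for the given day to a table ID."""
    return f"{table_id}${day.strftime('%Y%m%d')}"


def avro_load_config(project_id, dataset_id, table_id, day, source_uris):
    """Build a jobs.insert configuration that loads Avro files into one partition."""
    return {
        "load": {
            "sourceUris": list(source_uris),
            "sourceFormat": "AVRO",
            "destinationTable": {
                "projectId": project_id,
                "datasetId": dataset_id,
                # The decorated table ID targets a single daily partition.
                "tableId": with_partition_decorator(table_id, day),
            },
            # Assumption: replace the contents of that partition only.
            "writeDisposition": "WRITE_TRUNCATE",
        }
    }


# Hypothetical identifiers and bucket path, used only for illustration.
config = avro_load_config(
    "my-project", "my_dataset", "events", date(2016, 10, 5),
    ["gs://my-bucket/tmp/part-*.avro"],
)
print(config["load"]["destinationTable"]["tableId"])
```

Keeping the decorator logic in its own small function would make it easy to reuse if the connector ever accepted a partition date as an option.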

I'll figure out how it's done, then I might do a PR :)

Thanks 👍