healthyregions / oeps

Opioid Environment Policy Scan - data explorer and backend management
https://oeps.healthyregions.org
GNU General Public License v3.0

Create sample dataset and table via GoogleSQL DDL and Python #2

Closed: mradamcox closed this issue 1 year ago

mradamcox commented 1 year ago

After looking around at the details of how #1 would be accomplished, we're finding that it doesn't seem possible to directly load a local CSV file into BigQuery with Python (or perhaps with other methods either). Instead, CSVs are typically uploaded to Google Cloud Storage first, and an import is then configured to transfer the content from there into BigQuery. It would be nice to avoid that intermediate step if possible.

With BigQuery's SQL variants (DDL, DML, DCL) it looks as though we can fully script the creation of all datasets and tables, and then fill those tables, by generating SQL and posting it via Python.
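
As a rough sketch of what "posting SQL via Python" could look like (assuming an authenticated google-cloud-bigquery client; the project, dataset, and table names below are placeholders, not the final specs):

# minimal sketch: run a DDL script through the BigQuery Python client
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

ddl = """
CREATE SCHEMA IF NOT EXISTS test_dataset;
CREATE TABLE IF NOT EXISTS test_dataset.test_table
(
  x INT64,
  y INT64
);
"""

# query() accepts DDL/DML scripts as well as SELECTs; result() waits for completion
client.query(ddl).result()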

One nice thing about generating SQL statements (which we could save to .sql files) is that they could probably be adapted fairly easily to target a different RDBMS, like Postgres/PostGIS, making this migration codebase a bit more generic than if we only used the Google Python bindings.

At any rate, it is worth trying out, and the first step is to just create the dataset and table. These will ultimately be defined by JSON-formatted specs within this repo.
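
For example (purely illustrative; the actual JSON spec format is still to be decided), a spec could be turned into CREATE TABLE DDL with something like:

# hypothetical sketch: generate CREATE TABLE DDL from a JSON spec
# (the spec layout below is made up for illustration only)
import json

spec = json.loads("""
{
  "dataset": "test_dataset",
  "table": "test_table",
  "fields": [
    {"name": "x", "type": "INT64", "description": "An optional INTEGER field"},
    {"name": "y", "type": "INT64", "description": "Another optional INTEGER field"}
  ]
}
""")

cols = ",\n".join(
    f'  {f["name"]} {f["type"]} OPTIONS(description="{f["description"]}")'
    for f in spec["fields"]
)
ddl = f'CREATE TABLE {spec["dataset"]}.{spec["table"]}\n(\n{cols}\n)'
print(ddl)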

mradamcox commented 1 year ago

Here is a SQL statement to use for the test:

"""DROP SCHEMA IF EXISTS test_dataset;
CREATE SCHEMA test_dataset;
CREATE TABLE test_dataset.test_table
(
  x INT64 OPTIONS(description="An optional INTEGER field"),
  y INT64 OPTIONS(description="Another optional INTEGER field"),
)"""

The DROP IF EXISTS line should (hopefully) allow it to be run repeatedly during development without errors (creating a table that already exists will cause an error).
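
For the later "fill those tables" step, a DML INSERT against this test table might look like the following sketch (run through the same Python client; real rows would ultimately come from the CSV data):

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# simple DML insert into the test table defined above
dml = """INSERT INTO test_dataset.test_table (x, y)
VALUES (1, 2),
       (3, 4)"""

# DML runs through the same query() call as the DDL
client.query(dml).result()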

JuHe0120 commented 1 year ago

Here is an example of how to create a new table in Google BigQuery:

# the modules you will use
from google.auth import impersonated_credentials
from google.auth.transport.requests import Request
from google.cloud import bigquery
from google.oauth2 import service_account

# set up the project
project_id = 'your-project-id'

# set up credentials from a service account key file
credentials_path = "key_path"
credentials = service_account.Credentials.from_service_account_file(
    credentials_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
# refresh the source credentials if they are not yet valid
# (the client would also do this automatically on first use)
if not credentials.valid:
    credentials.refresh(Request())

# impersonate a target service account
target_principal = 'your-email-address-with-service-account'
target_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
impersonated_creds = impersonated_credentials.Credentials(
    source_credentials=credentials,
    target_principal=target_principal,
    target_scopes=target_scopes,
)

# create the BigQuery client
client = bigquery.Client(project=project_id, credentials=impersonated_creds)

# define the schema (column names and data types)
schema = [
    bigquery.SchemaField('column_name_1', 'type1'),
    bigquery.SchemaField('column_name_2', 'type2'),
    bigquery.SchemaField('column_name_3', 'type3'),
]

# create the table ("project.dataset.table" identifies it fully)
table = bigquery.Table(f'{project_id}.your_dataset_name.your_table_name', schema=schema)
table = client.create_table(table)

print('Table created: {}'.format(table.table_id))
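
For comparison, the same client could also create the dataset and table by running the GoogleSQL DDL from the comment above, rather than calling create_table() directly (just a sketch):

# alternative sketch: run the DDL from the earlier comment through the same client
ddl = """DROP SCHEMA IF EXISTS test_dataset CASCADE;
CREATE SCHEMA test_dataset;
CREATE TABLE test_dataset.test_table
(
  x INT64 OPTIONS(description="An optional INTEGER field"),
  y INT64 OPTIONS(description="Another optional INTEGER field")
)"""
client.query(ddl).result()
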
mradamcox commented 1 year ago

closed with #5