googleapis / python-bigquery-sqlalchemy

SQLAlchemy dialect for BigQuery
MIT License

Changing Column data types fails in Alembic migration due to missing SET DATA TYPE #471

Closed by benvdh-incentro 1 year ago

benvdh-incentro commented 2 years ago

Steps to reproduce

  1. Create a simple table class with a single column of data type String.
  2. Autogenerate a migration that deploys the table with its single column to BigQuery.
  3. Deploy the table to BigQuery using alembic upgrade head.
  4. Ensure that run_migrations_online in alembic's env.py passes compare_type=True to the context.configure() call in that function.
  5. Change the data type of the single column from String to ARRAY(String).
  6. Autogenerate a new migration and check that it has picked up the data type change.
  7. Attempt to deploy the second migration to BigQuery using alembic upgrade head.
  8. Observe that SQLAlchemy raises a DatabaseError:

sqlalchemy.exc.DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Syntax error: Expected keyword DROP or keyword SET but got keyword TYPE at [1:72]

Expected behaviour

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_column_set_data_type_statement
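
The dialect should render column type changes with BigQuery's SET DATA TYPE clause rather than the generic TYPE clause. A sketch of the difference, using the example table (note that BigQuery additionally restricts which type coercions are allowed, and STRING to ARRAY&lt;STRING&gt; is not among them, as discussed further down in this thread):

```sql
-- Currently emitted by the dialect (rejected with a syntax error):
ALTER TABLE `some_table` ALTER COLUMN `some_column` TYPE ARRAY<STRING>;

-- Syntax expected by BigQuery, per the DDL reference linked above:
ALTER TABLE `some_table` ALTER COLUMN `some_column` SET DATA TYPE ARRAY<STRING>;
```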

Code example

# ORM model first migration (my_app/models/orm_model.py)
from sqlalchemy.orm import declarative_base

from sqlalchemy import Column, String

BQBase = declarative_base()

class SomeTable(BQBase):
    __tablename__ = "some_table"

    some_column = Column(
        String,
        name="some_column",
        nullable=False
    )

# ORM model second migration
from sqlalchemy.orm import declarative_base

from sqlalchemy import Column, String, ARRAY

BQBase = declarative_base()

class SomeTable(BQBase):
    __tablename__ = "some_table"

    some_column = Column(
        ARRAY(String),
        name="some_column",
        nullable=False
    )

# alembic/env.py

import os
from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool

from dotenv import load_dotenv

from alembic import context
from my_app.models.orm_model import BQBase

# load environment variables
load_dotenv()

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
fileConfig(config.config_file_name)

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = BQBase.metadata

# other values from the config, defined by the needs of env.py,
# can be acquired:
# my_important_option = config.get_main_option("my_important_option")
# ... etc.

def run_migrations_offline():
    """Run migrations in 'offline' mode.

    This configures the context with just a URL
    and not an Engine, though an Engine is acceptable
    here as well.  By skipping the Engine creation
    we don't even need a DBAPI to be available.

    Calls to context.execute() here emit the given string to the
    script output.

    """
    url = os.getenv("BIGQUERY_CONNECTION_URL")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        compare_type=True,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()

def run_migrations_online():
    """Run migrations in 'online' mode.

    In this scenario we need to create an Engine
    and associate a connection with the context.

    """
    connectable = engine_from_config(
        {"sqlalchemy.url": os.getenv("BIGQUERY_CONNECTION_URL")},
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            compare_type=True
        )

        with context.begin_transaction():
            context.run_migrations()

if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()

# first migration
"""Add schema

Revision ID: 1
Revises: 
Create Date: 

"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = '1'
down_revision = None
branch_labels = None
depends_on = None

def upgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.create_table('some_table',
    sa.Column('some_column', sa.String(), nullable=False),
    )

def downgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_table('some_table')

# second migration

"""Change data type of some_table.some_column

Revision ID: 2
Revises: 1
Create Date: 

"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = '2'
down_revision = '1'
branch_labels = None
depends_on = None

def upgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.alter_column('some_table', 'some_column',
               existing_type=sa.String(),
               type_=sa.ARRAY(sa.String()),
               existing_nullable=False)
    # ### end Alembic commands ###

def downgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.alter_column('some_table', 'some_column',
               existing_type=sa.ARRAY(sa.String()),
               type_=sa.String(),
               existing_nullable=False)
    # ### end Alembic commands ###

Stack trace

Traceback (most recent call last):
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/bin/alembic", line 8, in <module>
    sys.exit(main())
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/config.py", line 590, in main
    CommandLine(prog=prog).main(argv=argv)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/config.py", line 584, in main
    self.run_cmd(cfg, options)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/config.py", line 561, in run_cmd
    fn(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/command.py", line 322, in upgrade
    script.run_env()
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/script/base.py", line 569, in run_env
    util.load_python_file(self.dir, "env.py")
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/util/pyfiles.py", line 94, in load_python_file
    module = load_module_py(module_id, path)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/util/pyfiles.py", line 110, in load_module_py
    spec.loader.exec_module(module)  # type: ignore
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "./my_project/alembic/env.py", line 87, in <module>
    run_migrations_online()
  File "./my_project/alembic/env.py", line 81, in run_migrations_online
    context.run_migrations()
  File "<string>", line 8, in run_migrations
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/runtime/environment.py", line 853, in run_migrations
    self.get_context().run_migrations(**kw)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/runtime/migration.py", line 623, in run_migrations
    step.migration_fn(**kw)
  File "/home/someUser/Documents/my_project/alembic/versions/32ccb975b7eb_change_type_of_education_institution_.py", line 21, in upgrade
    op.alter_column('some_table', 'some_column',
  File "<string>", line 8, in alter_column
  File "<string>", line 3, in alter_column
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/operations/ops.py", line 1880, in alter_column
    return operations.invoke(alt)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/operations/base.py", line 394, in invoke
    return fn(self, operation)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/operations/toimpl.py", line 50, in alter_column
    operations.impl.alter_column(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/ddl/impl.py", line 275, in alter_column
    self._exec(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/alembic/ddl/impl.py", line 195, in _exec
    return conn.execute(construct, multiparams)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/sql/ddl.py", line 77, in _execute_on_connection
    return connection._execute_ddl(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1381, in _execute_ddl
    ret = self._execute_context(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute
    cursor.execute(statement, parameters)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 489, in with_closed_check
    return method(self, *args, **kwargs)
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 166, in execute
    self._execute(
  File "/home/someUser/.cache/pypoetry/virtualenvs/my-project-py3.9/lib/python3.9/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 205, in _execute
    raise exceptions.DatabaseError(exc)
sqlalchemy.exc.DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Syntax error: Expected keyword DROP or keyword SET but got keyword TYPE at [1:72]

(job ID: )

                               -----Query Job SQL Follows-----                                

    |    .    |    .    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:ALTER TABLE `some_table` ALTER COLUMN `some_column` TYPE ARRAY<STRING>
    |    .    |    .    |    .    |    .    |    .    |    .    |    .    |    .    |
[SQL: ALTER TABLE `some_table` ALTER COLUMN `some_column` TYPE ARRAY<STRING>]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

Thanks!

chalmerlowe commented 1 year ago

I have a question about this:

Looking at the documentation, BigQuery appears to be able to perform some type conversions, but there are limitations on which conversions are possible.

@shollyman @tswast Thoughts?

benvdh-incentro commented 1 year ago

@chalmerlowe just tested this quickly in the Cloud Console with the following query and test table.

Table schema (of table test_table):

[
  {
    "name": "my_text_column",
    "mode": "NULLABLE",
    "type": "STRING",
    "description": null,
    "fields": []
  },
  {
    "name": "my_int_column",
    "mode": "NULLABLE",
    "type": "INTEGER",
    "description": null,
    "fields": []
  }
]

Query to alter String column type:

ALTER TABLE `test_dataset.test-table` ALTER COLUMN `my_text_column` SET DATA TYPE ARRAY<STRING>;

The above query results in the following error in the console:

ALTER TABLE ALTER COLUMN SET DATA TYPE requires that the existing column type (STRING) is assignable to the new type (ARRAY<STRING>) at [1:39]

So you are right that the conversion in my initial example is invalid. However, as shown above, the SQL syntax generated by the dialect is also incorrect; that is the undesirable behaviour I originally intended to report.

However, a change between two types that BigQuery does allow to be coerced (INT -> NUMERIC) does not result in an error in the Cloud Console:

ALTER TABLE `test_dataset.test-table` ALTER COLUMN `my_int_column` SET DATA TYPE NUMERIC;

So despite the number of allowed column type changes being fairly limited, I think the point still stands that the SQL generated for such statements is not entirely correct....

UPDATE: I have updated the Expected Behaviour section in my report to take into account allowed, and not allowed type coercions.

chalmerlowe commented 1 year ago

@benvdh-incentro Thanks for this additional feedback. You are right that my question did not directly address the main part of your concern. To be honest, I am learning in public here and exploring this issue step by step, so I am trying to make sure I fully understand it.

One question that I am trying to wrap my head around is:

When does the SQL statement get created, i.e. which library is responsible for stringing together the SQL?

Looking over the traceback, as best I can tell, the SQL is already formed elsewhere by the time my library sends it; my code simply passes it along:

  File "/home/.../site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 489, in with_closed_check
    return method(self, *args, **kwargs)
  File "/home/.../site-packages/google/cloud/bigquery/dbapi/cursor.py", line 166, in execute
    self._execute(
  File "/home/.../site-packages/google/cloud/bigquery/dbapi/cursor.py", line 205, in _execute
    raise exceptions.DatabaseError(exc)
sqlalchemy.exc.DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Syntax error: 
    Expected keyword DROP or keyword SET but got keyword TYPE at [1:72]

Thoughts? Am I missing something?

benvdh-incentro commented 1 year ago

@chalmerlowe Just did a bit of digging:

It seems that the hardcoded string TYPE %s in Alembic's base implementation of visit_column_type (in alembic.ddl.base) is causing the issue here...
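
To illustrate, here is a simplified, pure-Python sketch of how that default rendering assembles the statement (the real implementation in alembic.ddl.base builds the same shape of string but uses the dialect's compiler helpers for identifier quoting and type formatting):

```python
# Simplified sketch of Alembic's default column-type rendering; the
# function name here is illustrative, not Alembic's actual API.
def render_column_type(table_name, column_name, new_type):
    return "%s %s %s" % (
        "ALTER TABLE %s" % table_name,
        "ALTER COLUMN %s" % column_name,
        "TYPE %s" % new_type,  # hardcoded "TYPE" -- BigQuery needs "SET DATA TYPE"
    )

print(render_column_type("`some_table`", "`some_column`", "ARRAY<STRING>"))
# ALTER TABLE `some_table` ALTER COLUMN `some_column` TYPE ARRAY<STRING>
```

This reproduces exactly the statement that BigQuery rejected in the stack trace above.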

benvdh-incentro commented 1 year ago

@chalmerlowe Here's a slightly more high-level answer...

With regard to DDL, SQLAlchemy itself only supports CREATE and DROP statements, judging from the methods on its DDLCompiler, while most of the ALTER logic is implemented in the alembic.ddl package. SQLAlchemy is mostly relied upon to execute the DDL statements and, for the built-in dialects, to handle the dialect-specific details. python-bigquery-sqlalchemy is essentially another SQLAlchemy dialect that handles the specifics for BigQuery.

benvdh-incentro commented 1 year ago

@chalmerlowe You might be wondering by now how to get Alembic to recognize the bigquery dialect. I just came across this SO post, and a very nice discussion between Mike Bayer (author of SQLAlchemy and Alembic) and someone wanting to build their own custom Impl class for Alembic; it seems you could integrate such a class into this package and it should work:

SO Thread: https://stackoverflow.com/questions/59887588/how-to-add-new-dialect-to-alembic-besides-built-in-dialects

Discussion: https://groups.google.com/g/sqlalchemy-alembic/c/t3KmE9KDzH4/m/AK1UylnCCQAJ

benvdh-incentro commented 1 year ago

@chalmerlowe And a first go at a potential fix:

import sys

from alembic.ddl.base import ColumnType, alter_table, alter_column, format_type
from alembic.ddl.impl import DefaultImpl
from sqlalchemy import String, create_engine
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.compiler import DDLCompiler
from sqlalchemy_bigquery import BigQueryDialect

class MyImpl(DefaultImpl):
    # Registering a DefaultImpl subclass with __dialect__ = "bigquery"
    # makes Alembic use it for BigQuery connections.
    __dialect__ = "bigquery"

@compiles(ColumnType, 'bigquery')
def visit_column_type(element: ColumnType, compiler: DDLCompiler, **kw) -> str:
    # Override the default rendering ("TYPE <type>") with the
    # "SET DATA TYPE <type>" syntax that BigQuery requires.
    return "%s %s %s" % (
        alter_table(compiler, element.table_name, element.schema),
        alter_column(compiler, element.column_name),
        "SET DATA TYPE %s" % format_type(compiler, element.type_),
    )

my_impl = DefaultImpl.get_by_dialect(BigQueryDialect)

engine = create_engine("bigquery://my-project/my_dataset")
my_impl_obj = my_impl(BigQueryDialect(), engine, True, True, sys.stdout, dict())
my_impl_obj.alter_column("my_dataset.my-table", "my_column", type_=String)

The above code outputs the following:

ALTER TABLE `my_dataset.my-table` ALTER COLUMN `my_column` SET DATA TYPE STRING;

This did require the following change in BigQueryDialect's constructor in order for it to work:

self.identifier_preparer = self.preparer(self)

Additionally, I noticed that BigQueryIdentifierPreparer's quote method might need a fix too: currently, when the schema= parameter is passed to alter_column (instead of hardcoding the dataset name in the table name), it results in the following output. This might also be due to the double call to various quoting functions/methods in alembic's format_table_name:

ALTER TABLE `my_dataset`.`my-table` ALTER COLUMN `my_column` SET DATA TYPE STRING;

The above output follows from:

my_impl_obj.alter_column("my-table", "my_column", type_=String, schema="my_dataset")

For now, I will leave it at this... as it's quite late here... in case you need more info feel free to comment...

chalmerlowe commented 1 year ago

@benvdh-incentro

Would you be amenable to make a PR that encompasses these changes so that we can run it through the testing process and make sure none of the existing tests break?

benvdh-incentro commented 1 year ago

@chalmerlowe I can have a go at that... (assuming you think the above looks like the right approach)

One small implementation question though: would you like the Alembic BigQueryImpl class in a separate file, something like alembic_migrations.py (to avoid namespace clashes with the regular alembic package), or just in the regular base.py?

It might take a few days though before I have it ready, as I'm also working on a lot of other things...

benvdh-incentro commented 1 year ago

@chalmerlowe I have added a pull request... perhaps you can review, kick off the CI, or have one of your colleagues review it...