Closed benvdh-incentro closed 1 year ago
I have a question about this:
Looking at the documentation, it appears that BigQuery has the ability to do some conversions, but it appears as though there are limitations on which types of conversions are possible.
@shollyman @tswast Thoughts?
@chalmerlowe just tested this quickly in the Cloud Console with the following query and test table.
Table schema (of table test-table):
[
{
"name": "my_text_column",
"mode": "NULLABLE",
"type": "STRING",
"description": null,
"fields": []
},
{
"name": "my_int_column",
"mode": "NULLABLE",
"type": "INTEGER",
"description": null,
"fields": []
}
]
Query to alter String column type:
ALTER TABLE `test_dataset.test-table` ALTER COLUMN `my_text_column` SET DATA TYPE ARRAY<STRING>;
The above query results in the following error in the console:
ALTER TABLE ALTER COLUMN SET DATA TYPE requires that the existing column type (STRING) is assignable to the new type (ARRAY<STRING>) at [1:39]
So you are right that the conversion in my initial example is invalid. However, as shown above, the SQL syntax generated by the dialect is also incorrect; that is the original undesirable behaviour I intended to report.
That said, using two types that are allowed to be coerced (INTEGER -> NUMERIC) does not result in an error in the Cloud Console:
ALTER TABLE `test_dataset.test-table` ALTER COLUMN `my_int_column` SET DATA TYPE NUMERIC;
So despite the number of allowed column type changes being fairly limited, I think the point still stands that the SQL generated for such statements is not entirely correct...
UPDATE: I have updated the Expected Behaviour section in my report to take into account allowed and disallowed type coercions.
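For reference, here is my current understanding of which SET DATA TYPE coercions BigQuery allows, per the DDL documentation. This is only an illustrative sketch: `ALLOWED_COERCIONS` and `is_coercible` are names I made up, and the matrix should be double-checked against the docs before relying on it.

```python
# Allowed ALTER COLUMN SET DATA TYPE coercions, per my reading of the
# BigQuery DDL documentation (double-check against the docs; type names
# are in standard SQL, so the INTEGER column above corresponds to INT64).
ALLOWED_COERCIONS = {
    "INT64": {"NUMERIC", "BIGNUMERIC", "FLOAT64"},
    "NUMERIC": {"BIGNUMERIC", "FLOAT64"},
    "BIGNUMERIC": {"FLOAT64"},
}

def is_coercible(from_type: str, to_type: str) -> bool:
    """Return True when BigQuery allows coercing from_type to to_type."""
    return to_type in ALLOWED_COERCIONS.get(from_type, set())

print(is_coercible("INT64", "NUMERIC"))         # True  (the query that worked)
print(is_coercible("STRING", "ARRAY<STRING>"))  # False (the query that failed)
```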
@benvdh-incentro Thanks for this additional feedback. I agree with you and am tracking your point that my question did not directly address the main part of your concern. Gonna be honest, I am learning in public here, so I am exploring this issue step by step. Thus I am attempting to ensure that I more fully understand:
- alembic
- sqlalchemy
- BigQuery
and how python-bigquery-sqlalchemy interacts with all three of the above.
One question that I am trying to wrap my head around is: when does the SQL statement get created (i.e. which library is responsible for stringing together the SQL)?
Looking over the traceback, as best I can tell, by the time my library sends the SQL it has already been formed elsewhere; my code simply passes it along:
File "/home/.../site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 489, in with_closed_check
return method(self, *args, **kwargs)
File "/home/.../site-packages/google/cloud/bigquery/dbapi/cursor.py", line 166, in execute
self._execute(
File "/home/.../site-packages/google/cloud/bigquery/dbapi/cursor.py", line 205, in _execute
raise exceptions.DatabaseError(exc)
sqlalchemy.exc.DatabaseError: (google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Syntax error:
Expected keyword DROP or keyword SET but got keyword TYPE at [1:72]
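The traceback supports that reading: by the time `cursor.execute` in the DB-API layer runs, the statement is already a finished string. A toy sketch of that division of labour (all names here are made up for illustration, not the real classes):

```python
# Toy model of the layering seen in the traceback: the DDL string is
# fully formed by the compiler layer (alembic + the SQLAlchemy dialect)
# before the DB-API cursor ever sees it; the cursor just ships it off.
# All names here are illustrative, not the real implementations.
class FakeCursor:
    def __init__(self):
        self.sent = []

    def execute(self, sql):
        # The cursor receives an already-complete SQL string.
        self.sent.append(sql)

def compile_ddl():
    # Stands in for alembic's visit_column_type plus the dialect compilers.
    return "ALTER TABLE `t` ALTER COLUMN `c` TYPE STRING"

cur = FakeCursor()
cur.execute(compile_ddl())
print(cur.sent[0])
```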
Thoughts? Am I missing something?
@chalmerlowe Just did a bit of digging:
Alembic's visit_column_type builds the statement by calling 3 functions:
- alter_table
- alter_column
- format_type
alter_table generates the ALTER TABLE part of the string, and then relies on format_table_name to do the rest (some stuff can be set here using parameters). alter_column follows a similar flow as alter_table. format_type, however, relies on the dialect's TypeCompiler to do its job. That TypeCompiler is both part of this library (the class BigQueryTypeCompiler) and SQLAlchemy's GenericTypeCompiler, which it inherits from... It seems that the hardcoded string TYPE %s in the base implementation of visit_column_type is causing the issue here...
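To make the contrast concrete, here is a minimal stand-in. The helper functions below are simplified fakes that only mimic the shape of alembic.ddl.base, not the real implementations; the point is simply why the hardcoded TYPE %s fragment produces SQL that BigQuery rejects, while SET DATA TYPE would be accepted.

```python
# Simplified fakes mimicking the shape of alembic.ddl.base; only the
# final "%s %s %s" join and the hardcoded "TYPE %s" fragment reflect
# the real base compiler.

def alter_table(table_name):
    return "ALTER TABLE `%s`" % table_name

def alter_column(column_name):
    return "ALTER COLUMN `%s`" % column_name

def format_type(type_name):
    return type_name

def visit_column_type_base(table, column, new_type):
    # Base (generic) behaviour: emits "TYPE <t>", which BigQuery rejects.
    return "%s %s %s" % (
        alter_table(table),
        alter_column(column),
        "TYPE %s" % format_type(new_type),
    )

def visit_column_type_bigquery(table, column, new_type):
    # BigQuery needs the keywords "SET DATA TYPE" instead.
    return "%s %s %s" % (
        alter_table(table),
        alter_column(column),
        "SET DATA TYPE %s" % format_type(new_type),
    )

print(visit_column_type_base("my_dataset.my-table", "my_column", "STRING"))
print(visit_column_type_bigquery("my_dataset.my-table", "my_column", "STRING"))
```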
@chalmerlowe Here's a slightly more high-level answer...
With regard to DDL, it seems SQLAlchemy itself only supports DROP and CREATE statements, judging from the methods in its DDLCompiler, while most of the ALTER logic is implemented in the alembic.ddl package... SQLAlchemy is relied upon mostly to execute the DDL statements and, in the case of the default dialects, to handle the dialect-specific info. python-bigquery-sqlalchemy is mostly another dialect for SQLAlchemy that handles the specifics for BigQuery...
@chalmerlowe You might be wondering by now how to get alembic to recognize the bigquery dialect... I just came across this SO post, and a very nice discussion between Mike Bayer (author of SQLAlchemy + alembic) and someone wanting to build their own custom Impl class for alembic; it seems you could integrate it into this package and it should work:
Discussion: https://groups.google.com/g/sqlalchemy-alembic/c/t3KmE9KDzH4/m/AK1UylnCCQAJ
@chalmerlowe And a first go at a potential fix:
import sys

from alembic.ddl.base import ColumnType, alter_table, alter_column, format_type
from alembic.ddl.impl import DefaultImpl
from sqlalchemy import String, create_engine
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.compiler import DDLCompiler

from sqlalchemy_bigquery import BigQueryDialect


class MyImpl(DefaultImpl):
    __dialect__ = "bigquery"


@compiles(ColumnType, 'bigquery')
def visit_column_type(element: ColumnType, compiler: DDLCompiler, **kw) -> str:
    return "%s %s %s" % (
        alter_table(compiler, element.table_name, element.schema),
        alter_column(compiler, element.column_name),
        "SET DATA TYPE %s" % format_type(compiler, element.type_),
    )


my_impl = DefaultImpl.get_by_dialect(BigQueryDialect)
engine = create_engine("bigquery://my-project/my_dataset")
my_impl_obj = my_impl(BigQueryDialect(), engine, True, True, sys.stdout, dict())
my_impl_obj.alter_column("my_dataset.my-table", "my_column", type_=String)
The above code outputs the following:
ALTER TABLE `my_dataset.my-table` ALTER COLUMN `my_column` SET DATA TYPE STRING;
This did require the following change in BigQueryDialect's constructor in order for it to work:
self.identifier_preparer = self.preparer(self)
Additionally, I noticed the BigQueryIdentifierPreparer's quote method might need a fix too, as currently, when providing the schema= parameter to alter_column (instead of hardcoding the dataset name in the table name), it results in the following output (this might also be due to the double call to various quoting functions/methods in alembic's format_table_name):
ALTER TABLE `my_dataset`.`my-table` ALTER COLUMN `my_column` SET DATA TYPE STRING;
The above output follows from:
my_impl_obj.alter_column("my-table", "my_column", type_=String, schema="my_dataset")
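For clarity, a tiny illustration of the two quoting styles at play here. This is plain string formatting, not the actual BigQueryIdentifierPreparer or format_table_name logic:

```python
# Two ways a schema-qualified name can be quoted (illustration only,
# not the actual BigQueryIdentifierPreparer logic).

def quote_whole(schema, table):
    # Quote the dot-joined path as one identifier.
    return "`%s.%s`" % (schema, table)

def quote_parts(schema, table):
    # Quote schema and table separately, as the schema= code path
    # currently produces.
    return "`%s`.`%s`" % (schema, table)

print(quote_whole("my_dataset", "my-table"))   # `my_dataset.my-table`
print(quote_parts("my_dataset", "my-table"))   # `my_dataset`.`my-table`
```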
For now, I will leave it at this, as it's quite late here... In case you need more info, feel free to comment...
@benvdh-incentro
Would you be amenable to making a PR that encompasses these changes, so that we can run it through the testing process and make sure none of the existing tests break?
@chalmerlowe I can have a go at that... (assuming you think the above looks like the right approach)
One small implementation question though: would you like to have the alembic BigQueryImpl class in a separate file (something like alembic_migrations.py, to avoid namespace clashes with the regular alembic package), or just in the regular base.py?
It might take a few days though before I have it ready, as I'm also working on a lot of other things...
@chalmerlowe I have added a pull request... perhaps you can review it, kick off the CI, or have one of your colleagues review it...
Environment details
- sqlalchemy-bigquery version: 1.4.4
- sqlalchemy version: 1.4.27
- alembic version: 1.8.0
Steps to reproduce
1. Make sure run_migrations_online passes the parameter compare_type=True to the context.configure() call in that function, so that alembic upgrade head detects column type changes.
2. Change a column type from String to ARRAY(String).
3. Run alembic upgrade head.
4. The migration fails with a DatabaseError:
Expected behaviour
The migration executes properly, in line with the type coercions allowed by BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#alter_column_set_data_type_statement
Code example
Stack trace
Making sure to follow these steps will guarantee the quickest resolution possible.
Thanks!