pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.55k stars 17.89k forks source link

to_sql issue with MS Azure data warehouse #17546

Open Saravji opened 7 years ago

Saravji commented 7 years ago

Code Sample, a copy-pastable example if possible

with the patch to sqlalchemy installed (see below), the following code snippet executes successfully:

connection_string = "mssql+pyodbc://<username>:<password>@<sqlhost>.database.windows.net:<port>/<database>?driver=ODBC+Driver+13+for+SQL+Server"
engn = sqlalchemy.engine.create_engine(connection_string, echo=True)
engn.connect()

The moment I add

df.to_sql(tbl_server_out, engn, if_exists='append', index=False)

I get the following error stack:

2017-09-13 15:07:50,200 INFO sqlalchemy.engine.base.Engine SELECT  SERVERPROPERTY('ProductVersion')
2017-09-13 15:07:50,202 INFO sqlalchemy.engine.base.Engine ()
2017-09-13 15:07:50,246 INFO sqlalchemy.engine.base.Engine SELECT schema_name()
2017-09-13 15:07:50,247 INFO sqlalchemy.engine.base.Engine ()
2017-09-13 15:07:50,522 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2017-09-13 15:07:50,522 INFO sqlalchemy.engine.base.Engine ()
2017-09-13 15:07:50,562 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS NVARCHAR(60)) AS anon_1
2017-09-13 15:07:50,563 INFO sqlalchemy.engine.base.Engine ()
2017-09-13 15:07:50,604 INFO sqlalchemy.engine.base.Engine SELECT [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA], [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME], [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME], [INFORMATION_SCHEMA].[COLUMNS].[IS_NULLABLE], [INFORMATION_SCHEMA].[COLUMNS].[DATA_TYPE], [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION], [INFORMATION_SCHEMA].[COLUMNS].[CHARACTER_MAXIMUM_LENGTH], [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_PRECISION], [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_SCALE], [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_DEFAULT], [INFORMATION_SCHEMA].[COLUMNS].[COLLATION_NAME] 
FROM [INFORMATION_SCHEMA].[COLUMNS] 
WHERE [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max))
2017-09-13 15:07:50,605 INFO sqlalchemy.engine.base.Engine ('dbo.MSODS_DSI', 'dbo')
2017-09-13 15:07:52,151 INFO sqlalchemy.engine.base.Engine 
CREATE TABLE [dbo.MSODS_DSI] (
    [a] DATETIME NULL, 
    [b] VARCHAR(max) NULL, 
    [c] VARCHAR(max) NULL, 
    [d] VARCHAR(max) NULL, 
    [e] VARCHAR(max) NULL, 
    [f] FLOAT(53) NULL, 
    [g] FLOAT(53) NULL
)

2017-09-13 15:07:52,152 INFO sqlalchemy.engine.base.Engine ()
2017-09-13 15:07:54,374 INFO sqlalchemy.engine.base.Engine ROLLBACK
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1181                         parameters,
-> 1182                         context)
   1183         except BaseException as e:

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    469     def do_execute(self, cursor, statement, parameters, context=None):
--> 470         cursor.execute(statement, parameters)
    471 

ProgrammingError: ('42000', "[42000] [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]The statement failed. Column 'b' has a data type that cannot participate in a columnstore index.\r\nOperation cancelled by user. (35343) (SQLExecDirectW)")

The above exception was the direct cause of the following exception:

ProgrammingError                          Traceback (most recent call last)
<ipython-input-16-290e9c1020c9> in <module>()
     17 #cnxn = pyodbc.connect(connection_str)
     18 #engn.connect()
---> 19 df.to_sql(tbl_server_out, engn, if_exists='append', index=False)

/home/saravji/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in to_sql(self, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
   1343         sql.to_sql(self, name, con, flavor=flavor, schema=schema,
   1344                    if_exists=if_exists, index=index, index_label=index_label,
-> 1345                    chunksize=chunksize, dtype=dtype)
   1346 
   1347     def to_pickle(self, path, compression='infer'):

/home/saravji/anaconda3/lib/python3.6/site-packages/pandas/io/sql.py in to_sql(frame, name, con, flavor, schema, if_exists, index, index_label, chunksize, dtype)
    469     pandas_sql.to_sql(frame, name, if_exists=if_exists, index=index,
    470                       index_label=index_label, schema=schema,
--> 471                       chunksize=chunksize, dtype=dtype)
    472 
    473 

/home/saravji/anaconda3/lib/python3.6/site-packages/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype)
   1148                          if_exists=if_exists, index_label=index_label,
   1149                          schema=schema, dtype=dtype)
-> 1150         table.create()
   1151         table.insert(chunksize)
   1152         if (not name.isdigit() and not name.islower()):

/home/saravji/anaconda3/lib/python3.6/site-packages/pandas/io/sql.py in create(self)
    596                     "'{0}' is not valid for if_exists".format(self.if_exists))
    597         else:
--> 598             self._execute_create()
    599 
    600     def insert_statement(self):

/home/saravji/anaconda3/lib/python3.6/site-packages/pandas/io/sql.py in _execute_create(self)
    581         # Inserting table into database, add to MetaData object
    582         self.table = self.table.tometadata(self.pd_sql.meta)
--> 583         self.table.create()
    584 
    585     def create(self):

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/sql/schema.py in create(self, bind, checkfirst)
    754         bind._run_visitor(ddl.SchemaGenerator,
    755                           self,
--> 756                           checkfirst=checkfirst)
    757 
    758     def drop(self, bind=None, checkfirst=False):

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _run_visitor(self, visitorcallable, element, connection, **kwargs)
   1927                      connection=None, **kwargs):
   1928         with self._optional_conn_ctx_manager(connection) as conn:
-> 1929             conn._run_visitor(visitorcallable, element, **kwargs)
   1930 
   1931     class _trans_ctx(object):

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _run_visitor(self, visitorcallable, element, **kwargs)
   1536     def _run_visitor(self, visitorcallable, element, **kwargs):
   1537         visitorcallable(self.dialect, self,
-> 1538                         **kwargs).traverse_single(element)
   1539 
   1540 

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/sql/visitors.py in traverse_single(self, obj, **kw)
    119             meth = getattr(v, "visit_%s" % obj.__visit_name__, None)
    120             if meth:
--> 121                 return meth(obj, **kw)
    122 
    123     def iterate(self, obj):

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/sql/ddl.py in visit_table(self, table, create_ok, include_foreign_key_constraints, _is_metadata_operation)
    765             CreateTable(
    766                 table,
--> 767                 include_foreign_key_constraints=include_foreign_key_constraints
    768             ))
    769 

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in execute(self, object, *multiparams, **params)
    943             raise exc.ObjectNotExecutableError(object)
    944         else:
--> 945             return meth(self, multiparams, params)
    946 
    947     def _execute_function(self, func, multiparams, params):

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/sql/ddl.py in _execute_on_connection(self, connection, multiparams, params)
     66 
     67     def _execute_on_connection(self, connection, multiparams, params):
---> 68         return connection._execute_ddl(self, multiparams, params)
     69 
     70     def execute(self, bind=None, target=None):

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_ddl(self, ddl, multiparams, params)
   1000             compiled,
   1001             None,
-> 1002             compiled
   1003         )
   1004         if self._has_events or self.engine._has_events:

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1187                 parameters,
   1188                 cursor,
-> 1189                 context)
   1190 
   1191         if self._has_events or self.engine._has_events:

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
   1400                 util.raise_from_cause(
   1401                     sqlalchemy_exception,
-> 1402                     exc_info
   1403                 )
   1404             else:

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/util/compat.py in raise_from_cause(exception, exc_info)
    201     exc_type, exc_value, exc_tb = exc_info
    202     cause = exc_value if exc_value is not exception else None
--> 203     reraise(type(exception), exception, tb=exc_tb, cause=cause)
    204 
    205 if py3k:

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/util/compat.py in reraise(tp, value, tb, cause)
    184             value.__cause__ = cause
    185         if value.__traceback__ is not tb:
--> 186             raise value.with_traceback(tb)
    187         raise value
    188 

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1180                         statement,
   1181                         parameters,
-> 1182                         context)
   1183         except BaseException as e:
   1184             self._handle_dbapi_exception(

/home/saravji/anaconda3/lib/python3.6/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
    468 
    469     def do_execute(self, cursor, statement, parameters, context=None):
--> 470         cursor.execute(statement, parameters)
    471 
    472     def do_execute_no_params(self, cursor, statement, context=None):

ProgrammingError: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]The statement failed. Column 'b' has a data type that cannot participate in a columnstore index.\r\nOperation cancelled by user. (35343) (SQLExecDirectW)") [SQL: '\nCREATE TABLE [dbo.MSODS_DSI] (\n\t[a] DATETIME NULL, \n\t[b] VARCHAR(max) NULL, \n\t[c] VARCHAR(max) NULL, \n\t[d] VARCHAR(max) NULL, \n\t[e] VARCHAR(max) NULL, \n\t[f] FLOAT(53) NULL, \n\t[g] FLOAT(53) NULL\n)\n\n']

Problem description

In the configuration python 3.6.1, pandas, pyodbc, sqlalchemy and Azure SQL DataWarehouse the df.to_sql(..., if_exists='append') call actually executes a create table sql statement (with deviating from the existing table column definition). As the table exists, this is supposed to fail. It appears that the create table statement should not be executed if the option "if_exists='append'" is set? There is no issue with the installation / database connection. I can execute sql queries with raw pyodbc (insert, delete, select) as well as execute read_sql.

Note: A patch to sqlalchemy needs to be applied to prevent an additional error from happening. see this thread for details: google mailing list sqlalchemy thread pandas to MS SQL DataWarehouse (to_sql)

Note: We receive a lot of issues on our GitHub tracker: I have checked and could not find related issues.

Note: Many problems can be resolved by simply upgrading pandas to the latest version. Before submitting, please check if that solution works for you: I'm on very recent version, the updates to 0.20.2 and 0.20.3 do not list any changes to the to_sql() functionality

Expected Output

Output of pd.show_versions()

[paste the output of ``pd.show_versions()`` here below this line] INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-93-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.20.1 pytest: 3.0.7 pip: 9.0.1 setuptools: 27.2.0 Cython: 0.25.2 numpy: 1.12.1 scipy: 0.19.0 xarray: None IPython: 5.3.0 sphinx: 1.5.6 patsy: 0.4.1 dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.3.0 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: 1.2.0 xlsxwriter: 0.9.6 lxml: 3.7.3 bs4: 4.6.0 html5lib: 0.999 sqlalchemy: 1.1.13 pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None
chris-b1 commented 7 years ago

I'd suggest debugging around here, and see why exists() isn't returning true. https://github.com/pandas-dev/pandas/blob/328c7e179b72e257e27adf92a06718fd5a40473f/pandas/io/sql.py#L590

PR welcome to fix!