petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

etl.todb() may not create a table properly #492

Open thmsklngr opened 4 years ago

thmsklngr commented 4 years ago

Minimal, reproducible code sample, a copy-pastable example if possible

import petl as etl
import pymssql
import pymysql

# source: SQL Server via pymssql (credential variables defined elsewhere)
cnxn = pymssql.connect(server, username, password, database)

cursor = cnxn.cursor()
query = '''
SELECT Name, OperatingSystem, ILV
FROM
    dbo.Table 
'''

table_raw = etl.fromdb(cnxn, query)

# one row per (Name, OperatingSystem), with the ILV values joined into a string
table_agg = etl.aggregate(
    table_raw,
    key=('Name', 'OperatingSystem'),
    aggregation={'ilv_agg': ('ILV', etl.strjoin(','))})

# target: MySQL via pymysql (credentials removed)
target_conn = pymysql.connect()
target_cursor = target_conn.cursor()
target_cursor.execute('SET SQL_MODE=ANSI_QUOTES')

# drop and recreate the target table, then load the aggregated rows
etl.todb(
    table_agg,
    target_cursor,
    'petl_test',
    create=True,
    drop=True,
    commit=True,
    dialect='mysql')

Problem description

It seems that the table is not created with a VARCHAR length large enough for the longest value of 'Name':

.../site-packages/pymysql/cursors.py:170: Warning: (1265, "Data truncated for column 'Name' at row 5877")

I looked for a way to inspect the column types being used, or to set the VARCHAR length used when creating the table in the database, but without success.
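
A rough sketch for checking this up front, reusing table_agg from the snippet above (etl.values() iterates the values of a single field; the rest is plain Python):

# rough check of the VARCHAR length 'Name' would actually need;
# etl.values() iterates the values of a single field
max_name_len = max(len(str(v)) for v in etl.values(table_agg, 'Name') if v is not None)
print('longest Name value:', max_name_len)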

Version and installation information

alimanfoo commented 4 years ago

Hi @mosesontheweb, the schema is inferred by default from a sample of the first 1000 rows of the table, see create_table for the implementation. The schema inference is handled by SQLAlchemy. You may be able to resolve this either by increasing the number of rows sampled or by providing some metadata to help the schema inference. Apologies, I don't have time to dig further at the moment.
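
A minimal sketch of both suggestions, reusing table_agg and target_cursor from the snippet in the issue; the VARCHAR lengths below are placeholders, not values taken from the real data:

# option 1: sample=0 so types are inferred from the whole table
# (a full pass over the data, so it can be slow on large tables)
etl.todb(
    table_agg, target_cursor, 'petl_test',
    create=True, drop=True, commit=True,
    dialect='mysql', sample=0)

# option 2: create the table yourself with explicit lengths and let
# petl only load the rows (create is left at its default of False)
target_cursor.execute('DROP TABLE IF EXISTS petl_test')
target_cursor.execute('''
    CREATE TABLE petl_test (
        Name VARCHAR(500),
        OperatingSystem VARCHAR(255),
        ilv_agg TEXT
    )
''')
etl.todb(table_agg, target_cursor, 'petl_test', commit=True)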

thmsklngr commented 4 years ago

I'll try increasing the sample number; maybe that solves the issue. It's odd that the longer values only start somewhere around entry #5000+ in my example ...

Regards, Thomas

prae04 commented 3 years ago

Same issue here when using fromdb() -> todb(), basically trying to copy one table to another SQL database. The table load fails on certain varchar() columns: on creation, the inferred VARCHAR length does not fit all rows. Using the sample=0 arg causes todb() to run extremely slowly. The current table is ~500k rows.

import petl as etl
from sqlalchemy import create_engine

engine1 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % cloud_params)
engine2 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % prem_params)

table = etl.fromdb(engine2, 'SELECT * from bigtable')
etl.todb(table, engine1, tablename='newbigtable', schema='stg',
         create=True, dialect='mssql', commit=True, sample=10000)
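
Given that a full scan with sample=0 is too slow at ~500k rows, one possible workaround is to create the target table up front with explicit column types, then let petl only load the rows so it never infers a schema (create is left at its default of False). A rough sketch with SQLAlchemy, reusing the connection strings from the snippet above; the column names and lengths are placeholders and would need to mirror the real 'bigtable' columns:

import petl as etl
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

# connection strings as in the snippet above
engine1 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % cloud_params)
engine2 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % prem_params)

# define the target table explicitly so no VARCHAR length is inferred;
# placeholder columns -- these must match the columns of 'bigtable'
metadata = MetaData(schema='stg')
Table(
    'newbigtable', metadata,
    Column('id', Integer),
    Column('description', String(4000)),
)
metadata.create_all(engine1)

# load only; the table already exists, so petl never samples for types
table = etl.fromdb(engine2, 'SELECT * from bigtable')
etl.todb(table, engine1, 'newbigtable', schema='stg', commit=True)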