petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

etl.todb() may not create a table properly #492

Open thmsklngr opened 4 years ago

thmsklngr commented 4 years ago

Minimal, reproducible code sample, a copy-pastable example if possible

import petl as etl
import pymssql
import pymysql

# source: SQL Server via pymssql (credential variables defined elsewhere)
cnxn = pymssql.connect(server, username, password, database)

cursor = cnxn.cursor()
query = '''
SELECT Name, OperatingSystem, ILV
FROM
    dbo.Table 
'''

table_raw = etl.fromdb(cnxn, query)

# one row per (Name, OperatingSystem), with the ILV values joined into a string
table_agg = etl.aggregate(
    table_raw,
    key=('Name', 'OperatingSystem'),
    aggregation={'ilv_agg': ('ILV', etl.strjoin(','))})

# target: MySQL via pymysql (credentials removed)
target_conn = pymysql.connect()
target_cursor = target_conn.cursor()
target_cursor.execute('SET SQL_MODE=ANSI_QUOTES')

# drop and recreate the target table, then load the aggregated rows
etl.todb(
    table_agg,
    target_cursor,
    'petl_test',
    create=True,
    drop=True,
    commit=True,
    dialect='mysql')

Problem description

It seems that the table is not created with a VARCHAR length large enough for the longest value of 'Name':

.../site-packages/pymysql/cursors.py:170: Warning: (1265, "Data truncated for column 'Name' at row 5877")

I looked for a way to inspect the column types being used, or to set the VARCHAR length used when creating the table in the database, but without success.
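
A rough sketch for checking this up front, reusing table_agg from the snippet above (etl.values() iterates the values of a single field; the rest is plain Python):

# rough check of the VARCHAR length 'Name' would actually need;
# etl.values() iterates the values of a single field
max_name_len = max(len(str(v)) for v in etl.values(table_agg, 'Name') if v is not None)
print('longest Name value:', max_name_len)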

Version and installation information

alimanfoo commented 4 years ago

Hi @mosesontheweb, the schema is inferred by default from a sample of the first 1000 rows of the table, see create_table for the implementation. The schema inference is handled by SQLAlchemy. You may be able to resolve this either by increasing the number of rows sampled or by providing some metadata to help the schema inference. Apologies, I don't have time to dig further at the moment.
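
A minimal sketch of both suggestions, reusing table_agg and target_cursor from the snippet in the issue; the VARCHAR lengths below are placeholders, not values taken from the real data:

# option 1: sample=0 so types are inferred from the whole table
# (a full pass over the data, so it can be slow on large tables)
etl.todb(
    table_agg, target_cursor, 'petl_test',
    create=True, drop=True, commit=True,
    dialect='mysql', sample=0)

# option 2: create the table yourself with explicit lengths and let
# petl only load the rows (create is left at its default of False)
target_cursor.execute('DROP TABLE IF EXISTS petl_test')
target_cursor.execute('''
    CREATE TABLE petl_test (
        Name VARCHAR(500),
        OperatingSystem VARCHAR(255),
        ilv_agg TEXT
    )
''')
etl.todb(table_agg, target_cursor, 'petl_test', commit=True)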

thmsklngr commented 4 years ago

I'll try increasing the sample number; maybe that solves the issue. It's odd that the longer values only start somewhere around entry #5000+ in my example ...

Regards, Thomas

prae04 commented 3 years ago

Same issue here when using fromdb() -> todb(), basically trying to copy one table to another SQL database. The table load fails on certain varchar() columns: on creation, the inferred VARCHAR length does not fit all rows. Using the sample=0 arg causes todb() to run extremely slowly. The current table is ~500k rows.

import petl as etl
from sqlalchemy import create_engine

engine1 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % cloud_params)
engine2 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % prem_params)

table = etl.fromdb(engine2, 'SELECT * from bigtable')
etl.todb(table, engine1, tablename='newbigtable', schema='stg',
         create=True, dialect='mssql', commit=True, sample=10000)
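
Given that a full scan with sample=0 is too slow at ~500k rows, one possible workaround is to create the target table up front with explicit column types, then let petl only load the rows so it never infers a schema (create is left at its default of False). A rough sketch with SQLAlchemy, reusing the connection strings from the snippet above; the column names and lengths are placeholders and would need to mirror the real 'bigtable' columns:

import petl as etl
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

# connection strings as in the snippet above
engine1 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % cloud_params)
engine2 = create_engine("mssql+pyodbc:///?odbc_connect=%s" % prem_params)

# define the target table explicitly so no VARCHAR length is inferred;
# placeholder columns -- these must match the columns of 'bigtable'
metadata = MetaData(schema='stg')
Table(
    'newbigtable', metadata,
    Column('id', Integer),
    Column('description', String(4000)),
)
metadata.create_all(engine1)

# load only; the table already exists, so petl never samples for types
table = etl.fromdb(engine2, 'SELECT * from bigtable')
etl.todb(table, engine1, 'newbigtable', schema='stg', commit=True)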