petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

PETL causes pymysql.err.InterfaceError: (0, '') #422

Open coriolinus opened 7 years ago

coriolinus commented 7 years ago

This error is apparently caused by attempted use of a connection or cursor which has already been closed, per StackOverflow. My code doesn't handle any cursors at all; all DB access is managed through PETL, and a new connection is created each time .todb() or .fromdb() is called.

The source DB is RDS/Aurora; the target is sometimes sqlite3, sometimes RDS/Postgres. Since the error comes via pymysql, the previously-closed connection must belong to the source DB, not the target.
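
For reference, the access pattern is roughly the following (a trimmed sketch of the excerpts posted further down; connect_source and connect_dest are helpers in my script that open a brand-new connection on every call):

    import petl

    # extract: petl.fromdb() receives a freshly-opened pymysql connection each time
    table = petl.fromdb(connect_source(auth), query)
    # transform: lazy petl operations are stacked on top of the source view
    table = transform_func(table, *tf_args, **tf_kwargs)
    # load: .todb() likewise receives a freshly-opened connection to the target
    table.todb(connect_dest(auth, destination), tablename=output_tablename)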

Full stack trace follows:

$ ./etl.py -l
Reading configuration data... (0.00/0.00)s
Extracting admin users... (0.80/0.80)s
Extracting normal users... (0.74/1.55)s
Preparing admin user transformation layer... (0.76/2.30)s
Preparing set of admin user emails... (0.38/2.68)s
Preparing normal user transformation layer... (7.92/10.60)s
Preparing complete user transformation layer... (4.25/14.85)s
Preparing lookup set of valid user ids... (4.23/19.09)s
Loading users table into auth_user...
100 rows in 3.50s (28 row/s); batch in 3.50s (28 row/s)
200 rows in 3.69s (54 row/s); batch in 0.18s (541 row/s)
300 rows in 3.73s (80 row/s); batch in 0.04s (2468 row/s)
400 rows in 3.88s (103 row/s); batch in 0.15s (676 row/s)
500 rows in 3.92s (127 row/s); batch in 0.04s (2502 row/s)
600 rows in 3.94s (152 row/s); batch in 0.02s (4238 row/s)
700 rows in 3.97s (176 row/s); batch in 0.04s (2801 row/s)
800 rows in 4.07s (196 row/s); batch in 0.10s (1050 row/s)
900 rows in 4.10s (219 row/s); batch in 0.03s (3339 row/s)
1000 rows in 4.14s (241 row/s); batch in 0.04s (2691 row/s)
1100 rows in 4.18s (263 row/s); batch in 0.04s (2560 row/s)
1200 rows in 4.21s (284 row/s); batch in 0.04s (2659 row/s)
1253 rows in 4.23s (296 row/s); batches in 0.35 +/- 0.95s [0.02-3.50] (2129 +/- 1211 rows/s [28-4238])
(4.71/23.80)s
Extracting universities table... (0.82/24.63)s
Preparing to transform universities table... (0.85/25.48)s
Preparing to transform user profile... (0.00/25.48)s
Loading user profile into core_userprofile...
50 rows in 144.30s (0 row/s); batch in 144.30s (0 row/s)
100 rows in 269.07s (0 row/s); batch in 124.77s (0 row/s)
150 rows in 397.52s (0 row/s); batch in 128.44s (0 row/s)
200 rows in 518.95s (0 row/s); batch in 121.44s (0 row/s)
250 rows in 663.08s (0 row/s); batch in 144.12s (0 row/s)
300 rows in 793.66s (0 row/s); batch in 130.58s (0 row/s)
350 rows in 916.43s (0 row/s); batch in 122.77s (0 row/s)
400 rows in 1035.08s (0 row/s); batch in 118.65s (0 row/s)
450 rows in 1150.27s (0 row/s); batch in 115.19s (0 row/s)
500 rows in 1258.56s (0 row/s); batch in 108.29s (0 row/s)
550 rows in 1366.53s (0 row/s); batch in 107.97s (0 row/s)
600 rows in 1478.12s (0 row/s); batch in 111.58s (0 row/s)
650 rows in 1585.24s (0 row/s); batch in 107.13s (0 row/s)
700 rows in 1691.73s (0 row/s); batch in 106.49s (0 row/s)
750 rows in 1796.29s (0 row/s); batch in 104.56s (0 row/s)
800 rows in 1905.31s (0 row/s); batch in 109.03s (0 row/s)
850 rows in 2011.86s (0 row/s); batch in 106.55s (0 row/s)
900 rows in 2122.80s (0 row/s); batch in 110.94s (0 row/s)
950 rows in 2238.93s (0 row/s); batch in 116.13s (0 row/s)
1000 rows in 2359.86s (0 row/s); batch in 120.93s (0 row/s)
1050 rows in 2475.09s (0 row/s); batch in 115.23s (0 row/s)
1100 rows in 2589.37s (0 row/s); batch in 114.28s (0 row/s)
1150 rows in 2710.57s (0 row/s); batch in 121.20s (0 row/s)
1200 rows in 2828.02s (0 row/s); batch in 117.45s (0 row/s)
1249 rows in 2929.80s (0 row/s); batches in 117.83 +/- 10.61s [104.56-144.30] (0 +/- 0 rows/s [0-0])
(2935.72/2961.20)s
Loading universities table into core_userprofile_universities...
! Lost connection to source db loading unis; reconnecting...
Extracting universities table... (0.83/2962.22)s
Preparing to transform universities table... (0.83/2963.06)s
Loading universities table into core_userprofile_universities...
100 rows in 0.63s (159 row/s); batch in 0.63s (159 row/s)
200 rows in 0.63s (317 row/s); batch in 0.00s (64796 row/s)
300 rows in 0.63s (475 row/s); batch in 0.00s (69986 row/s)
400 rows in 0.63s (631 row/s); batch in 0.00s (66980 row/s)
500 rows in 0.63s (788 row/s); batch in 0.00s (67573 row/s)
600 rows in 0.64s (943 row/s); batch in 0.00s (65937 row/s)
615 rows in 0.64s (966 row/s); batches in 0.11 +/- 0.23s [0.00-0.63] (55905 +/- 24981 rows/s [159-69986])
(0.86/2963.92)s
Extracting children... (0.75/2964.67)s
Preparing to transform children... (0.00/2964.67)s
Loading children into core_child...
100 rows in 96.06s (1 row/s); batch in 96.06s (1 row/s)
200 rows in 167.39s (1 row/s); batch in 71.34s (1 row/s)
300 rows in 241.47s (1 row/s); batch in 74.08s (1 row/s)
400 rows in 331.58s (1 row/s); batch in 90.11s (1 row/s)
500 rows in 395.52s (1 row/s); batch in 63.94s (1 row/s)
600 rows in 444.86s (1 row/s); batch in 49.34s (2 row/s)
700 rows in 488.18s (1 row/s); batch in 43.32s (2 row/s)
800 rows in 536.80s (1 row/s); batch in 48.62s (2 row/s)
900 rows in 580.63s (1 row/s); batch in 43.83s (2 row/s)
1000 rows in 622.97s (1 row/s); batch in 42.34s (2 row/s)
1100 rows in 664.70s (1 row/s); batch in 41.73s (2 row/s)
1200 rows in 694.21s (1 row/s); batch in 29.52s (3 row/s)
1300 rows in 740.61s (1 row/s); batch in 46.40s (2 row/s)
1400 rows in 798.21s (1 row/s); batch in 57.60s (1 row/s)
1500 rows in 844.36s (1 row/s); batch in 46.15s (2 row/s)
1600 rows in 894.27s (1 row/s); batch in 49.92s (2 row/s)
1700 rows in 930.98s (1 row/s); batch in 36.71s (2 row/s)
1789 rows in 981.40s (1 row/s); batches in 54.76 +/- 17.93s [29.52-96.06] (1 +/- 0 rows/s [1-3])
(981.64/3946.31)s
Preparing to transform languages table... (0.01/3946.32)s
Loading languages table into core_userprofile_languages...
! Lost connection to source db; reconnecting...
Preparing to transform languages table... (0.00/3946.51)s
Extracting mom times... (0.80/3947.31)s
Preparing to transform mom times... (0.00/3947.31)s
Loading mom times into core_userprofile_mom_times...
1000 rows in 0.82s (1216 row/s); batch in 0.82s (1216 row/s)
2000 rows in 1.04s (1922 row/s); batch in 0.22s (4583 row/s)
3000 rows in 1.25s (2400 row/s); batch in 0.21s (4780 row/s)
4000 rows in 1.41s (2836 row/s); batch in 0.16s (6228 row/s)
5000 rows in 1.55s (3221 row/s); batch in 0.14s (7049 row/s)
6000 rows in 1.67s (3602 row/s); batch in 0.11s (8823 row/s)
7000 rows in 1.78s (3928 row/s); batch in 0.12s (8598 row/s)
7926 rows in 1.89s (4187 row/s); batches in 0.25 +/- 0.23s [0.11-0.82] (5896 +/- 2454 rows/s [1216-8823])
(2.32/3949.63)s
Extracting kid times... (0.77/3950.40)s
Preparing to transform kid times... (0.00/3950.40)s
Loading kid times into core_userprofile_kid_times...
1000 rows in 0.81s (1227 row/s); batch in 0.81s (1227 row/s)
2000 rows in 1.02s (1968 row/s); batch in 0.20s (4977 row/s)
3000 rows in 1.24s (2421 row/s); batch in 0.22s (4481 row/s)
4000 rows in 1.39s (2883 row/s); batch in 0.15s (6736 row/s)
5000 rows in 1.52s (3292 row/s); batch in 0.13s (7625 row/s)
6000 rows in 1.63s (3670 row/s); batch in 0.12s (8616 row/s)
7000 rows in 1.74s (4023 row/s); batch in 0.11s (9502 row/s)
8000 rows in 1.90s (4217 row/s); batch in 0.16s (6366 row/s)
8762 rows in 1.97s (4439 row/s); batches in 0.24 +/- 0.22s [0.11-0.81] (6191 +/- 2456 rows/s [1227-9502])
(2.40/3952.81)s
Preparing to transform work table... (0.00/3952.81)s
Loading work table into core_userprofile_work...
Traceback (most recent call last):
  File "./etl.py", line 400, in <module>
    etl(args.destination, args.show_tables, args.load, args.timer)
  File "./etl.py", line 276, in etl
    pass
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "./etl.py", line 164, in mini_etl
    load_table(table, human_readable, output_tablename, load_override, progress)
  File "./etl.py", line 113, in load_table
    table.todb(connect_dest(auth, destination), tablename=output_tablename)
  File ".venv/lib/python3.4/site-packages/petl/io/db.py", line 337, in todb
    truncate=True)
  File ".venv/lib/python3.4/site-packages/petl/io/db.py", line 356, in _todb
    commit=commit, truncate=truncate)
  File ".venv/lib/python3.4/site-packages/petl/io/db.py", line 408, in _todb_dbapi_connection
    hdr = next(it)
  File ".venv/lib/python3.4/site-packages/petl/util/timing.py", line 57, in __iter__
    for n, r in enumerate(self.inner):
  File ".venv/lib/python3.4/site-packages/petl/transform/selects.py", line 115, in iterfieldselect
    hdr = next(it)
  File ".venv/lib/python3.4/site-packages/petl/transform/conversions.py", line 318, in iterfieldconvert
    hdr = next(it)
  File ".venv/lib/python3.4/site-packages/petl/transform/selects.py", line 115, in iterfieldselect
    hdr = next(it)
  File ".venv/lib/python3.4/site-packages/petl/transform/maps.py", line 295, in iterrowmapmany
    hdr = next(it)
  File ".venv/lib/python3.4/site-packages/petl/io/db.py", line 156, in _iter_dbapi_connection
    for row in _iter_dbapi_cursor(cursor, query, *args, **kwargs):
  File ".venv/lib/python3.4/site-packages/petl/io/db.py", line 163, in _iter_dbapi_cursor
    cursor.execute(query, *args, **kwargs)
  File ".venv/lib/python3.4/site-packages/pymysql/cursors.py", line 161, in execute
    while self.nextset():
  File ".venv/lib/python3.4/site-packages/pymysql/cursors.py", line 443, in nextset
    return self._nextset(unbuffered=True)
  File ".venv/lib/python3.4/site-packages/pymysql/cursors.py", line 93, in _nextset
    self._show_warnings()
  File ".venv/lib/python3.4/site-packages/pymysql/cursors.py", line 347, in _show_warnings
    ws = self._get_db().show_warnings()
  File ".venv/lib/python3.4/site-packages/pymysql/connections.py", line 797, in show_warnings
    self._execute_command(COMMAND.COM_QUERY, "SHOW WARNINGS")
  File ".venv/lib/python3.4/site-packages/pymysql/connections.py", line 1071, in _execute_command
    raise err.InterfaceError("(0, '')")
pymysql.err.InterfaceError: (0, '')
alimanfoo commented 7 years ago

Hi @coriolinus, could you post the relevant parts of your etl.py script?

coriolinus commented 7 years ago

Sure. By necessity, these are excerpts; they're verbose enough as it is.

Lines 108-167:

        def load_table(table, human_readable, output_tablename, load_override=load, progress=100):
            if load_override:
                with timed_op():
                    print("Loading {} into {}...".format(human_readable, output_tablename))
                    table = table.progress(progress)
                    table.todb(connect_dest(auth, destination), tablename=output_tablename)

        @contextmanager
        def mini_etl(human_readable, query,
                     output_tablename,
                     transform_func,
                     tf_args=[], tf_kwargs={},
                     progress=100,
                     use_table=None,
                     load_override=load,
                     show_override=show_tables):
            """
            Perform a batch of ETL operations for a single PETL table

            This should reduce the amount of code duplication necessary for
            standard table migration.

            query: raw SQL query which populates the initial table state
            output_tablename: name of the destination table
            transform_func: This function should transform the input PETL table into
                the output PETL table. Called like `transform_func(table, *tf_args, **tf_kwargs)`
            tf_args, tf_kwargs: Additional positional and keyword arguments for the transform func.
            progress: batch size when loading the output table. Default 100.
            use_table: if not None, ignore query and instead use this PETL table as input
            load_override: override whether to load this table
            show_override: override whether to show this table

            This is a context manager so that you have access to the table for additional
            operations, if necessary. transform_func runs before the table is returned.
            """
            def prepare_table():
                if use_table is not None:
                    table = use_table
                else:
                    with timed_op("Extracting {}...".format(human_readable)):
                        table = petl.fromdb(connect_source(auth), query)

                with timed_op("Preparing to transform {}...".format(human_readable)):
                    table = transform_func(table, *tf_args, **tf_kwargs)
                return table
            table = prepare_table()

            contained_table = MutableContainer(table)
            yield contained_table
            table = contained_table.item

            if show_override:
                print("Head of {}:".format(human_readable))
                print(table.head())

            try:
                load_table(table, human_readable, output_tablename, load_override, progress)
            except pymysql.err.OperationalError:
                print("! Lost connection to source db; reconnecting...")
                table = prepare_table()

Lines 273-276:

        with mini_etl('work table', '', linked_table.WORK_OUT_TABLE,
                      linked_table.create, [connect_dest(auth, destination), 'work', user_ids],
                      use_table=normal_users):
            pass

Lines 379-400:

if __name__ == '__main__':
    auth = getauth()
    destinations = list(auth['destination'].keys())

    import argparse
    parser = argparse.ArgumentParser(
        description='Extract, Transform, Load data from old db format to new')
    parser.add_argument('-s', '--show-tables', action='store_true',
                        help='Display the heads of various tables as they are processed.')
    parser.add_argument('-l', '--load', action='store_true',
                        help='Actually load the transformed data into the destination database. '
                             'Note that this eliminates any data currently stored there!')
    parser.add_argument('--no-timer', action='store_false', dest='timer',
                        help='Disable the automatic timing of these actions.')
    parser.add_argument('-d', '--destination', choices=destinations, default='default',
                        help='Choose the destination database into which to load the transformed '
                             'data. Sourced from the sub-keys of the `destination.XXX` tables '
                             'of `auth.toml`.')

    args = parser.parse_args()

    etl(args.destination, args.show_tables, args.load, args.timer)
alimanfoo commented 7 years ago

Sorry, nothing obvious jumps out.

coriolinus commented 7 years ago

Yes, if anything obvious had jumped out, I would have attempted a solution. I'd hoped that possibly the stack trace would be useful to someone knowledgeable about PETL internals.

I guess the fundamental question here is, if I simply catch the error, is there any way to hand PETL a new connection object to the source db and try the .todb() function again, without needing to go back and re-do all the intermediate transforms? If so, I can work around this. If not, this becomes much harder to solve.

alimanfoo commented 7 years ago

Petl never calls close() on a connection object. So if the underlying error is caused by trying to read from a closed connection, and you are indeed passing a DB-API connection object to fromdb(), then it is not obvious how that could happen. I would need a minimal reproducible example to go any further.

You can replace the connection object on a DbView instance (returned by fromdb()), e.g.:

    t = etl.fromdb(connection, ...)
    t.dbo = another_connection
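
In context, a retry could then look something like this (a minimal sketch, not tested; fresh_connection, do_transforms, and target_conn stand in for whatever your script uses to open the source connection, build the lazy pipeline, and reach the target):

    import petl as etl
    import pymysql

    source = etl.fromdb(fresh_connection(), query)  # keep a handle on the DbView
    pipeline = do_transforms(source)                # lazy petl transforms

    try:
        pipeline.todb(target_conn, tablename='out')
    except (pymysql.err.OperationalError, pymysql.err.InterfaceError):
        # swap a new connection onto the original DbView; since petl
        # evaluates lazily, re-running .todb() replays the whole pipeline
        # from the new connection without redefining the transforms
        source.dbo = fresh_connection()
        pipeline.todb(target_conn, tablename='out')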

...however I would recommend trying to understand the cause of the error first.

Hth.


elZagales commented 4 years ago

Hi, I am getting the same error: pymysql.err.InterfaceError: (0, '')

code:

    import petl
    import pymysql

    query = 'select * from columns where table_schema in ("employees");'
    connection = pymysql.connect(host='localhost', port=3306, user='root',
                                 password='test', db='information_schema')
    table = petl.fromdb(connection, query)
    new_table = petl.cut(table, 'TABLE_SCHEMA')
    print(new_table)

error: pymysql.err.InterfaceError: (0, '')

delubee commented 4 years ago

I am getting the same error when I fetch all the rows (about 420,000):

In [34]: table
Traceback (most recent call last):
  File "d:\ProgramData\Anaconda3\lib\site-packages\IPython\core\formatters.py", line 702, in __call__
    printer.pretty(obj)
  File "d:\ProgramData\Anaconda3\lib\site-packages\IPython\lib\pretty.py", line 405, in pretty
    return _repr_pprint(obj, self, cycle)
  File "d:\ProgramData\Anaconda3\lib\site-packages\IPython\lib\pretty.py", line 695, in _repr_pprint
    output = repr(obj)
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\util\vis.py", line 135, in _table_repr
    return str(look(table))
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\util\vis.py", line 104, in __repr__
    table, overflow = _vis_overflow(self.table, self.limit)
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\util\vis.py", line 528, in _vis_overflow
    table = list(islice(table, 0, limit+2))
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\io\db.py", line 156, in _iter_dbapi_connection
    for row in _iter_dbapi_cursor(cursor, query, *args, **kwargs):
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\io\db.py", line 163, in _iter_dbapi_cursor
    cursor.execute(query, *args, **kwargs)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\cursors.py", line 170, in execute
    result = self._query(query)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\cursors.py", line 328, in _query
    conn.query(q)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\connections.py", line 732, in _read_query_result
    result.read()
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\connections.py", line 671, in _read_packet
    % (packet_number, self._next_seq_id))
InternalError: Packet sequence number wrong - got 139 expected 1
Traceback (most recent call last):
  File "d:\ProgramData\Anaconda3\lib\site-packages\IPython\core\formatters.py", line 345, in __call__
    return method()
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\util\vis.py", line 549, in _display_html
    table, overflow = _vis_overflow(table, limit)
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\util\vis.py", line 528, in _vis_overflow
    table = list(islice(table, 0, limit+2))
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\io\db.py", line 156, in _iter_dbapi_connection
    for row in _iter_dbapi_cursor(cursor, query, *args, **kwargs):
  File "d:\ProgramData\Anaconda3\lib\site-packages\petl\io\db.py", line 163, in _iter_dbapi_cursor
    cursor.execute(query, *args, **kwargs)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\cursors.py", line 170, in execute
    result = self._query(query)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\cursors.py", line 328, in _query
    conn.query(q)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\connections.py", line 516, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "d:\ProgramData\Anaconda3\lib\site-packages\pymysql\connections.py", line 750, in _execute_command
    raise err.InterfaceError("(0, '')")
InterfaceError: (0, '')

But when I limit the rows to 1000, or first run df = etl.todataframe(table), the error goes away.
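
Both workarounds amount to consuming the result set in a single pass rather than repeatedly iterating a live cursor; roughly (a sketch, assuming pandas is installed for todataframe):

    import petl as etl

    # either cap how many rows are pulled through the cursor...
    print(etl.head(table, 1000))

    # ...or materialize the whole result in one pass and work in memory
    df = etl.todataframe(table)
    print(df.head())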