petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License

petl todataframe() runs the lambda function in addfield multiple times #578

Closed ikanashov closed 2 years ago

ikanashov commented 2 years ago

Minimal, reproducible code sample, a copy-pastable example if possible

import petl
z = 0
def tost(sql):
    global z
    z += 1
    print('z=', z)
    return str(z)

table = [['md5', 'sql'], [1, 'select from *'], [2, 'select from tt'], [3, 'select from ddd']]
>>> petl.wrap(table)
+-----+-------------------+
| md5 | sql               |
+=====+===================+
|   1 | 'select from *'   |
+-----+-------------------+
|   2 | 'select from tt'  |
+-----+-------------------+
|   3 | 'select from ddd' |
+-----+-------------------+

>>> petl.wrap(table).addfield('tables', lambda row: tost(row['sql']))
z= 1
z= 2
z= 3
+-----+-------------------+--------+
| md5 | sql               | tables |
+=====+===================+========+
|   1 | 'select from *'   | '1'    |
+-----+-------------------+--------+
|   2 | 'select from tt'  | '2'    |
+-----+-------------------+--------+
|   3 | 'select from ddd' | '3'    |
+-----+-------------------+--------+

z = 0
>>> petl.wrap(table).addfield('tables', lambda row: tost(row['sql'])).todataframe()
z= 1
z= 2
z= 3
z= 4
z= 5
z= 6
z= 7
z= 8
z= 9
   md5              sql tables
0    1    select from *      7
1    2   select from tt      8
2    3  select from ddd      9
z = 0
>>> petl.wrap(table).addfield('tables', lambda row: tost(row['sql'])).tupleoftuples()
z= 1
z= 2
z= 3
(('md5', 'sql', 'tables'), (1, 'select from *', '1'), (2, 'select from tt', '2'), (3, 'select from ddd', '3'))

Problem description

When converting a petl.Table to a pandas.DataFrame with todataframe(), the lambda function passed to addfield() runs three times for each row instead of once.

Version and installation information

dnicolodi commented 2 years ago

The issue is caused by the implementation of todataframe() calling list() on the table. The list constructor in turn calls __len__() (twice, the second time indirectly through __length_hint__()), and the implementation of __len__() for petl objects iterates over the whole table to get its length.

The issue can be solved either by not calling list() in todataframe(), or by calling list(iter(table)) instead. I'll prepare a PR later.
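
For anyone who wants to see the mechanism without petl or pandas installed, here is a minimal sketch. CountingTable is a hypothetical stand-in, not petl's actual class; it only assumes what is described above, namely that __len__() finds the length by iterating. The exact number of extra passes depends on the Python implementation and version, so the script prints the counts rather than assuming them.

```python
class CountingTable:
    """Hypothetical stand-in for a lazy table; NOT petl's actual class."""

    def __init__(self, rows):
        self.rows = rows
        self.passes = 0  # how many times the rows have been iterated

    def __iter__(self):
        # Generator: the counter is bumped when iteration actually starts.
        self.passes += 1
        for row in self.rows:
            yield row

    def __len__(self):
        # Assumption taken from the comment above: the length is
        # computed by walking over the whole table.
        return sum(1 for _ in self)


rows = [("md5", "sql"), (1, "select from *"), (2, "select from tt")]

# list() on the object itself may ask for its length before consuming it.
direct = CountingTable(rows)
list(direct)
direct_passes = direct.passes

# list() on a plain iterator has no __len__() to call on the table.
wrapped = CountingTable(rows)
list(iter(wrapped))
wrapped_passes = wrapped.passes

print("passes with list(table):      ", direct_passes)
print("passes with list(iter(table)):", wrapped_passes)
```

Every extra pass re-runs whatever function was given to addfield(), which is why the counter in the report keeps climbing. Going through iter() first hands list() a plain generator, so the table is walked only once.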