Closed TinMarkovic closed 2 years ago
Hi,
I was able to reproduce the issue in my machine too. I'll attempt to fix it, would that be fine @juarezr ?
Great! @javidy !
Thanks for your time.
Hi,
Following example worked just fine (when values of 'foo' are 'a' and 'b' instead of 'aaa', 'bbb'
>>> table1 = [['foo', 'bar'], ['a', 1], ['a', 2], ['b', 1], ['b', 2], ['b', 3]]
>>> table2 = etl.aggregate(table1, key=["foo", ], aggregation=len)
>>> table2
+-----+-------+
| foo | value |
+=====+=======+
| 'a' | 2 |
+-----+-------+
| 'b' | 3 |
+-----+-------+
Apparently, values of 'foo' somehow breaks the iteration as @TinMarkovic mentions in issue description. Debugging....
I investigated it a bit further, have some findings now. The issue seems to be connected with following lines in petl>transform>reductions.py (func: itersimpleaggregate)
def itersimpleaggregate(table, key, aggregation, value, field):
.........
# generate data
if isinstance(key, (list, tuple)):
for k, grp in rowgroupby(table, key, value):
yield tuple(k) + (aggregation(grp),)
..........
Specifically, line where result is yielded. So when table1 is grouped by single key ('foo') itertools.groupby returns key as single element 'aaa' instead of tuple ('aaa',). Then tuple(k) turns 'aaa' into ('a', 'a', 'a'), hence the issue. I don't have at the moment the fix but at least we now know where exactly iterator gets broken. If you have ideas how we can fix it feel free to comment here :)
Hi @alimanfoo,
If you have some spare time, could you help us to find the best approach for fixing this?
@javidy,
As @alimanfoo is the code original author, it would be interesting to listen to his opinion. But as he has been very busy, it's possible that he can't help us.
Maybe we could explore some possible solutions like:
yield
line into a test checking if the variable k is a tuple as in the example below.rowgroupby()
shouldn't return a tuple for the case of the iterator returning one row.After finding a proper solution, we can rely on the unit test for ensuring that we aren't introducing a regression.
def itersimpleaggregate(table, key, aggregation, value, field):
##########
# generate data
if isinstance(key, (list, tuple)):
for k, grp in rowgroupby(table, key, value):
if isinstance(k, tuple):
yield tuple(k) + (aggregation(grp),)
else:
yield tuple((k,)) + (aggregation(grp),)
##########
Thanks @juarezr , this sounds like a good plan to me and seems like it should fix the issue at least for this particular example.
I think following also should work. Have a look at additional condition len(key) > 1. So we can kind of test if single key is provided, if yes then it'll run code under else. It looks like that part of code was built for single-key scenarios?
###################
# generate data
if isinstance(key, (list, tuple)) and len(key) > 1:
for k, grp in rowgroupby(table, key, value):
yield tuple(k) + (aggregation(grp),)
elif key is None:
# special case counting
if aggregation == len:
yield nrows(table),
else:
yield aggregation(values(table, value)),
else:
for k, grp in rowgroupby(table, key, value):
yield k, aggregation(grp)
####################
Sorry for the delay. I'm a little busy right now.
I have two questions about the code above that we can clear:
yields
?counting
function already in petl
? I am wondering where would be the best fit for this useful functionality.I hadn't the time to go deep into the source code yet, so I am writing this from what I can remember. So consider the questions just as explorative. :smiley:
hi @juarezr ,
I'm also a little busy these days, sorry for late reply. I'll get on to that as soon as I get some free time.
Hi @juarezr @javidy
I suggest adding the following (location line 262 in reductions.py):
if isinstance(key, (list, tuple)) and len(key) == 1:
key = key[0]
As pointed out earlier by @javidy , it will be handled by else clause after this transformation.
isinstance(key, (list, tuple))
is being checked twice in petl>transform>reductions.py>itersimpleaggregate
. It would make sense to effect this transformation before any of these conditions are evaluated.
Hi @MalayGoel,
Would you mind sending a PR?
@juarezr I hope we can close this issue.
@MalayGoel, thanks.
Minimal, reproducible code sample, a copy-pastable example if possible
Problem description
Seems that iteration is broken somewhere inside aggregation when a key is supplied as a single element of a list. Happened upon it while using key programmatically and the user supplied a single value.
At the very least, should error/fail instead of acting this unpredictably.
Thanks for looking into it :)
Version and installation information
petl.__version__ == 1.7.4
Python 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0] on linux