petl-developers / petl

Python Extract Transform and Load Tables of Data
MIT License
1.24k stars 193 forks source link

Aggregate is broken when supplying single key in an array #552

Closed TinMarkovic closed 2 years ago

TinMarkovic commented 3 years ago

Minimal, reproducible code sample, a copy-pastable example if possible

import petl

table1 = [['foo', 'bar'],
          ['aaa', 1],
          ['aaa', 2],
          ['bbb', 3],
          ['bbb', 4],
          ['bbb', 5],
          ['ccc', 6]]

table2 = petl.aggregate(table1, key=["foo", ], aggregation=len)

print(table2)
"""
+-----+-------+---+---+
| foo | value |   |   |
+=====+=======+===+===+
| a   | a     | a | 2 |
+-----+-------+---+---+
| b   | b     | b | 3 |
+-----+-------+---+---+
| c   | c     | c | 1 |
+-----+-------+---+---+
"""

Problem description

Seems that iteration is broken somewhere inside aggregation when a key is supplied as a single element of a list. Happened upon it while using key programmatically and the user supplied a single value.

At the very least, should error/fail instead of acting this unpredictably.

Thanks for looking into it :)

Version and installation information

javidy commented 3 years ago

Hi,

I was able to reproduce the issue in my machine too. I'll attempt to fix it, would that be fine @juarezr ?

juarezr commented 3 years ago

Great! @javidy !

Thanks for your time.

javidy commented 3 years ago

Hi,

Following example worked just fine (when values of 'foo' are 'a' and 'b' instead of 'aaa', 'bbb'

>>> table1 = [['foo', 'bar'], ['a', 1], ['a', 2], ['b', 1], ['b', 2], ['b', 3]]
>>> table2 = etl.aggregate(table1, key=["foo", ], aggregation=len)
>>> table2
+-----+-------+
| foo | value |
+=====+=======+
| 'a' |     2 |
+-----+-------+
| 'b' |     3 |
+-----+-------+

Apparently, values of 'foo' somehow breaks the iteration as @TinMarkovic mentions in issue description. Debugging....

javidy commented 3 years ago

I investigated it a bit further, have some findings now. The issue seems to be connected with following lines in petl>transform>reductions.py (func: itersimpleaggregate)

def itersimpleaggregate(table, key, aggregation, value, field):
 .........

    # generate data
    if isinstance(key, (list, tuple)):
        for k, grp in rowgroupby(table, key, value):
            yield tuple(k) + (aggregation(grp),)
   ..........

Specifically, line where result is yielded. So when table1 is grouped by single key ('foo') itertools.groupby returns key as single element 'aaa' instead of tuple ('aaa',). Then tuple(k) turns 'aaa' into ('a', 'a', 'a'), hence the issue. I don't have at the moment the fix but at least we now know where exactly iterator gets broken. If you have ideas how we can fix it feel free to comment here :)

juarezr commented 3 years ago

Hi @alimanfoo,

If you have some spare time, could you help us to find the best approach for fixing this?

@javidy,

As @alimanfoo is the code original author, it would be interesting to listen to his opinion. But as he has been very busy, it's possible that he can't help us.

Maybe we could explore some possible solutions like:

  1. Splitting the yield line into a test checking if the variable k is a tuple as in the example below.
  2. Checking if the rowgroupby() shouldn't return a tuple for the case of the iterator returning one row.

After finding a proper solution, we can rely on the unit test for ensuring that we aren't introducing a regression.

Example fix

def itersimpleaggregate(table, key, aggregation, value, field):
   ##########

    # generate data
    if isinstance(key, (list, tuple)):
        for k, grp in rowgroupby(table, key, value):
            if isinstance(k, tuple):
                yield tuple(k) + (aggregation(grp),)
            else:
                yield tuple((k,)) + (aggregation(grp),)
   ##########
javidy commented 3 years ago

Thanks @juarezr , this sounds like a good plan to me and seems like it should fix the issue at least for this particular example.

I think following also should work. Have a look at additional condition len(key) > 1. So we can kind of test if single key is provided, if yes then it'll run code under else. It looks like that part of code was built for single-key scenarios?

###################

    # generate data
    if isinstance(key, (list, tuple)) and len(key) > 1:
        for k, grp in rowgroupby(table, key, value):
            yield tuple(k) + (aggregation(grp),)
    elif key is None:
        # special case counting
        if aggregation == len:
            yield nrows(table),
        else:
            yield aggregation(values(table, value)),
    else:
        for k, grp in rowgroupby(table, key, value):
            yield k, aggregation(grp)

####################
juarezr commented 3 years ago

Sorry for the delay. I'm a little busy right now.

I have two questions about the code above that we can clear:

  1. The shape of returned iterator/table/resultset is the same for the 4 yields?
  2. Considering the special case counting, there is any counting function already in petl? I am wondering where would be the best fit for this useful functionality.

I hadn't the time to go deep into the source code yet, so I am writing this from what I can remember. So consider the questions just as explorative. :smiley:

javidy commented 3 years ago

hi @juarezr ,

I'm also a little busy these days, sorry for late reply. I'll get on to that as soon as I get some free time.

MalayGoel commented 2 years ago

Hi @juarezr @javidy

I suggest adding the following (location line 262 in reductions.py):

if isinstance(key, (list, tuple)) and len(key) == 1:
    key = key[0]

As pointed out earlier by @javidy , it will be handled by else clause after this transformation.

isinstance(key, (list, tuple)) is being checked twice in petl>transform>reductions.py>itersimpleaggregate. It would make sense to effect this transformation before any of these conditions are evaluated.

juarezr commented 2 years ago

Hi @MalayGoel,

Would you mind sending a PR?

MalayGoel commented 2 years ago

@juarezr Created PR 577. Please review.

MalayGoel commented 2 years ago

@juarezr I hope we can close this issue.

juarezr commented 2 years ago

@MalayGoel, thanks.