pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Shortcut for getting examples of the first few groups from a GroupBy? #9397

Open shoyer opened 9 years ago

shoyer commented 9 years ago

For visualization/testing purposes, I'm often interested in looking at the first example group(s) from a groupby operation.

Is there a convenient shortcut for this? The best I could come up with is pd.concat([x for _, x in itertools.islice(group, 3)]) which seemed awkward to me.

Note that this is a different use-case from .first()/.head(), which returns the first example from each group -- here I want full examples of the first few groups.
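To make the distinction concrete, here is a minimal sketch of the `islice` approach next to `.head()`. The toy frame, the column names, and the helper name `first_groups` are all made up for illustration; this is not an existing pandas API.

```python
import itertools

import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"key": ["a", "b", "a", "c", "b", "d"], "val": range(6)})
grouped = df.groupby("key")


def first_groups(grouped, n):
    """Concatenate the full contents of the first n groups (hypothetical helper)."""
    # Iterating a GroupBy yields (key, sub-frame) pairs; keep only the frames.
    return pd.concat([frame for _, frame in itertools.islice(grouped, n)])


# All rows of the first two groups ("a" and "b").
sample = first_groups(grouped, 2)
print(sample)

# By contrast, .head(1) keeps the first row of *every* group.
print(grouped.head(1))
```

The helper returns every row of the first two groups, while `.head(1)` returns one row per group across all four groups.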

TomAugspurger commented 9 years ago

I typically do the same as you, but most often it's just a single group. In that case it's just group = next(iter(gr)) which isn't bad. We could overload __getitem__ so that gr[:5] is pretty much this, but I don't know if the use-case warrants that extra complexity.
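A small sketch of the single-group case, using an invented frame. Note that iterating a GroupBy yields `(key, sub-frame)` pairs, so `next(iter(gr))` returns a tuple that usually needs unpacking:

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"key": ["b", "a", "b"], "val": [1, 2, 3]})
gr = df.groupby("key")

# The first item of the iteration is a (key, sub-frame) tuple.
key, group = next(iter(gr))
print(key)
print(group)
```

With the default `sort=True`, the first group here is the one with key `"a"`, even though `"b"` appears first in the data.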

cpcloud commented 9 years ago

This can be shortened by using the toolz library:

from toolz import take
pd.concat([x for _, x in take(3, group)])

jreback commented 9 years ago

Example of getting the last group here:

http://stackoverflow.com/questions/28694208/how-to-get-last-group-in-pandas-groupby/28695301#28695301

In [12]: df = pd.DataFrame({'a':['1','2','2','4','5','2'], 'b':np.random.randn(6)})

In [13]: g = df.groupby('a')

In [14]: g.groups
Out[14]: {'1': [0], '2': [1, 2, 5], '4': [3], '5': [4]}

In [15]: import itertools

In [16]: list(itertools.islice(g,len(g)-1,len(g)))
Out[16]: 
[('5',    a         b
  4  5 -0.644857)]
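
As an alternative sketch (not from the linked answer), a `collections.deque` with `maxlen=1` keeps only the last item of the iteration, which avoids computing `len(g)` up front. The frame below mirrors the one above but uses fixed values in column `b` so the result is reproducible:

```python
from collections import deque

import pandas as pd

# Same keys as the session above; column "b" uses fixed values for illustration.
df = pd.DataFrame({"a": ["1", "2", "2", "4", "5", "2"], "b": range(6)})
g = df.groupby("a")

# A deque with maxlen=1 discards everything but the final (key, sub-frame) pair.
last_key, last_group = deque(g, maxlen=1).pop()
print(last_key)
print(last_group)
```

Like the `islice` version, this still walks through every group once.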

jorisvandenbossche commented 9 years ago

Do we want a convenience function here?

Just putting some random ideas:

cchwala commented 6 years ago

Getting groups by index would be useful for my application.

I have implemented a simple function with the name get_igroup() as suggested by @jorisvandenbossche. The implementation is based on how get_group() does it via _get_indices().


import pandas as pd
import numpy as np

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

s = pd.Series(np.random.randn(8), index=index)

print(s)

def get_igroup(g, i):
    """ Get groupby group by index

    g : pandas groupby object
    i : int, position of the group among the sorted group keys
    """

    # Sort the keys so that position i is deterministic
    # (dict views have no .sort() in Python 3, so use sorted()).
    keys = sorted(g.indices.keys())
    indices = g.indices.get(keys[i])
    # _selected_obj is a private attribute and may change between versions.
    return g._selected_obj.take(indices)

print('\n==================================')
print("Testing with `.groupby('first')` ")
g = s.groupby('first')

for i in [0, 1, -1]:
    print('\n------------------------------- \nget_group for index=%d \n' % i)
    print(get_igroup(g, i).head())

print('\n==================================')
print("Testing with `.groupby(['first', 'second'])` ")
g = s.groupby(['first', 'second'])

for i in [0, 1, -1]:
    print('\n------------------------------- \nget_group for index=%d \n' % i)
    print(get_igroup(g, i).head())

Output of the script above:

```
first  second
bar    one      -0.376155
       two      -0.521434
baz    one      -0.143541
       two      -0.043723
foo    one       0.289646
       two      -0.716117
qux    one      -1.460004
       two       0.729040
dtype: float64

==================================
Testing with `.groupby('first')` 

------------------------------- 
get_group for index=0 

first  second
bar    one      -0.376155
       two      -0.521434
dtype: float64

------------------------------- 
get_group for index=1 

first  second
baz    one      -0.143541
       two      -0.043723
dtype: float64

------------------------------- 
get_group for index=-1 

first  second
qux    one      -1.460004
       two       0.729040
dtype: float64

==================================
Testing with `.groupby(['first', 'second'])` 

------------------------------- 
get_group for index=0 

first  second
bar    one      -0.376155
dtype: float64

------------------------------- 
get_group for index=1 

first  second
bar    two      -0.521434
dtype: float64

------------------------------- 
get_group for index=-1 

first  second
qux    two       0.72904
dtype: float64
```

@jreback If desired, I could implement this or something similar via a PR. I am not sure how much effort testing it would take, though.

WillAyd commented 6 years ago

I'm -1 on this as I don't think we make any guarantees about the ordering of groups within a groupby

cchwala commented 6 years ago

@WillAyd: Since I sort the group keys, the resulting order of the index retrieval should be deterministic even though the keys are unordered at first. The sorting also works correctly for tuple keys, as you can see in my example. But I understand that this might have some edge cases which lead to inconsistencies.
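
The ordering question can be checked on a small example. The sketch below uses invented data and a hypothetical helper named `nth_group`; it relies only on the public `groups` and `get_group()` members rather than private attributes. It assumes the group keys are mutually comparable, so mixed-type or missing keys are exactly the kind of edge case where `sorted()` could fail.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"key": ["b", "a", "c", "a"], "val": [1, 2, 3, 4]})

# With the default sort=True, iteration follows the sorted group keys.
sorted_keys = [k for k, _ in df.groupby("key")]
# With sort=False, groups come in order of first appearance.
appearance_keys = [k for k, _ in df.groupby("key", sort=False)]
print(sorted_keys, appearance_keys)


def nth_group(g, i):
    """Return the i-th group by sorted key (hypothetical helper, not a pandas API)."""
    # Sorting explicitly makes the result independent of the `sort` argument.
    keys = sorted(g.groups)
    return g.get_group(keys[i])


first = nth_group(df.groupby("key", sort=False), 0)
last = nth_group(df.groupby("key", sort=False), -1)
print(first)
print(last)
```

Because the helper sorts the keys itself, it returns the same group for a given position whether the GroupBy was created with `sort=True` or `sort=False`.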