pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

DOC: update groupby NA group handing / workaround #5456

Closed mkeller-upb closed 2 years ago

mkeller-upb commented 11 years ago

Add more explicit docs / work-around for dealing with groupby and NA groups

(see comments)

Changelog 07 Nov 2013: added a line to the example below to preprocess the table content.

I expect the following behavior: DataFrame.groupby splits the dataframe/table into subtables according to the grouping condition. A single column name as the grouping condition gives me a subtable for each individual value in that column. Similarly, grouping by multiple columns (a list of column names) gives me a group for each occurring combination of those columns; put differently, the unique "keys" when grouping by multiple columns are tuples.
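A minimal sketch of what I expect (the toy frame and column names 'a' and 'b' are placeholders for illustration, not my actual data):

import pandas as pd

toy = pd.DataFrame({'a': ['x', 'x', 'y'],
                    'b': [1, 2, 1],
                    'val': [10, 20, 30]})

# grouping by a list of columns should yield one group per occurring
# combination, keyed by a tuple of the column values
keys = [key for key, _ in toy.groupby(['a', 'b'])]
# expected: [('x', 1), ('x', 2), ('y', 1)]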

So if my expectations are wrong: I could not infer a different meaning or expected behavior from the documentation (e.g. pandas.DataFrame.groupby.__doc__), so there is a lack of clarification.

Otherwise I have found a bug and am in need of a fix: some existing combinations do not get a group or split subtable -- I checked this with drop_duplicates. Moreover, grouped.__iter__ ignores more/other combinations than grouped.groups.keys() does -- here I would also expect both to follow the same implementation...

I tracked it down into the pandas internals to pandas.core.Grouper._get_group_keys, or rather _KeyMapper.get_key: self.levels looks good, but the list-comprehension/get/zip logic goes wrong, or possibly pandas.core.Grouper.group_info provides a too-small ngroups value, or something else.

pandas.__version__: 0.12.0-1062-g3c57949 (from 6 Nov 2013), numpy.__version__: 1.7.2, Mac OS X 10.9

Test Example:


import pickle
import sys
import os

import pandas as pd

grp_cols = ['algorithm', 'customalpha']
df = "ccopy_reg\n_reconstructor\np0\n(cpandas.core.frame\nDataFrame\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\ng0\n(cpandas.core.internals\nBlockManager\np5\ng2\nNtp6\nRp7\n((lp8\ncnumpy.core.multiarray\n_reconstruct\np9\n(cpandas.core.index\nIndex\np10\n(I0\ntp11\nS'b'\np12\ntp13\nRp14\n((I1\n(I2\ntp15\ncnumpy\ndtype\np16\n(S'O8'\np17\nI0\nI1\ntp18\nRp19\n(I3\nS'|'\np20\nNNNI-1\nI-1\nI63\ntp21\nbI00\n(lp22\nS'algorithm'\np23\naS'customalpha'\np24\natp25\n(Ntp26\ntp27\nbag9\n(cpandas.core.index\nInt64Index\np28\n(I0\ntp29\ng12\ntp30\nRp31\n((I1\n(I13\ntp32\ng16\n(S'i8'\np33\nI0\nI1\ntp34\nRp35\n(I3\nS'<'\np36\nNNNI-1\nI-1\nI0\ntp37\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0c\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x11\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x15\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x17\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x18\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1a\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1e\\x00\\x00\\x00\\x00\\x00\\x00\\x00 \\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np38\ntp39\n(Ntp40\ntp41\nba(lp42\ng9\n(cnumpy\nndarray\np43\n(I0\ntp44\ng12\ntp45\nRp46\n(I1\n(I2\nI13\ntp47\ng16\n(S'O8'\np48\nI0\nI1\ntp49\nRp50\n(I3\nS'|'\np51\nNNNI-1\nI-1\nI63\ntp52\nbI00\n(lp53\nS'ScenarioAlgoLocalHeuristicM'\np54\naS'ScenarioAlgoLocalHeuristicM'\np55\naS'ScenarioAlgoLocalHeuristicM'\np56\naS'ScenarioAlgoLocalHeuristicM'\np57\naS'ScenarioAlgoLocalHeuristicMC'\np58\naS'ScenarioAlgoCMTFLP'\np59\naS'ScenarioAlgoLocalHeuristicMC'\np60\naS'ScenarioAlgoLocalHeuristicMC'\np61\naS'ScenarioAlgoLocalHeuristicMC'\np62\naS'ScenarioAlgoLocalHeuristicMC'\np63\naS'ScenarioAlgoLocalHeuristicM'\np64\naS'ScenarioAlgoLocalHeuristicM'\np65\naS'ScenarioAlgoLocalHeuristicMC'\np66\naS'exp'\np67\naS'r100'\np68\naNaS'r333'\np69\naNaNaS'r333'\np70\naS'r100'\np71\naS'linear'\np72\naS'exp'\np73\naS'r10'\np74\naS'linear'\np75\naS'r10'\np76\natp77\nba(lp78\ng9\n(g10\n(I0\ntp79\ng12\ntp80\nRp81\n((I1\n(I2\ntp82\ng19\nI00\n(lp83\ng23\nag24\natp84\n(Ntp85\ntp86\nbatp87\nbb."
df = pickle.loads(df)

# The unexpected behavior was caused by None values (which are treated as NaN), thanks jreback
df.fillna("default", inplace=True) # replaces None/NaN values

print "raw data: (", len(df), ")\n", df
print
print

df_grps1 = df[grp_cols].drop_duplicates()
df_grps2 = df.groupby(grp_cols)
df_grps3 = [grp for grp, _ in df.groupby(grp_cols)]

print "df_grps1 (#", len(df_grps1), "): \n", df_grps1
print
print "df_grps2 (#", len(df_grps2), "): "
for tpl in df_grps2.groups.keys():
    print tpl
print
print "df_grps3 (#", len(df_grps3), "): "
for tpl in df_grps3:
    print tpl

assert len(df_grps1) == len(df_grps2), "baad bug !!!"
assert len(df_grps2) == len(df_grps3), "baad bug !!!"
assert len(df_grps1) == len(df_grps3), "baad bug!!!"

print "passed without error"
jreback commented 11 years ago

you have None in your groups, which are dropped, see here

If you add df = df.fillna('foo') right after you unpickle, your script will work fine.

The way to 'solve' this problem is to fill the groups with a string, group, perform your operation, and then, if you really want a NaN in an index (which in general is allowed, but makes indexing almost impossible), set those strings back to NaN.
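A minimal sketch of that workaround (the frame, sentinel string, and column names below are placeholders for illustration, not taken from this issue's data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'algorithm': ['a', 'a', None],
                   'customalpha': ['exp', 'exp', 'r100'],
                   'score': [1.0, 2.0, 3.0]})

sentinel = 'default'   # temporary stand-in for missing group keys
filled = df.fillna({'algorithm': sentinel, 'customalpha': sentinel})

# group and aggregate on the filled frame; the former-NA rows now form groups
result = filled.groupby(['algorithm', 'customalpha'])['score'].mean()

# optionally put NaN back into the resulting index afterwards
result = result.rename(index=lambda v: np.nan if v == sentinel else v)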

mkeller-upb commented 11 years ago

That's good. So it's not a bug and everything is much easier. Two remaining points: a) I recommend adding a hint about this behavior to pandas.DataFrame.groupby.__doc__. b) Mention in all three documentation places (your tutorial link, DataFrame.groupby, fillna) that None, NaN, and NA are treated the same.

Thanks a lot for your quick answer, jreback.

jreback commented 11 years ago

ok...will convert this issue to a doc updating one then...thanks for the comments

springcoil commented 9 years ago

I'm adding something to this, just to bring it up the list. So what exactly has to be done: does there need to be a change to the docs themselves, or to the docstring as well?

jreback commented 9 years ago

I would add an example of how to work around this (like the above), here.

mroeschke commented 2 years ago

As described in https://github.com/pandas-dev/pandas/pull/47337#pullrequestreview-1005333109=, there is now dropna=False, which will keep the NA groups, so closing.
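A short illustration of that option (toy data for demonstration only):

import pandas as pd

df = pd.DataFrame({'algorithm': ['a', 'a', None],
                   'customalpha': ['exp', 'exp', 'r100'],
                   'score': [1.0, 2.0, 3.0]})

# since pandas 1.1, dropna=False keeps groups whose keys contain NA
print(df.groupby(['algorithm', 'customalpha'], dropna=False).size())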