pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Combined value_counts() #16851

Closed drorata closed 7 years ago

drorata commented 7 years ago

Rather often, especially when exploring data or creating reports, one is interested in both `pd.Series.value_counts()` and `pd.Series.value_counts(normalize=True)`. I prepared a branch which implements a wrapper that takes care of this (link)

TomAugspurger commented 7 years ago

This seems like a good fit for the cookbook, rather than a new method. Would you be interested in submitting a pull request to update https://github.com/pandas-dev/pandas/blob/master/doc/source/cookbook.rst with an example?

drorata commented 7 years ago

Personally, I found myself rewriting this function over and over again. Putting it in the cookbook won't solve this problem. I can understand why this is not a good fit for the core of pandas, but I'm wondering whether one could find a better fit. For example, a pandas-utils library?

TomAugspurger commented 7 years ago

I'm not aware of any pandas-utils-like libraries, perhaps you could start one?

And to be clear, others in the community may disagree with me, and prefer that it is included in pandas. Let's wait and hear feedback from other maintainers and users.

mansenfranzen commented 7 years ago

From a user's perspective, I think it would be most convenient to simply add a parameter to the existing `value_counts` method, like `comb=True`, or extend the existing `normalize` parameter to take an additional argument, like `normalize="comb"`.

drorata commented 7 years ago

This was my first idea and attempt as well. But I didn't want to change the function too much, even though the tests passed. Therefore I thought having a wrapper was better. If the maintainers find this approach good, I'm ready to update the PR.

jreback commented 7 years ago
In [5]: s = Series([1, 1, 2, 3])

In [6]: s
Out[6]: 
0    1
1    1
2    2
3    3
dtype: int64

In [7]: pd.concat([s.value_counts(), s.value_counts(normalize=True)], keys=['counts', 'normalized_counts'], axis=1)
Out[7]: 
   counts  normalized_counts
1       2               0.50
3       1               0.25
2       1               0.25

so you are looking for something like

s.value_counts(normalize='both') ?

How would you name these outputs?

drorata commented 7 years ago

My suggestions can be found here. As I mentioned above, I also implemented a version which adds another possible value for `normalize`, namely `both`. However, I eventually discarded that approach, as it changes the API of a core function.

I think, more generally, this option opens a new track for pandas, namely functions which output reports and not necessarily objects that can be re-used/piped/etc. Another idea I have is to add an option to compute the ratio of the size of groups with respect to the whole. If you're interested, I can open another issue dedicated to that proposal.
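For reference, the group-ratio idea can already be sketched with the public pandas API (the column and key names below are just illustrative, not a proposed interface):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "c"],
                   "x": [1, 2, 3, 4]})

# Size of each group relative to the whole frame:
sizes = df.groupby("group").size()
ratios = sizes / len(df)
print(ratios)  # a -> 0.50, b -> 0.25, c -> 0.25
```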

drorata commented 7 years ago

Is there any decision on this one? Would someone want to continue the discussion?

gfyoung commented 7 years ago

@drorata : We generally can make better informed decisions in this case when there's a PR submitted for review. I think that would be the best way to jump-start the discussion.

TomAugspurger commented 7 years ago

There's already a branch with the suggested implementation in the original post. I think the issue under discussion is whether or not it'd be a welcome addition to the API.

I'm still -1, but willing to be persuaded. @gfyoung do you have an opinion?

drorata commented 7 years ago

@TomAugspurger What is your take on having a change rather than an addition to the API? For example, allowing the value `'both'` for the keyword `normalize`?

gfyoung commented 7 years ago

> @gfyoung do you have an opinion?

Honestly, I generally just call the method twice in cases like this :smile: , which I haven't found to be particularly inconvenient. Thus, I am indifferent.

drorata commented 7 years ago

@gfyoung and you probably wrap it in a pd.DataFrame ;)

gfyoung commented 7 years ago

Not necessarily. It depends on the situation actually.

drorata commented 7 years ago

This is really a minor thing. The only point I have is that if @gfyoung and I did this more than once, it means this is a reasonable use case. Thus, it would be nice to have it implemented.

jreback commented 7 years ago

As indicated above, as a cookbook recipe this would be fine.

drorata commented 7 years ago

I guess it is not really worth it. For future reference, I will leave here the function I wrote:

def value_counts_comb(self, sort=True, ascending=False,
                      bins=None, dropna=True):
    """
    A wrapper of value_counts which returns both the counts and the
    normalized view.

    The resulting object will be in descending order so that the
    first element is the most frequently-occurring element.
    Excludes NA values by default.

    Parameters
    ----------
    sort : boolean, default True
        Sort by values
    ascending : boolean, default False
        Sort in ascending order
    bins : integer, optional
        Rather than count values, group them into half-open bins,
        a convenience for pd.cut, only works with numeric data
    dropna : boolean, default True
        Don't include counts of NaN.

    Returns
    -------
    counts : DataFrame
    """
    from pandas.core.algorithms import value_counts
    from pandas.core.reshape.concat import concat
    res_norm = value_counts(self, sort=sort, ascending=ascending,
                            normalize=True, bins=bins, dropna=dropna)
    res_regu = value_counts(self, sort=sort, ascending=ascending,
                            normalize=False, bins=bins, dropna=dropna)
    result = concat([res_norm, res_regu], axis=1, keys=['Ratio', 'Count'])
    return result
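Note that the wrapper above reaches into pandas internals (`pandas.core.algorithms`, `pandas.core.reshape.concat`), which can move between versions. An equivalent sketch built only from the public API, usable as a plain function on a Series, might look like this (the function name and column labels mirror the wrapper above but are otherwise arbitrary):

```python
import pandas as pd


def value_counts_comb(s, sort=True, ascending=False, bins=None, dropna=True):
    """Return counts and relative frequencies side by side (sketch)."""
    kwargs = dict(sort=sort, ascending=ascending, bins=bins, dropna=dropna)
    # Two passes over the public method, stitched together with concat.
    ratios = s.value_counts(normalize=True, **kwargs)
    counts = s.value_counts(normalize=False, **kwargs)
    return pd.concat([ratios, counts], axis=1, keys=["Ratio", "Count"])


s = pd.Series([1, 1, 2, 3])
print(value_counts_comb(s))
```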