Closed: drorata closed this 7 years ago
This seems like a good fit for the cookbook, rather than a new method. Would you be interested in submitting a pull request to update https://github.com/pandas-dev/pandas/blob/master/doc/source/cookbook.rst with an example?
Personally, I found myself rewriting this function over and over again. Putting it in the cookbook won't solve this problem. I can understand why this is not a good fit for the core of pandas, but I'm wondering whether one could find a better fit. For example, a pandas-utils library?
I'm not aware of any pandas-utils-like libraries; perhaps you could start one?
And to be clear, others in the community may disagree with me, and prefer that it is included in pandas. Let's wait and hear feedback from other maintainers and users.
Judging from a user's perspective, I think it would be most convenient to simply add a parameter to the existing value_counts method, like comb=True, or to extend the existing normalize parameter to take an additional value like normalize="comb".
This was my first idea and attempt as well. But I didn't want to change the function too much, even though the tests passed. Therefore I thought having a wrapper would be better. If the maintainers find this approach good, I'm ready to update the PR.
In [5]: s = Series([1, 1, 2, 3])

In [6]: s
Out[6]:
0    1
1    1
2    2
3    3
dtype: int64

In [7]: pd.concat([s.value_counts(), s.value_counts(normalize=True)], keys=['counts', 'normalized_counts'], axis=1)
Out[7]:
   counts  normalized_counts
1       2               0.50
3       1               0.25
2       1               0.25
so you are looking for something like s.value_counts(normalized='both')?
How would you name these outputs?
My suggestions can be found here. As I mentioned above, I also implemented a version which adds another possible value for normalize, namely both. However, I eventually discarded that approach as it changes the API of a core function.
I think that, more generally, this option opens a new track for pandas, namely functions which output reports, and not necessarily objects that can be re-used/piped/etc. Another idea I have is to add an option to compute the ratio of the size of each group with respect to the whole. If you're interested, I can open another issue dedicated to that proposal.
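For what it's worth, the group-size-ratio idea sketched above can already be approximated with the existing groupby API. This is just an illustration of the concept; the column and variable names are made up for the example:

```python
import pandas as pd

# Illustrative data; 'group' and 'value' are hypothetical column names.
df = pd.DataFrame({'group': ['a', 'a', 'b', 'c'],
                   'value': [10, 20, 30, 40]})

# Size of each group, then its share of the total number of rows.
sizes = df.groupby('group').size()
ratios = sizes / sizes.sum()  # e.g. group 'a' holds 2 of 4 rows -> 0.5
```

The proposal in the comment above would presumably wrap this pattern behind a single option, the same way value_counts(normalize=True) wraps a count-then-divide step.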
Is there any decision on this one? Would someone want to continue the discussion?
@drorata : We generally can make better informed decisions in this case when there's a PR submitted for review. I think that would be the best way to jump-start the discussion.
There's already a branch with the suggested implementation in the original post. I think the issue under discussion is whether or not it'd be a welcome addition to the API.
I'm still -1, but willing to be persuaded. @gfyoung do you have an opinion?
@TomAugspurger What is your take on having a change rather than an addition to the API? For example, allowing a 'both' value for the keyword normalize?
@gfyoung do you have an opinion?
Honestly, I generally just call the method twice in cases like this :smile: , which I haven't found to be particularly inconvenient. Thus, I am indifferent.
@gfyoung and you probably wrap it in a pd.DataFrame ;)
Not necessarily. It depends on the situation actually.
This is really a minor thing. The only point I would make is that if @gfyoung and I have both done this more than once, it is a reasonable use case. Thus, it would be nice to have it implemented.
As indicated above, this would be fine as a cookbook recipe.
I guess it is not really worth it. For future reference, I will leave the function I wrote here:
def value_counts_comb(self, sort=True, ascending=False,
                      bins=None, dropna=True):
    """
    A wrapper around value_counts which returns both the counts and the
    normalized view.

    The resulting object will be in descending order so that the
    first element is the most frequently-occurring element.
    Excludes NA values by default.

    Parameters
    ----------
    sort : boolean, default True
        Sort by values
    ascending : boolean, default False
        Sort in ascending order
    bins : integer, optional
        Rather than count values, group them into half-open bins,
        a convenience for pd.cut; only works with numeric data
    dropna : boolean, default True
        Don't include counts of NaN.

    Returns
    -------
    counts : DataFrame
    """
    from pandas.core.algorithms import value_counts
    from pandas.core.reshape.concat import concat

    res_norm = value_counts(self, sort=sort, ascending=ascending,
                            normalize=True, bins=bins, dropna=dropna)
    res_regu = value_counts(self, sort=sort, ascending=ascending,
                            normalize=False, bins=bins, dropna=dropna)
    result = concat([res_norm, res_regu], axis=1, keys=['Ratio', 'Count'])
    return result
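The function above reaches into pandas internals (pandas.core.algorithms, pandas.core.reshape.concat), which are not a stable API. A standalone equivalent can be written against the public API alone; this is a sketch, and the name value_counts_both is made up here, not part of pandas:

```python
import pandas as pd

def value_counts_both(s, **kwargs):
    """Return a DataFrame with both the normalized and the raw counts.

    ``s`` is a pandas Series; extra keyword arguments (sort, ascending,
    bins, dropna) are forwarded to Series.value_counts.
    """
    return pd.concat(
        [s.value_counts(normalize=True, **kwargs),
         s.value_counts(normalize=False, **kwargs)],
        axis=1, keys=['Ratio', 'Count'],
    )

# Example: the value 1 appears twice out of four elements.
result = value_counts_both(pd.Series([1, 1, 2, 3]))
```

Passing keys to pd.concat labels the resulting columns, so the two otherwise identically-derived Series come back as distinct 'Ratio' and 'Count' columns.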
Rather often, especially when exploring data or creating reports, one is interested in both `pd.Series.value_counts()` and `pd.Series.value_counts(normalize=True)`. I prepared a branch which implements a wrapper that takes care of this (link)