pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Feature request: equivalent of R table or Stata tab #12597

Closed DonBeo closed 8 years ago

DonBeo commented 8 years ago

I think it would be useful to have a command similar to table in R or tab in Stata.

Given a vector v, table(v) should return the frequency of each value in v, and table(v1, v2) should return the cross-tabulation of v1 and v2.

TomAugspurger commented 8 years ago

We have pd.value_counts for the first one, and pd.crosstab for the second, which seems sufficient to me.
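For readers coming from R, a minimal sketch of the two calls mentioned above might look like the following. The DataFrame and its column names (v1, v2) are made up for illustration and do not come from this thread; the method form Series.value_counts is used in place of the top-level pd.value_counts.

```python
import pandas as pd

# Hypothetical toy data, invented for this example only.
df = pd.DataFrame({
    "v1": ["a", "b", "a", "a", "b"],
    "v2": ["x", "x", "y", "x", "y"],
})

# One-way frequency table, roughly R's table(v1).
freq = df["v1"].value_counts()

# Two-way cross-tabulation, roughly R's table(v1, v2).
xtab = pd.crosstab(df["v1"], df["v2"])

print(freq)
print(xtab)
```

Note that value_counts sorts by descending frequency by default, whereas R's table orders by value; chaining .sort_index() gives the R-style ordering.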

DonBeo commented 8 years ago

Thanks, I was not aware of these two functions. They are probably enough, then.

TomAugspurger commented 8 years ago

Oh, if you have an R background, we'd appreciate more documentation and examples here

jhconning commented 2 years ago

I would like to raise this request again and ask that the issue be reopened. The fact that pandas does not have a simple tabulation function with frequencies is one of the main barriers to adoption by new learners and a constant source of frustration for more advanced users.

Data analysis almost always requires simple exploratory frequency table tabulations. In a language such as Stata, it's absolutely trivial to get a simple useful frequency tabulation:

. tab spdlimit

Speed limit |      Freq.     Percent        Cum.
------------+-----------------------------------
         40 |          1        2.56        2.56
         45 |          3        7.69       10.26
         50 |          7       17.95       28.21
         55 |         15       38.46       66.67
         60 |         11       28.21       94.87
         65 |          1        2.56       97.44
         70 |          1        2.56      100.00
------------+-----------------------------------
      Total |         39      100.00

And it's similarly simple in R. But, as @chris1610 has pointed out, a minimal equivalent that produces comparable output in pandas would be this thicket of code:

pd.concat([df['spdlimit'].value_counts().rename('count'),
           df['spdlimit'].value_counts(normalize=True)
           .mul(100).rename('percentage')], axis=1)

    count  percentage
55     15     38.4615
60     11     28.2051
50      7     17.9487
45      3     7.69231
40      1      2.5641
65      1      2.5641
70      1      2.5641

(and still more code is needed to get the rows sorted by value and to add a cumulative column). @chris1610 has very usefully created the sidetable library to address some of this missing functionality: https://github.com/chris1610/sidetable#freq

It would be much better if this functionality were built into pandas. It seems easy to implement, and it would be immediately useful and popular.
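As a rough illustration of how little code the request involves, the Stata output above can be approximated with a small helper. This is only a sketch: the function name tab and the column labels are hypothetical, not part of pandas or sidetable, and the input series is rebuilt from the frequencies shown in the Stata table.

```python
import pandas as pd


def tab(series):
    """Hypothetical helper: Stata-style one-way frequency table."""
    # Count each distinct value and order rows by value, as Stata does.
    freq = series.value_counts().sort_index()
    percent = freq / freq.sum() * 100
    out = pd.DataFrame({
        "Freq.": freq,
        "Percent": percent,
        "Cum.": percent.cumsum(),  # running total of the percentages
    })
    return out.round(2)


# Rebuild the speed-limit column from the frequencies in the Stata output.
counts = {40: 1, 45: 3, 50: 7, 55: 15, 60: 11, 65: 1, 70: 1}
spdlimit = pd.Series(
    [value for value, n in counts.items() for _ in range(n)],
    name="spdlimit",
)

table = tab(spdlimit)
print(table)
```

The helper leaves out the Total row; one could be appended from freq.sum() if needed.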

jreback commented 2 years ago

@TomAugspurger's response is sufficient

these are also well documented

but if you want to add this to the intro for R users section, ok