gvelasq / tidytab

Create tidyverse-friendly tables of frequencies
https://gvelasq.github.io/tidytab/

cumulative percentage > 100% #21

Open zhizhongpu opened 5 months ago

zhizhongpu commented 5 months ago

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.

             city │      Freq.     Percent        Cum.
──────────────────┼───────────────────────────────────
            ancud │          3         0.2         0.2
      antofagasta │         32         1.8         2.0
           arauco │         10         0.6         2.6
            arica │         14         0.8         3.4
            aysen │          2         0.1         3.5
             buin │          1         0.1         3.6
          calbuco │          2         0.1         3.7
          carahue │          3         0.2         3.9
       casablanca │          1         0.1         4.0
           castro │          1         0.1         4.1
      cerro navia │          2         0.1         4.2
      chimbarongo │          3         0.2         4.4
       cobquecura │          1         0.1         4.5
           colbun │          1         0.1         4.6
         colchane │          1         0.1         4.7
       collipulli │          9         0.5         5.2
           concon │          1         0.1         5.3
        coyhaique │          9         0.5         5.8
            cunco │          5         0.3         6.1
       curacautin │          1         0.1         6.2
      curanilahue │          4         0.2         6.4
       curarrehue │          1         0.1         6.5
         dalcahue │          2         0.1         6.6
        el bosque │          9         0.5         7.1
        el quisco │          1         0.1         7.2
          ercilla │          4         0.2         7.4
          florida │         15         0.8         8.2
          futrono │          4         0.2         8.4
        galvarino │          9         0.5         8.9
          higgins │         23         1.3        10.2
           huasco │          1         0.1        10.3
       huechuraba │          4         0.2        10.5
         la ligua │          1         0.1        10.6
         la union │          2         0.1        10.7
       las condes │          2         0.1        10.8
          lautaro │         10         0.6        11.4
             lebu │          6         0.3        11.7
          limache │          3         0.2        11.9
       llanquihue │          1         0.1        12.0
        lo espejo │          1         0.1        12.1
         loncoche │          1         0.1        12.2
        los andes │          1         0.1        12.3
        los lagos │          5         0.3        12.6
        los vilos │          1         0.1        12.7
             lota │          2         0.1        12.8
            macul │          2         0.1        12.9
        marchihue │          1         0.1        13.0
            maule │          3         0.2        13.2
       mejillones │          1         0.1        13.3
        melipilla │          1         0.1        13.4
         mostazal │          1         0.1        13.5
         nancagua │          1         0.1        13.6
   nueva imperial │          1         0.1        13.7
            nunoa │          5         0.3        14.0
         paillaco │          1         0.1        14.1
      panguipulli │          1         0.1        14.2
        penalolen │          1         0.1        14.3
          petorca │          3         0.2        14.5
       pichidegua │          1         0.1        14.6
        pichilemu │          3         0.2        14.8
     pozo almonte │          2         0.1        14.9
      providencia │          7         0.4        15.3
         pudahuel │          5         0.3        15.6
      puente alto │          7         0.4        16.0
     puerto octay │          1         0.1        16.1
     puerto varas │          1         0.1        16.2
        purranque │          2         0.1        16.3
          quellon │          1         0.1        16.4
        quilicura │          5         0.3        16.7
            renca │          1         0.1        16.8
        rio bueno │          3         0.2        17.0
        rio negro │          6         0.3        17.3
      san antonio │          3         0.2        17.5
       san carlos │          2         0.1        17.6
     san gregorio │          1         0.1        17.7
       san javier │          1         0.1        17.8
      san joaquin │          2         0.1        17.9
       santa cruz │          1         0.1        18.0
         santiago │       1488        82.0       100.0
     sierra gorda │          1         0.1       100.1
           taltal │          1         0.1       100.2
  teodoro schmidt │          1         0.1       100.3
           tiltil │          2         0.1       100.4
            tirua │          1         0.1       100.5
        tocopilla │          2         0.1       100.6
             tome │          1         0.1       100.7
          tucapel │          3         0.2       100.9
         valdivia │          6         0.3       101.2
         vallenar │          1         0.1       101.3
         victoria │          2         0.1       101.4
           vilcun │          1         0.1       101.5
       villarrica │          3         0.2       101.7
         vitacura │          4         0.2       101.9
           yumbel │          1         0.1       102.0
           yungay │          4         0.2       100.0
──────────────────┼───────────────────────────────────
            Total │       1814       100.0

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex. For more advice on how to write a great issue, see https://code-review.tidyverse.org/issues/.

I'm using a proprietary dataset, but I think the general problem holds whenever there are a large number of categories, each making up a relatively small percentage of the total.

gvelasq commented 5 months ago

@zhizhongpu, thank you for reporting this! I have confirmed it below. Would you like to propose a solution in a pull request?

library(tidytab)
library(tidyverse)
enframe(1:1010) %>% tab(value)
#>       value │      Freq.     Percent        Cum. 
#> ────────────┼───────────────────────────────────
#>           1 │          1         0.1         0.1 
#>           2 │          1         0.1         0.2 
#>           3 │          1         0.1         0.3 
#> (redacted)
#>         997 │          1         0.1        99.7 
#>         998 │          1         0.1        99.8 
#>         999 │          1         0.1        99.9 
#>        1000 │          1         0.1       100.0 
#>        1001 │          1         0.1       100.1 
#>        1002 │          1         0.1       100.2 
#>        1003 │          1         0.1       100.3 
#>        1004 │          1         0.1       100.4 
#>        1005 │          1         0.1       100.5 
#>        1006 │          1         0.1       100.6 
#>        1007 │          1         0.1       100.7 
#>        1008 │          1         0.1       100.8 
#>        1009 │          1         0.1       100.9 
#>        1010 │          1         0.1       100.0 
#> ────────────┼───────────────────────────────────
#>       Total │       1010       100.0           

Created on 2024-04-16 with reprex v2.1.0

For reference, this is how Stata outputs the same:

. tab value

      value |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        0.10        0.10
          2 |          1        0.10        0.20
          3 |          1        0.10        0.30
* (redacted)
        997 |          1        0.10       98.71
        998 |          1        0.10       98.81
        999 |          1        0.10       98.91
       1000 |          1        0.10       99.01
       1001 |          1        0.10       99.11
       1002 |          1        0.10       99.21
       1003 |          1        0.10       99.31
       1004 |          1        0.10       99.41
       1005 |          1        0.10       99.50
       1006 |          1        0.10       99.60
       1007 |          1        0.10       99.70
       1008 |          1        0.10       99.80
       1009 |          1        0.10       99.90
       1010 |          1        0.10      100.00
------------+-----------------------------------
      Total |      1,010      100.00

My preference would be to keep a single decimal place for the Percent and Cum. columns.
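
For what it's worth, the overshoot is consistent with cumulating percentages that have already been rounded to one decimal place (my reading of the symptom, not a statement about the package internals): each of the 1,010 equal shares is about 0.099%, which displays as 0.1, and 0.1 × 1010 = 101.

p <- rep(100 / 1010, 1010)   # true shares, each about 0.0990 percent
sum(p)                       # about 100: unrounded shares sum to 100
sum(round(p, 1))             # about 101: each share rounds up to 0.1 first
max(cumsum(round(p, 1)))     # about 101: so cumulating rounded shares overshoots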

askawron commented 5 months ago

I think the problem comes from the ftab() function in tab.R:

# tab.R line 139
x <- dplyr::mutate(
  x,
  Percent = formatC(.data[["Freq."]] / sum(.data[["Freq."]]) * 100, digits = 1L, format = "f"),
  Cum. = formatC(cumsum(.data[["Percent"]]), digits = 1L, format = "f")
)

Created on 2024-04-17 with reprex v2.1.0

The problem comes from the digits = 1L arguments, which round the percentages as they are computed. This becomes an issue when the input table is large, because the rounded percentages are imprecise (or close to zero) and the rounding errors accumulate in the cumulative sum. I would suggest removing the digits = 1L arguments on this line, carrying the unrounded values through the computation, and rounding the final numbers only at the end for display.
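
A minimal sketch of that approach, assuming a data frame that already has the package's Freq. column (format_tab below is a hypothetical helper for illustration, not the actual ftab code):

library(dplyr)
library(tibble)

# Hypothetical helper, not tidytab's ftab(): compute the percentage and the
# cumulative sum on unrounded values, and format to one decimal place only
# as the final display step.
format_tab <- function(x) {
  x %>%
    mutate(
      pct = .data[["Freq."]] / sum(.data[["Freq."]]) * 100,  # unrounded shares
      cum = cumsum(pct)                                       # cumulate before any rounding
    ) %>%
    mutate(
      Percent = formatC(pct, digits = 1L, format = "f"),      # one decimal, display only
      Cum. = formatC(cum, digits = 1L, format = "f")
    ) %>%
    select(-pct, -cum)
}

# With 1010 equal categories the last Cum. value is "100.0" instead of drifting past it
enframe(1:1010) %>% count(value, name = "Freq.") %>% format_tab()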