ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

`n` column removed for numeric variables? #505

Closed IndrajeetPatil closed 4 years ago

IndrajeetPatil commented 5 years ago

I am updating my package (https://github.com/IndrajeetPatil/groupedstats/issues/16) to work with skimr v2, but I realized that one of the key columns (for my purposes) has been removed: the number of data points (n).

Is there any particular reason why skimr::skim doesn't contain numeric.n?

> tibble::as_tibble(skimr::skim_to_wide(purrr::keep(iris, is.numeric)))
# A tibble: 4 x 13
  type    variable     missing complete n     mean   sd    p0    p25   p50    p75   p100  hist    
  <chr>   <chr>        <chr>   <chr>    <chr> <chr>  <chr> <chr> <chr> <chr>  <chr> <chr> <chr>   
1 numeric Petal.Length 0       150      150   3.76   1.77  "1  " 1.6   4.35   5.1   6.9   ▇▁▁▂▅▅▃▁
2 numeric Petal.Width  0       150      150   "1.2 " 0.76  0.1   0.3   "1.3 " 1.8   2.5   ▇▁▁▅▃▃▂▂
3 numeric Sepal.Length 0       150      150   5.84   0.83  4.3   5.1   "5.8 " 6.4   7.9   ▂▇▅▇▆▅▂▂
4 numeric Sepal.Width  0       150      150   3.06   0.44  "2  " 2.8   "3   " 3.3   4.4   ▁▂▅▇▃▂▁▁
> tibble::as_tibble(skimr::skim(purrr::keep(iris, is.numeric)))
# A tibble: 4 x 12
  skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75
  <chr>     <chr>             <int>         <dbl>        <dbl>      <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1 numeric   Sepal.Length          0             1         5.84      0.828        4.3         5.1        5.8          6.4
2 numeric   Sepal.Width           0             1         3.06      0.436        2           2.8        3            3.3
3 numeric   Petal.Length          0             1         3.76      1.77         1           1.6        4.35         5.1
4 numeric   Petal.Width           0             1         1.20      0.762        0.1         0.3        1.3          1.8
# ... with 2 more variables: numeric.p100 <dbl>, numeric.hist <chr>

michaelquinn32 commented 5 years ago

Hi Indrajeet!

That column was intentionally removed, as it duplicated information that is provided in the summary above.

I'm adding some functions now for getting this attribute.

# How many rows are in the original data frame?
skim(iris) %>% data_rows()

Best wishes, Michael

elinw commented 4 years ago

Of course you can also always add it as a custom function if you want it in the columns.

vinhtantran commented 4 years ago

I searched the bug reports and found this issue. I think the 'n' column is still very useful when using skim() with group_by. There are cases where one wants to summarize a response variable across the levels of a categorical variable.

data_rows() is not helpful in this case because it only returns the number of rows in the whole data set, not the number of rows in each level.

elinw commented 4 years ago

I agree it can be useful, but that's why skimr is so flexible in letting you add more statistics (or remove the ones you don't want). We will never have 100% agreement among all users about what should or shouldn't be in the default list. That inflexibility is what makes summary() frustrating, and it led us to make skimr easy to modify.

vinhtantran commented 4 years ago

Thank you for your reply. I spent some time looking for a suitable function to pass to skim_with() for this purpose, until I realized that the function receives a single column as a vector (a numeric vector, in the case of numeric variables), so length() worked. It would help a great deal if the documentation of skim_with() stated the expected input type.
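A minimal sketch of that point, assuming skimr v2 is installed: each function passed through sfl() is called with one column as a plain vector, so any function that takes a vector and returns a single value can act as a skimmer. The names my_numeric_skim and n_non_missing are illustrative and not part of skimr.

```r
library(skimr)

# Hypothetical helper: receives one column as a vector, returns one value.
n_non_missing <- function(x) sum(!is.na(x))

# Append both statistics to the default numeric skimmers.
my_numeric_skim <- skim_with(
  numeric = sfl(n = length, n_non_missing = n_non_missing)
)

# Convert to a plain tibble to see the prefixed column names.
res <- tibble::as_tibble(my_numeric_skim(iris))
res[, c("skim_variable", "numeric.n", "numeric.n_non_missing")]
```

Because the skimmers are registered under numeric, the new columns are filled only for numeric variables and are NA for the factor column.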

michaelquinn32 commented 4 years ago

Hi!

You're looking for base_skimmer behavior: https://github.com/ropensci/skimr/blob/93aa2dcb0b47559fc8fb0d3575c71609d3ca763b/R/skim_with.R#L64

I'll provide another example in the docs.

elinw commented 4 years ago

Wouldn't the n() or count() function work?

vinhtantran commented 4 years ago

It doesn't work with skim_with though. I can call it separately and then join with the result of skim but it's more convenient to have them at once.
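The "call it separately and then join" workaround could look roughly like the sketch below. It assumes dplyr and skimr v2 are installed; group_n and joined are illustrative names. The skim result is converted to a plain tibble before joining so that the join does not depend on how the skim_df class handles dplyr verbs.

```r
library(dplyr)
library(skimr)

# Per-group row counts, computed outside skimr.
group_n <- count(iris, Species, name = "n")

# Skim by group, drop to a plain tibble, then attach the counts.
joined <- iris %>%
  group_by(Species) %>%
  skim() %>%
  tibble::as_tibble() %>%
  left_join(group_n, by = "Species")
```

Note that count() reports rows per group, not non-missing values per variable, so the n column is identical for every variable within a group.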

elinw commented 4 years ago

Can you share the code you used to create the new skimmer?

vinhtantran commented 4 years ago

my_skim <- skim_with(base = sfl(n = length))

michaelquinn32 commented 4 years ago

Hi @vinhtantran!

I'm having trouble reproducing the issue. Can you provide a reprex?

> my_skim <- skim_with(base = sfl(n = length))
> my_skim(iris)
── Data Summary ────────────────────────
                           Values
Name                       iris  
Number of rows             150   
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: factor ───────────────────────────────────────────────────────
# A tibble: 1 x 5
  skim_variable     n ordered n_unique top_counts               
* <chr>         <int> <lgl>      <int> <chr>                    
1 Species         150 FALSE          3 set: 50, ver: 50, vir: 50

── Variable type: numeric ──────────────────────────────────────────────────────
# A tibble: 4 x 10
  skim_variable     n  mean    sd    p0   p25   p50   p75  p100 hist 
* <chr>         <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Sepal.Length    150  5.84 0.828   4.3   5.1  5.8    6.4   7.9 ▆▇▇▅▂
2 Sepal.Width     150  3.06 0.436   2     2.8  3      3.3   4.4 ▁▆▇▂▁
3 Petal.Length    150  3.76 1.77    1     1.6  4.35   5.1   6.9 ▇▁▆▇▂
4 Petal.Width     150  1.20 0.762   0.1   0.3  1.3    1.8   2.5 ▇▁▇▅▃

vinhtantran commented 4 years ago

As I commented above, the sample size is useful when a skimmer is applied to grouped data. The following code shows how I would use the my_skim function defined above.

> iris %>% group_by(Species) %>% my_skim()
-- Data Summary ------------------------
                           Values    
Name                       Piped data
Number of rows             150       
Number of columns          5         
_______________________              
Column type frequency:               
  numeric                  4         
________________________             
Group variables            Species   

-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 12 x 10
   skim_variable Species        n  mean    sd    p0   p25   p50   p75  p100
 * <chr>         <fct>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Sepal.Length  setosa        50 5.01  0.352   4.3  4.8   5     5.2    5.8
 2 Sepal.Length  versicolor    50 5.94  0.516   4.9  5.6   5.9   6.3    7  
 3 Sepal.Length  virginica     50 6.59  0.636   4.9  6.22  6.5   6.9    7.9
 4 Sepal.Width   setosa        50 3.43  0.379   2.3  3.2   3.4   3.68   4.4
 5 Sepal.Width   versicolor    50 2.77  0.314   2    2.52  2.8   3      3.4
 6 Sepal.Width   virginica     50 2.97  0.322   2.2  2.8   3     3.18   3.8
 7 Petal.Length  setosa        50 1.46  0.174   1    1.4   1.5   1.58   1.9
 8 Petal.Length  versicolor    50 4.26  0.470   3    4     4.35  4.6    5.1
 9 Petal.Length  virginica     50 5.55  0.552   4.5  5.1   5.55  5.88   6.9
10 Petal.Width   setosa        50 0.246 0.105   0.1  0.2   0.2   0.3    0.6
11 Petal.Width   versicolor    50 1.33  0.198   1    1.2   1.3   1.5    1.8
12 Petal.Width   virginica     50 2.03  0.275   1.4  1.8   2     2.3    2.5

Or, if the response of interest is Sepal.Length, the following code gives an idea of what Sepal.Length looks like in each Species.

> iris %>% group_by(Species) %>% my_skim(Sepal.Length)
-- Data Summary ------------------------
                           Values    
Name                       Piped data
Number of rows             150       
Number of columns          5         
_______________________              
Column type frequency:               
  numeric                  1         
________________________             
Group variables            Species   

-- Variable type: numeric --------------------------------------------------------------------
# A tibble: 3 x 10
  skim_variable Species        n  mean    sd    p0   p25   p50   p75  p100
* <chr>         <fct>      <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sepal.Length  setosa        50  5.01 0.352   4.3  4.8    5     5.2   5.8
2 Sepal.Length  versicolor    50  5.94 0.516   4.9  5.6    5.9   6.3   7  
3 Sepal.Length  virginica     50  6.59 0.636   4.9  6.22   6.5   6.9   7.9

elinw commented 4 years ago

No, I was wrong. But right now,

my_skim <- skim_with(base = sfl(n = length))
iris %>% group_by(Species) %>% my_skim()

seems to work. That applies to all variables, though, not just numeric ones. If you want it only for numeric variables, you'd change my_skim.
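One way to restrict the count to numeric variables, sketched under the assumption that skimr v2 and dplyr are installed, is to register length under numeric instead of base. The name numeric_n_skim is illustrative. As the printed output earlier in this thread shows, base = sfl(n = length) replaces the default n_missing and complete_rate columns, whereas the numeric argument adds to the numeric defaults and leaves the base columns alone.

```r
library(dplyr)
library(skimr)

# Adds n to numeric columns only; base skimmers stay at their defaults.
numeric_n_skim <- skim_with(numeric = sfl(n = length))

# One row per variable and group, as a plain tibble.
by_species <- iris %>%
  group_by(Species) %>%
  numeric_n_skim() %>%
  tibble::as_tibble()
```

In the plain tibble the count appears as numeric.n, alongside n_missing and complete_rate.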

michaelquinn32 commented 4 years ago

I polled some skimr users on Twitter, and the results suggest that we shouldn't switch the defaults. https://twitter.com/michaelquinn32/status/1230317868811612161?s=20

Otherwise, previously-listed solutions should give you a my_skim() function that gets the results you want. Thanks for all of the feedback!