ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Skimming when all values are NA #666

Open elinw opened 3 years ago

elinw commented 3 years ago

Recently I came across a situation where all of the values of some variables were classed as NA. In this case skimr produces the following:

> df <- data.frame("x" = 1:10, "y" = NA   )
> skimr::skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             10    
Number of columns          2     
_______________________          
Column type frequency:           
  logical                  1     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: logical ───────────────────────────────────
  skim_variable n_missing complete_rate  mean count
1 y                    10             0   NaN ": " 

── Variable type: numeric ───────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0
1 x                     0             1   5.5  3.03     1
    p25   p50   p75  p100 hist 
1  3.25   5.5  7.75    10 ▇▇▇▇▇
> df <- data.frame("x" = 1:10, "y" = NA_integer_   )
> skimr::skim(df)
── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             10    
Number of columns          2     
_______________________          
Column type frequency:           
  numeric                  2     
________________________         
Group variables            None  

── Variable type: numeric ───────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0
1 x                     0             1   5.5  3.03     1
2 y                    10             0 NaN   NA       NA
    p25   p50   p75  p100 hist   
1  3.25   5.5  7.75    10 "▇▇▇▇▇"
2 NA     NA   NA       NA " "    
> 

I think the base columns are okay (n_missing, complete_rate), but we probably should not compute the other statistics. @michaelquinn32 thoughts?
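For context, the NaN and NA entries in the tables above match what base R returns for an all-NA vector once the missing values are dropped. The check below is a minimal sketch using only base R, not skimr's internals; the names `y_lgl` and `y_int` are illustrative.

```r
# All-NA columns like the ones in the two examples above.
y_lgl <- rep(NA, 10)           # logical NA
y_int <- rep(NA_integer_, 10)  # integer NA

# Dropping the NAs leaves an empty vector, so the mean is 0/0.
mean_lgl <- mean(y_lgl, na.rm = TRUE)
mean_int <- mean(y_int, na.rm = TRUE)

# sd() and quantile() on an empty vector give NA rather than NaN.
sd_int <- sd(y_int, na.rm = TRUE)
q_int <- quantile(y_int, probs = c(0, 0.5, 1), na.rm = TRUE, names = FALSE)

print(c(mean_lgl = mean_lgl, mean_int = mean_int, sd_int = sd_int))
print(q_int)
```

So the numeric columns degrade to NaN or NA on their own. The character-valued summaries (count, hist) are the ones that come back as strings like ": " or " " instead.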

elinw commented 3 years ago

I guess it could be that we push the count to 0 so it works like the numeric NA case (the NA_integer_ example above).

michaelquinn32 commented 3 years ago

I think the issue is primarily how we handle NAs in some of the summary stats that we include: count and hist. We could probably add some simple updates to check whether all the data is NA, and if so, have them return NA_character_ too. How does that sound?
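The all-NA check suggested here could look roughly like the sketch below. It is base R only and makes assumptions: `na_if_all_missing()` and `toy_count()` are hypothetical names, not skimr functions, and `toy_count()` is a stand-in for a character-valued summary rather than skimr's actual count.

```r
# Hypothetical helper (not part of skimr): wrap a summary function so that
# it returns `na_value` when every element of the input is NA.
na_if_all_missing <- function(f, na_value = NA_character_) {
  function(x, ...) {
    if (length(x) > 0 && all(is.na(x))) {
      return(na_value)
    }
    f(x, ...)
  }
}

# Stand-in for a character-valued summary such as `count`; it is only an
# illustration and does not reproduce skimr's own formatting.
toy_count <- function(x) {
  tab <- table(x)
  paste0(names(tab), ": ", as.integer(tab), collapse = ", ")
}

safe_count <- na_if_all_missing(toy_count)

print(safe_count(c(TRUE, FALSE, TRUE)))
print(safe_count(rep(NA, 10)))
```

Wrapping the summary function, rather than editing each one, keeps the check in one place. Whether that fits how skimr registers its summary functions would need checking against the package itself.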