Skimming when all values are NA

elinw commented 3 years ago

Recently I came across a situation where all of the values of some variables were NA. In this case skimr gives:

> df <- data.frame("x" = 1:10, "y" = NA   )
> skimr::skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             10    
Number of columns          2     
_______________________          
Column type frequency:           
  logical                  1     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: logical ───────────────────────────────────
  skim_variable n_missing complete_rate  mean count
1 y                    10             0   NaN ": " 

── Variable type: numeric ───────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0
1 x                     0             1   5.5  3.03     1
    p25   p50   p75  p100 hist 
1  3.25   5.5  7.75    10 ▇▇▇▇▇
> df <- data.frame("x" = 1:10, "y" = NA_integer_   )
> skimr::skim(df)
── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             10    
Number of columns          2     
_______________________          
Column type frequency:           
  numeric                  2     
________________________         
Group variables            None  

── Variable type: numeric ───────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0
1 x                     0             1   5.5  3.03     1
2 y                    10             0 NaN   NA       NA
    p25   p50   p75  p100 hist   
1  3.25   5.5  7.75    10 "▇▇▇▇▇"
2 NA     NA   NA       NA " "    
>

I think the base columns (n_missing, complete_rate) are okay, but we probably should not compute the other statistics. @michaelquinn32 thoughts?
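
For reference, the two outputs above differ because a bare NA is logical while NA_integer_ is an integer, so the y column is skimmed as a different type in each case. The base R sketch below shows this, and also where the NaN and NA statistics likely come from once every value has been dropped. It does not call skimr, and the variable name y is only illustrative.

```r
# A bare NA is a logical constant, so data.frame() builds a logical column.
class(NA)           # "logical"
class(NA_integer_)  # "integer"

# An all-NA integer column, as in the second example.
y <- rep(NA_integer_, 10)

# With the NAs removed there is nothing left to summarise.
mean(y, na.rm = TRUE)  # NaN
sd(y, na.rm = TRUE)    # NA
```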

elinw commented 3 years ago

I guess it could be that we push the count to 0 so it works like the _NANUMERIC case.

michaelquinn32 commented 3 years ago

I think the issue is primarily how we handle NAs in some of the summary stats that we include: count and hist. We could probably add some simple updates to check if all the data is NA, and if so, have them return NA_character_ too. How does that sound?
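
A minimal sketch of that check, assuming it is done by wrapping a character-valued summary function. Both na_if_all_missing() and count_summary() are hypothetical names used for illustration. They are not part of skimr, and count_summary() is only a stand-in for the real count statistic.

```r
# Hypothetical helper (not part of skimr): wrap a character-valued summary
# function so that it returns NA_character_ when every value is NA.
na_if_all_missing <- function(summary_fn) {
  function(x, ...) {
    if (all(is.na(x))) {
      return(NA_character_)
    }
    summary_fn(x, ...)
  }
}

# Illustrative stand-in for a count-style summary of a logical vector.
count_summary <- function(x) {
  tab <- table(x, useNA = "no")
  paste0(names(tab), ": ", tab, collapse = ", ")
}

safe_count <- na_if_all_missing(count_summary)

safe_count(c(TRUE, FALSE, TRUE))  # "FALSE: 1, TRUE: 2"
safe_count(c(NA, NA, NA))         # NA_character_
```

Wrapping keeps the all-NA check in one place, so the same guard could be applied to a histogram-style function as well. Whether skimr should do this inside each statistic or in a shared wrapper is left open here.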

ropensci / skimr

Skimming when all values are NA #666