ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Summarize character vectors like factors #743

Closed. jxu closed this issue 4 months ago

jxu commented 4 months ago

readr's read_csv reads all strings as character vectors and has no stringsAsFactors switch. This is fine, but when using skimr I've found that treating the strings as factors almost always gives more useful results: for character columns, min/max are string lengths, which aren't useful for categorical levels, while top_counts is much more handy. Maybe there should be a strings_like_factors switch that reports top counts instead of min/max length and counts of empty/whitespace strings. Otherwise, the current dplyr way to convert all character columns to factors is something like df %>% mutate(across(where(is.character), factor)).
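For reference, the conversion workaround mentioned above can be sketched as follows. This is a minimal illustration, not part of skimr itself: the data frame df and its columns are made up here to stand in for whatever read_csv returns.

```r
library(dplyr)
library(skimr)

# Hypothetical example data standing in for the result of readr::read_csv();
# the column names and values are invented for illustration.
df <- tibble(
  affiliation = c("Dem", "Dem", "Rep", "Rep", "Ind", "Lib"),
  age = c(34, 51, 47, 29, 62, 40)
)

# Convert every character column to a factor, leaving other columns untouched.
df_fct <- df %>% mutate(across(where(is.character), factor))

# skim() now applies its factor summaries to affiliation
# instead of the character summaries.
skim(df_fct)
```

The drawback is that the conversion changes the data itself just to get a different summary, which is what the proposed switch would avoid.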

michaelquinn32 commented 4 months ago

Thanks for the suggestion!

We should be able to support this through custom skim functions. Here's an example:

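library(skimr)

# Build a skim function that summarizes character columns with the factor
# summary functions; append = FALSE replaces the default character summaries.
my_skim <- skim_with(character = get_sfl('factor'), append = FALSE)

data.frame(affiliations = c("Dem", "Dem", "Rep", "Rep", "Ind", "Lib")) |>
  my_skim() |>
  print()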

[Screenshot of the printed output]

jxu commented 4 months ago

Nice solution. I think it is a suitable option.