ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Support configurable summary() #624

Open vorpalvorpal opened 3 years ago

vorpalvorpal commented 3 years ago

Because I am fairly incompetent, I seem to keep introducing duplicate rows into my data frames. I was wondering if, in the initial data summary bit of the output of skim(), "duplicate rows" might be a useful additional metric.

elinw commented 3 years ago

Skimr is pretty column-oriented and you're asking for something row-oriented. That said, I think that sum(duplicated(x)) would give that number. Of course, in many data sets it is expected that there will be repeats.
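
A minimal base-R sketch of that count, using a made-up data frame `x`. Note that duplicated() flags only the second and later copies of a row, so the first occurrence is not counted as a duplicate.

```r
# Toy data frame (hypothetical) in which rows 1, 3, and 5 are identical.
x <- data.frame(
  id = c(1, 2, 1, 3, 1),
  grp = c("a", "b", "a", "c", "a")
)

# duplicated() marks each row that repeats an earlier row.
dup_flags <- duplicated(x)

# Number of rows that are repeats of an earlier row.
n_duplicate_rows <- sum(dup_flags)

# Number of distinct rows, for comparison.
n_distinct_rows <- nrow(unique(x))
```

Here the count of repeats plus the count of distinct rows adds up to nrow(x), which is a quick sanity check on real data.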

michaelquinn32 commented 3 years ago

I think we can go a bit further. The most useful place for this would be in the summary, i.e. https://github.com/ropensci/skimr/blob/22dfec233021f0aba38c9f0bfc5cff62a946a3f9/R/summary.R#L12-L14

I think the implementation depends on how far we should push this.

elinw commented 3 years ago

I was thinking the same thing, i.e. should we make it customizable because this might be the first of many requests to add things. I do think that for our user scenario of "someone gives you a data set and you're trying to understand it" it might be very useful. If there are a lot of duplicates it might be smart to store it in a way that reflects that.

elinw commented 2 years ago

@michaelquinn32 if we are fixing issues on summary we could think about this one.

michaelquinn32 commented 2 years ago

This is a little more than the current updates to the summary(), since we'll need to modify the skim object to store this information. I can get to it soon.

elinw commented 2 years ago

What I was thinking is that eventually, when we have a more flexible summary, that would really allow a user to do this themselves.

michaelquinn32 commented 2 years ago

Could put this on the roadmap too.

Right now, the issue is that we generate all of the summary components as skimr attributes, which we then extract in the summary function.
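
To illustrate why that matters, here is a rough base-R sketch of the pattern, not skimr's actual code. The function names and attribute names (`n_rows`, `n_cols`) are made up for the example. The point is that the summary step can only read back what was stored when the result was built, so a duplicate-row count would have to be computed and stored at that earlier stage.

```r
# Hypothetical stand-in for a skim result: a data frame that carries
# summary components as attributes. The attribute names are invented.
make_result <- function(data) {
  out <- data.frame(variable = names(data), stringsAsFactors = FALSE)
  attr(out, "n_rows") <- nrow(data)
  attr(out, "n_cols") <- ncol(data)
  out
}

# A summary step that only reads back what was stored earlier.
# It never sees the original data, so it cannot count duplicate rows.
extract_summary <- function(result) {
  list(
    number_of_rows = attr(result, "n_rows"),
    number_of_columns = attr(result, "n_cols")
  )
}

res <- make_result(data.frame(a = 1:3, b = c("x", "y", "z")))
summary_parts <- extract_summary(res)
```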

For a 3.0, we could extend skim_with() to provide a custom summary function. We could store the result of this as a single attribute in the skim_df, and we might consider a custom print handling function (like in #667) or maybe we can simplify the output.

gt() handles grouping variables. http://www.danieldsjoberg.com/gt-and-gtsummary-presentation/#11

So we could require a summary function to produce

[stat group type] [stat name] [value]

Which should give a value that is pretty similar to what we currently generate.
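
As a rough illustration of that shape, here is a base-R sketch that returns one row per statistic. The helper name `summarize_long` and the column names are assumptions for the example, not anything skimr provides.

```r
# Hypothetical helper: compute a few whole-data-frame statistics and
# return them in long form, one row per statistic.
summarize_long <- function(data) {
  data.frame(
    stat_group = c("counts", "counts", "counts"),
    stat_name = c(
      "number_of_rows",
      "number_of_columns",
      "number_of_duplicate_rows"
    ),
    value = c(nrow(data), ncol(data), sum(duplicated(data))),
    stringsAsFactors = FALSE
  )
}

# Toy input in which the second row repeats the first.
toy <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
long_summary <- summarize_long(toy)
```

The value column is numeric in this sketch. Statistics such as the data name or grouping variables are character values, so a real design would need a character or list column there.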

You could even think of a summary interface that is similar to skimr, basically using sfls.

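# Proposed interface sketch from this discussion; `.summary` and
# skimr_summary_fun() are ideas for a 3.0, not an implemented API.
my_skim <- skim_with(
  .summary = skimr_summary_fun(
    metadata = sfl(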
      name = get_data_name,
      group_variables = dplyr::groups
    ),
    counts = sfl(
      number_of_rows = nrow,
      number_of_columns = length,
      number_of_duplicate_rows = ~ sum(duplicated(.))
    ),
    .include_column_types = TRUE
  )
)

The last part is set as a function argument, since counting column types is something we currently do on the skim_df result. The other option there would be to support name = function() values, where the function returns something that can be coerced into stat name - value pairs, and the name becomes the name for the group. That's a lot more flexible, and could probably support summary functions that tell you which columns are most similar or something like that.
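
A minimal base-R sketch of that second option, assuming each named argument is a function of the whole data frame that returns a named vector or list. The helper `summarize_groups` is hypothetical and only shows how the argument name could become the group name.

```r
# Hypothetical helper: each named argument in ... is a function that takes
# the data frame and returns named statistics. The argument name becomes
# the stat group, and the returned names become the stat names.
summarize_groups <- function(data, ...) {
  funs <- list(...)
  pieces <- lapply(names(funs), function(group) {
    stats <- as.list(funs[[group]](data))
    data.frame(
      stat_group = group,
      stat_name = names(stats),
      # Values are stored as character so mixed types fit in one column.
      value = unname(vapply(stats, function(v) as.character(v), character(1))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Toy input in which the second row repeats the first.
toy <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))

result <- summarize_groups(
  toy,
  counts = function(d) c(number_of_rows = nrow(d), number_of_columns = ncol(d)),
  duplicates = function(d) list(number_of_duplicate_rows = sum(duplicated(d)))
)
```

Coercing every value to character is one simple way to satisfy "can be coerced into stat name - value pairs"; it loses the original types, which a real implementation might want to keep.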

What do you think?

elinw commented 1 year ago

I just reread this and yes I really think that an sfl for summary would be the way to go.