Open vorpalvorpal opened 3 years ago
Skimr is pretty column-oriented and you're asking for something row-oriented. That said, I think sum(duplicated(x)) would give that number. Of course, in many data sets repeats are expected.
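For anyone landing here, a minimal base R sketch of that suggestion. The toy data frame and variable names are made up for illustration; the only thing relied on is that duplicated() compares whole rows when given a data frame.

```r
# Toy data frame (illustrative only): rows 3 and 5 repeat an earlier row.
df <- data.frame(
  id    = c(1, 2, 2, 3, 3, 3),
  value = c("a", "b", "b", "c", "c", "d"),
  stringsAsFactors = FALSE
)

# On a data frame, duplicated() flags each row that matches an earlier row,
# so the sum is the number of duplicate rows.
n_duplicate_rows <- sum(duplicated(df))

# On a single column it counts repeated values, which is a different
# (column-oriented) question.
n_duplicate_ids <- sum(duplicated(df$id))
```

Note that the first occurrence of a row is not counted, only the later repeats.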
I think we can go a bit further. The most useful place for this would be in the summary, i.e. https://github.com/ropensci/skimr/blob/22dfec233021f0aba38c9f0bfc5cff62a946a3f9/R/summary.R#L12-L14
I think the implementation depends on how far we should push this.
I was thinking the same thing, i.e. should we make it customizable because this might be the first of many requests to add things. I do think that for our user scenario of "someone gives you a data set and you're trying to understand it" it might be very useful. If there are a lot of duplicates it might be smart to store it in a way that reflects that.
@michaelquinn32 if we are fixing issues on summary we could think about this one.
This is a little more than the current updates to summary(), since we'll need to modify the skim object to store this information. I can get to it soon.
What I was thinking is that eventually, once we have a more flexible summary, a user could really do this themselves.
Could put this on the roadmap too.
Right now, the issue is that we generate all of the summary components as skimr attributes, which we then extract in the summary function.
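To make the attribute pattern concrete, here is a rough base R sketch. It is not skimr's actual code: the helper names and the attribute name n_duplicate_rows are hypothetical, and a plain data frame stands in for the skim result.

```r
# Hypothetical helper: compute a data-level statistic once and store it
# as an attribute on the result object.
attach_duplicate_count <- function(result, data) {
  attr(result, "n_duplicate_rows") <- sum(duplicated(data))
  result
}

# Hypothetical summary step: read the stored attribute back out.
read_duplicate_count <- function(result) {
  attr(result, "n_duplicate_rows")
}

data   <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
result <- data.frame(variable = names(data))  # stand-in for a skim result
result <- attach_duplicate_count(result, data)
stored <- read_duplicate_count(result)
```

The point is only that the count has to be computed while the original data is still available, then carried along on the result for the summary to pick up later.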
For a 3.0, we could extend skim_with() to provide a custom summary function. We could store the result of this as a single attribute in the skim_df, and we might consider a custom print handling function (like in #667) or maybe we can simplify the output. gt() handles grouping variables.
http://www.danieldsjoberg.com/gt-and-gtsummary-presentation/#11
So we could require a summary function to produce
[stat group type] [stat name] [value]
That should give a result that is pretty similar to what we currently generate.
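A rough base R sketch of what such a long-format result could look like. Everything here is an assumption for illustration: summarize_long(), the column names stat_group, stat_name and value, and the nested list of functions are not part of skimr.

```r
# Hypothetical: apply named groups of summary functions to a data frame and
# return one row per statistic as [stat group] [stat name] [value].
summarize_long <- function(data, groups) {
  pieces <- lapply(names(groups), function(group_name) {
    funs <- groups[[group_name]]
    data.frame(
      stat_group = group_name,
      stat_name  = names(funs),
      # Values are stored as character so mixed result types fit in one column.
      value      = unname(vapply(funs, function(f) as.character(f(data)),
                                 character(1))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

groups <- list(
  counts = list(
    number_of_rows           = nrow,
    number_of_columns        = length,
    number_of_duplicate_rows = function(d) sum(duplicated(d))
  ),
  missing = list(
    number_of_missing_values = function(d) sum(is.na(d))
  )
)

toy <- data.frame(x = c(1, 1, NA), y = c("a", "a", "b"))
out <- summarize_long(toy, groups)
```

A printer could then group rows by stat_group, which is roughly how the current summary output is laid out.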
You could even think of a summary interface that is similar to skimr, basically using sfls.
    my_skim <- skim_with(
      .summary = skimr_summary_fun(
        metadata = sfl(
          name = get_data_name,
          group_variables = dplyr::groups
        ),
        counts = sfl(
          number_of_rows = nrow,
          number_of_columns = length,
          number_of_duplicate_rows = ~ sum(duplicated(.))
        ),
        .include_column_types = TRUE
      )
    )
The last part is set as a function argument, since counting column types is something we currently do on the skim_df result. The other option there would be to support name = function() values, where function returns something that can be coerced into stat name - value pairs, and the name value becomes the name for the group. That's a lot more flexible, and probably could support summary functions that tell you which columns are most similar or something like that.
What do you think?
I just reread this and yes I really think that an sfl for summary would be the way to go.
Because I am fairly incompetent, I seem to keep introducing duplicate rows into my data frames. I was wondering if, in the initial data summary bit of the output of skim(), "duplicate rows" might be a useful additional metric.