Add percentiles to describe(), maybe create a new function scale_reverse()

joon-e / tidycomm

tidycomm: Data Modification and Analysis for Communication Research

https://joon-e.github.io/tidycomm/

GNU General Public License v3.0


Add percentiles to describe(), maybe create a new function scale_reverse() #30

Closed LKobilke closed 1 year ago

LKobilke commented 1 year ago

In today's "continuing education for employees" it came up that it would be nice to include percentiles in describe(). Maybe upper/lower 10%.

In addition, we talked about data transformation and that a function to reverse scales would be a helpful addition to tidycomm (to avoid the manual labor in dplyr).
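
For context, the manual approach looks roughly like the sketch below. The data and the column name `trust` are made up for illustration, and a 1-5 response scale is assumed.

```r
library(dplyr)

# Hypothetical survey data: one item measured on a 1-5 scale
survey <- tibble(trust = c(1, 2, 5, 4, 3))

# Manual reversal: new value = (scale min + scale max) - old value,
# so 1 <-> 5 and 2 <-> 4, while 3 stays 3
survey <- survey %>%
  mutate(trust_rev = (1 + 5) - trust)

survey$trust_rev
```

Doing this for every item means remembering the scale endpoints each time, which is the part a dedicated function could take over.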

MarHai commented 1 year ago

Re percentiles: this broadens the output quite significantly and is probably not what users expect by default. We could include it as a parameter with the default percentiles = c(.25, .50, .75), which translates to the current Q25/Mdn/Q75. If other values are included in this vector, we could append them to the returned tibble.
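
A minimal sketch of how such a parameter could behave, in base R. The helper name `describe_percentiles()` and the `Q<nn>` column naming are assumptions for illustration only; this is not how describe() is implemented, and the median is labelled Q50 here rather than Mdn.

```r
# Hypothetical helper, not part of tidycomm
describe_percentiles <- function(x, percentiles = c(.25, .50, .75)) {
  # One quantile per requested probability, missing values ignored
  q <- stats::quantile(x, probs = percentiles, na.rm = TRUE, names = FALSE)
  # Label the columns Q25, Q50, ... based on the requested probabilities
  names(q) <- paste0("Q", round(percentiles * 100))
  # Return a one-row data frame so extra percentiles become extra columns
  as.data.frame(as.list(q))
}

# Default call: the three quartiles
describe_percentiles(1:9)

# Extra values are simply appended as further columns
describe_percentiles(1:9, percentiles = c(.10, .25, .50, .75, .90))
```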

Re data transformation and scale reversing: this is quite a common thing to do, also/particularly for our students. I would like to think that this includes not only reversing (i.e., turning 1-5 into 5-1) but also (z-)standardizing (for which I commonly use rescale). Maybe two functions are called for here: scale_reverse and scale_z?
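
To make the idea concrete, here is a rough base-R sketch of what the two functions could do on a plain numeric vector. The names follow the suggestion above, but the signatures (including the `lower` and `upper` arguments) are assumptions, not an existing API.

```r
# Hypothetical helpers operating on numeric vectors

# Reverse a scale: new value = (lower + upper) - old value.
# By default the endpoints are taken from the observed data; pass them
# explicitly if not every scale point occurs in the data.
scale_reverse <- function(x,
                          lower = min(x, na.rm = TRUE),
                          upper = max(x, na.rm = TRUE)) {
  (lower + upper) - x
}

# z-standardize: subtract the mean, divide by the standard deviation
scale_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
}

scale_reverse(c(1, 2, 5))                        # endpoints inferred as 1 and 5
scale_reverse(c(2, 3, 4), lower = 1, upper = 5)  # endpoints given explicitly
scale_z(c(1, 2, 3, 4, 5))
```

Taking the endpoints from the data is convenient but risky when nobody chose the lowest or highest answer, hence the explicit arguments.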

LKobilke commented 1 year ago

Regarding percentiles: I agree that it is not the default expectation and that it's likely to clutter the tibble. However, I understand why colleagues might miss having these percentiles since we do ask students to calculate them by hand. A possible solution that would allow us to keep the view slim might be to create a new parameter that defaults to the five-number summary (e.g., five_num_summary = TRUE as default). When set to FALSE, we would return a tibble that excludes the five most important percentiles and the range value, but includes the .10, .20, .30, .40, .60, .70, .80, .90 percentiles. This would mean swapping 6 columns for 8, which seems acceptable.
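
Sketched in base R, the toggle could look something like this. The function name `summarise_spread()` and the `P<nn>` column labels are placeholders for illustration; only the five_num_summary argument comes from the proposal.

```r
# Hypothetical helper illustrating the proposed five_num_summary switch
summarise_spread <- function(x, five_num_summary = TRUE) {
  if (five_num_summary) {
    # Five-number summary plus the range: 6 columns
    q <- stats::quantile(x, probs = c(0, .25, .50, .75, 1),
                         na.rm = TRUE, names = FALSE)
    out <- c(Min = q[1], Q25 = q[2], Mdn = q[3], Q75 = q[4], Max = q[5],
             Range = q[5] - q[1])
  } else {
    # Deciles without the median: 8 columns
    probs <- c(.10, .20, .30, .40, .60, .70, .80, .90)
    out <- stats::quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
    names(out) <- paste0("P", round(probs * 100))
  }
  as.data.frame(as.list(out))
}

summarise_spread(1:11)                            # 6 columns
summarise_spread(1:11, five_num_summary = FALSE)  # 8 columns
```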

Regarding scaling: That's a great idea! We should adopt a tidy naming approach for our functions. Perhaps we could consider names such as reverse_scale(), center_scale(), and standardize_scale()?

MarHai commented 1 year ago

Percentiles: It comes down to the same result, I suppose. The question is whether we want a more technical argument (as in percentiles = c(.25, .50, .75)) or a more telling one (as in five_num_summary = TRUE). I have a slight preference for the first but am fine with the second as well.

Scaling: I've implemented four scaling functions into #33, waiting for review.

joon-e commented 1 year ago

If we are to implement further percentiles in describe(), I would strongly argue in favor of the more technical argument. As a user, I would certainly not expect five_num_summary = FALSE to remove the range (in addition to min, max, mdn, q25 and q75) and add 8 other percentiles instead.

But I also like that describe() is opinionated in that it provides the parameters that are deemed to be most important. If users can provide their own input for the percentiles, why not also for the confidence interval(s), or to exclude the other moments, or to calculate a trimmed mean, etc.?

Maybe a specific calculate_percentiles() (or better-named) function would be better suited for getting custom percentiles (this could also provide a nice visualization, e.g., a density curve with the percentiles drawn in).
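
As a rough idea of what I mean, in base R only. The arguments `probs` and `plot`, the defaults, and the `P<nn>` labels are all assumptions; a real version would take a tibble and probably use ggplot2.

```r
# Hypothetical standalone function, sketched for a plain numeric vector
calculate_percentiles <- function(x, probs = c(.10, .90), plot = FALSE) {
  q <- stats::quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  if (plot) {
    # Density curve with one dashed vertical line per requested percentile
    d <- stats::density(x, na.rm = TRUE)
    plot(d, main = "Density with percentiles")
    graphics::abline(v = q, lty = 2)
  }
  # Named vector, e.g. P10 and P90
  stats::setNames(q, paste0("P", round(probs * 100)))
}

calculate_percentiles(1:11)
calculate_percentiles(c(2, 4, 4, 5, 7, 9, 10), probs = c(.2, .8), plot = TRUE)
```

This would keep describe() opinionated while still giving people a way to get arbitrary percentiles.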