Open mark-andrews opened 2 years ago
You could, perhaps, kill two birds with one stone here. The name of the function only works if your aggregate function is a simple sum (I think having mean as the default is misleading). So what about a generic aggregate_scores function, along the lines that you've proposed, but then a family of shortcuts - total_scores, sumlike_scores, mean_scores - that don't take a method argument but do what they say.
There's an R / tidyverse tradition of doing this sort of thing, and I think it makes sense from a teaching perspective. If your students just need totals, just teach the shortcut. Introduce others, or the generic function only when needed, and once total is understood.
Also (I don't want to start new issues for these unless you think it's worthwhile): Thinking the same way, you could also make starts_with the default selector. Students have control over their variable names. So again from a teaching point of view this engenders good practice. Even if they mess up variable naming, they are better off going through and relabelling than using different selection functions (which, realistically, you're not going to want to teach). So this would mean that your summing could be achieved with just. And also make .append and .drop true by default. Then, for 90% of introductory course use-cases students could just do...
total_scores(anxiety, depression, efficacy, sociability, stress)
I think this is consistent with the psyntur philosophy?
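A minimal base R sketch of what "starts_with as the default selector" could look like. The function name total_scores_prefix is hypothetical, and the meanings given to .append (attach the totals to the data) and .drop (remove the constituent items) are assumptions for illustration, not the psyntur implementation:

```r
# Hypothetical sketch: each bare name passed in ... is treated as a column-name prefix.
total_scores_prefix <- function(data, ..., .append = TRUE, .drop = TRUE) {
  # Capture the unquoted arguments as strings, e.g. anxiety -> "anxiety"
  prefixes <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  totals <- list()
  used <- character(0)
  for (p in prefixes) {
    # Select every column whose name starts with the prefix
    cols <- names(data)[startsWith(names(data), p)]
    if (length(cols) == 0) stop("No columns start with '", p, "'")
    totals[[p]] <- unname(rowSums(data[cols]))
    used <- c(used, cols)
  }
  totals <- as.data.frame(totals)
  if (!.append) return(totals)                        # totals only
  if (.drop) data <- data[setdiff(names(data), used)] # remove the items
  cbind(data, totals)
}

df <- data.frame(
  id = 1:2,
  anxiety_1 = c(1, 2), anxiety_2 = c(3, 4),
  stress_1 = c(5, 6), stress_2 = c(1, 1)
)
out <- total_scores_prefix(df, anxiety, stress)
```

One caveat with prefix matching: a prefix such as anxiety would also pick up an existing column named anxiety_total, so consistent item naming matters.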
Also (I don't want to start new issues for these unless you think it's worthwhile): Thinking the same way, you could also make starts_with the default selector.
I do think it is worthwhile to have this as another issue. I want commits to be linked to issues as much as possible, and commits should be atomic so issues should be atomic. So new Issue is here: #40
So what about a generic aggregate_scores function, along the lines that you've proposed, but then a family of shortcuts - total_scores, sumlike_scores, mean_scores that don't take a method argument but do what they say.
Good idea. And yes, there is a tidyverse tradition of doing this. This is especially the case with purrr, I think.
The name of the function only works if your aggregate function is a simple sum (I think having mean as the default is misleading).
Yes, the name is currently not right. A more appropriate name for it, or at least for what it pretends to be, is something like aggregate_scores (with "aggregate" being the verb and so having the proper pronunciation as such), where the method of aggregation is arbitrary. Or maybe aggregate_items would be better than aggregate_scores.
In that case, there could be a general purpose function named aggregate_scores, and then special cases sum_scores (or total_scores), mean_scores, etc.
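As a rough illustration of the "general function plus shortcuts" idea, here is a base R sketch. The names with the _demo suffix and their signatures are hypothetical, chosen only to show the shape of the design:

```r
# Hypothetical general-purpose aggregator: applies .method to each row
# of the selected columns and returns one value per row.
aggregate_scores_demo <- function(data, cols, .method = sum) {
  unname(apply(as.matrix(data[cols]), 1, .method))
}

# Shortcuts that take no method argument and do what their names say
total_scores_demo <- function(data, cols) aggregate_scores_demo(data, cols, sum)
mean_scores_demo  <- function(data, cols) aggregate_scores_demo(data, cols, mean)

items <- data.frame(x_1 = c(1, 2), x_2 = c(3, 6))
totals <- total_scores_demo(items, c("x_1", "x_2"))
means  <- mean_scores_demo(items, c("x_1", "x_2"))
```

The shortcuts are thin wrappers, so only the general function needs to be maintained and the shortcuts can be taught on their own.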
And also make .append and .drop true by default.
We'll keep this as related to #38 and when that change is applied, I will close #38.
This implies that the intended purpose is to aggregate sets of variables in one specific way. I don't know if it is. As I see it, it is to produce the aggregate scores for all the sets of items that need to be aggregated. It is intended to do that part of a typical psychometrics data analysis workflow whereby all the constituent items are aggregated. In this step, we can assume we have items x_1 ... x_n, y_1 ... y_m, z_1 ... z_k, ..., w_1 ... w_h. So some unspecified number of sets, each with an unspecified number of constituent items. And for each set, we want to reduce it in some way to a single variable. Typically, the items in each set are reduced to their mean or to their sum, but there is no necessity that all sets are reduced using the same method, and in fact in general in psychometrics, the method of aggregation is specific to each scale. So in general then, we want all sets of items (that need to be aggregated) to be aggregated in the way they should be aggregated.
In my original conception of this problem, I wanted something along these lines:
In other words, it was using a general mutate command, and it allows arbitrary functions to be applied to sets of items that are selected using any one of the many ways we can select variables in tidyverse (starts_with, ends_with, contains, by name, by index, consecutive sets of names, consecutive sets of indices, and so on). Now the above code does not and cannot work. Then with the advent of c_across, I thought that was all we needed. And it sort of is: you just need to put your selections in c_across, oh and don't forget to do rowwise first, and remember to select at the end if you don't want everything in the returned data-frame and don't forget to ungroup either or something very weird might happen later.
Now, that is what I would recommend for grown-ups, but the whole purpose of psyntur is to reduce code complexity and not require use of pipelines and not require functions like rowwise and ungroup and c_across, which inexperienced types will forget (they did) because they don't really make intuitive sense.
So then along comes total_scores, and we accomplish everything in the above code as follows:
And there was much rejoicing.
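For reference, the rowwise / c_across workflow described above can be sketched with dplyr as follows. The data frame and the column names (anxiety_*, depression_*) are made up for illustration, and this is not the snippet originally posted in the thread:

```r
library(dplyr)

# Made-up example data: two items per scale
df <- tibble(
  anxiety_1 = c(1, 2), anxiety_2 = c(3, 4),
  depression_1 = c(2, 4), depression_2 = c(4, 8)
)

result <- df %>%
  rowwise() %>%                       # compute per row, not per column
  mutate(
    anxiety = sum(c_across(starts_with("anxiety_"))),
    depression = mean(c_across(starts_with("depression_")))
  ) %>%
  ungroup() %>%                       # drop the rowwise grouping again
  select(anxiety, depression)         # keep only the aggregate scores
```

Each of the four steps (rowwise, c_across, ungroup, select) is a separate thing to remember, which is the complexity a single helper function is meant to hide.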
But notice how in the c_across version, we could easily put in sum for the anxiety_ items, and mean for the depression_ ones, and any other function for any other set of items. That is not possible in total_scores. They all use the same one .method. Not only that but the value of .method is a string that selects one of currently three options. We cannot therefore do something like
where my_mad_function is whatever I want.
So dealing with those two issues, i.e. using a different method for each set of items and allowing arbitrary functions as methods, is what I want to consider here.
For the first issue, here is what I have in mind:
And, of course, if we just do .method = 'sum', then it is 'sum' for all sets. And, of course, there is a default value for .method too.
And because it is a drag if you have, say, 10 sets of items and you want all to be summed except for one, for which we want the mean, then some variant like this might be desirable:
Here, * acts as a wildcard to mean "everything else".
To deal with the second of the above issues, i.e. arbitrary functions, internally in total_scores, we could use c_across (it was not actually required so far because only means and sums, more or less, were being used, and once the columns have been selected, you can just use rowSums or rowMeans). Then any function can be used. So we could do something like
where the values of .method are now functions, though we could still allow for .method = 'mean' etc. to allow code to be backwards compatible.
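Putting the two proposals together, one possible shape is sketched below in base R. Everything here is an assumption made for illustration: the function name aggregate_sets_demo, passing the sets as a named list of column names, and using apply internally rather than c_across. It is not the interface proposed in the thread, only one way the per-set methods, the "*" wildcard, and function-valued methods could fit together:

```r
# Hypothetical sketch: per-set methods, a "*" wildcard, and arbitrary functions.
aggregate_sets_demo <- function(data, sets, .method = "mean") {
  # String options kept so that older calls like .method = 'mean' still work
  builtin <- list(sum = sum, mean = mean)
  resolve <- function(m) {
    if (is.function(m)) return(m)
    if (is.character(m) && m %in% names(builtin)) return(builtin[[m]])
    stop("Unknown aggregation method")
  }
  # A single string or function applies to every set
  if (!is.list(.method)) .method <- list("*" = .method)
  out <- list()
  for (s in names(sets)) {
    # Use the set's own method if given, otherwise the wildcard entry
    m <- if (!is.null(.method[[s]])) .method[[s]] else .method[["*"]]
    if (is.null(m)) stop("No method given for set '", s, "'")
    out[[s]] <- unname(apply(as.matrix(data[sets[[s]]]), 1, resolve(m)))
  }
  as.data.frame(out)
}

df <- data.frame(
  anxiety_1 = c(1, 2), anxiety_2 = c(3, 4),
  depression_1 = c(2, 4), depression_2 = c(4, 8),
  stress_1 = c(1, 5), stress_2 = c(3, 1)
)
sets <- list(
  anxiety = c("anxiety_1", "anxiety_2"),
  depression = c("depression_1", "depression_2"),
  stress = c("stress_1", "stress_2")
)
my_mad_function <- function(x) max(x) - min(x)  # any row-wise summary
scores <- aggregate_sets_demo(
  df, sets,
  .method = list(depression = "mean", stress = my_mad_function, "*" = "sum")
)
```

Accepting both strings and functions keeps existing string-based calls working while allowing any function that maps a vector of item values to a single number.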