Multiple arbitrary aggregation methods in `total_scores`

mark-andrews commented 2 years ago

> I definitely wouldn’t add the option of different aggregates in the same function. .... total_scores aggregates several sets of variables in one specific way. If you want to aggregate different variables in different ways, you use it more than once. That seems intuitive to me.

This implies that the intended purpose is to aggregate sets of variables in one specific way. I don't know if it is. As I see it, the purpose is to produce the aggregate scores for all the sets of items that need to be aggregated. It is intended to do that part of a typical psychometrics data analysis workflow whereby all the constituent items are aggregated. In this step, we can assume we have items x_1 ... x_n, y_1 ... y_m, z_1 ... z_k, ..., w_1 ... w_h. So some unspecified number of sets, each with an unspecified number of constituent items. And for each set, we want to reduce it in some way to a single variable. Typically, the items in each set are reduced to their mean or to their sum, but there is no necessity that all sets are reduced using the same method, and in fact in general in psychometrics, the method of aggregation is specific to each scale. So in general then, we want all sets of items (that need to be aggregated) to be aggregated in the way they should be aggregated.

In my original conception of this problem, I wanted something along these lines:

mutate(data_df,
       x = sum(starts_with('x_')),
       y = mean(starts_with('y_')),
       z = median(contains('foo'))
)

In other words, it was using a general mutate command, and it allows arbitrary functions to be applied to sets of items that are selected using any one of the many ways we can select variables in tidyverse (starts_with, ends_with, contains, by name, by index, consecutive sets of names, consecutive sets of indices, and so on). Now the above code does not and cannot work. Then with the advent of c_across, I thought that was all we needed. And it sort of is: you just need to put your selections in c_across, oh and don't forget to do rowwise first, and remember to select at the end if you don't want everything in the returned data frame, and don't forget to ungroup either or something very weird might happen later.

psymetr_df_total <- psymetr_df_fix %>% 
  rowwise() %>% 
  mutate(anxiety = sum(c_across(starts_with('anxiety_'))),
         depression = sum(c_across(starts_with('depression_'))),
         efficacy = sum(c_across(starts_with('efficacy_'))),
         sociability = sum(c_across(starts_with('sociability_'))),
         stress = sum(c_across(starts_with('stress_')))) %>% 
  select(anxiety, depression, efficacy, sociability, stress) %>%
  ungroup()

Now, that is what I would recommend for grown-ups, but the whole purpose of psyntur is to reduce code complexity and not require use of pipelines and not require the likes of functions like rowwise and ungroup and c_across, which inexperienced types will forget (they did) because they don't really make intuitive sense.

So then along comes total_scores, and we accomplish everything in the above code as follows:

psymetr_df_total <- total_scores(psymetr_df_fix,
                                 anxiety = starts_with('anxiety_'),
                                 depression = starts_with('depression_'),
                                 efficacy = starts_with('efficacy_'),
                                 sociability = starts_with('sociability_'),
                                 stress = starts_with('stress_'),
                                 .method = 'sum'
)

And there was much rejoicing.

But notice how in the c_across version, we could easily put in sum for the anxiety_ items, and mean for the depression_ ones, and any other function for any other set of items. That is not possible in total_scores. They all use the same one .method. Not only that but the value of .method is a string that selects one of currently three options. We can not therefore do something like

psymetr_df_total <- total_scores(psymetr_df_fix,
                                 anxiety = starts_with('anxiety_'),
                                 ...
                                 .method = my_mad_function
)

where my_mad_function is whatever I want.
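For comparison, here is a minimal sketch of the c_across version doing both things at once on toy data: a different function per set, and one of them arbitrary. The data values and the body of my_mad_function are made up for illustration.

```r
library(dplyr)

# Toy data with made-up values; two items per scale.
toy_df <- tibble(
  anxiety_1 = c(1, 2, 3), anxiety_2 = c(3, 2, 1),
  depression_1 = c(2, 4, 6), depression_2 = c(4, 4, 0),
  stress_1 = c(1, 5, 2), stress_2 = c(4, 5, 0)
)

# Stand-in for "whatever I want": here, the range of the items within a row.
my_mad_function <- function(x) max(x) - min(x)

toy_totals <- toy_df %>%
  rowwise() %>%
  mutate(
    anxiety = sum(c_across(starts_with("anxiety_"))),
    depression = mean(c_across(starts_with("depression_"))),
    stress = my_mad_function(c_across(starts_with("stress_")))
  ) %>%
  ungroup() %>%
  select(anxiety, depression, stress)
```

Each line of the mutate names its own function, which is exactly the flexibility that a single string-valued .method gives up.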

So dealing with those two issues, i.e.

- different functions for different sets of items
- arbitrary functions as methods of aggregation

is what I want to consider here.

For the first issue, here is what I have in mind:

psymetr_df_total <- total_scores(psymetr_df_fix,
                                 anxiety = starts_with('anxiety_'),
                                 depression = starts_with('depression_'),
                                 efficacy = starts_with('efficacy_'),
                                 sociability = starts_with('sociability_'),
                                 stress = starts_with('stress_'),
                                 .method = c(anxiety = 'sum', depression = 'mean', efficacy = 'sum',
                                             sociability = 'mean', stress = 'sum_like')
)

And, of course, if we just do .method = 'sum', then it is 'sum' for all sets. And, of course, there is a default value for .method too.

And because it is a drag if you have, say, 10 sets of items and you want all to be summed except for one, for which we want the mean, then some variant like this might be desirable:

psymetr_df_total <- total_scores(psymetr_df_fix,
                                 anxiety = starts_with('anxiety_'),
                                 depression = starts_with('depression_'),
                                 efficacy = starts_with('efficacy_'),
                                 sociability = starts_with('sociability_'),
                                 stress = starts_with('stress_'),
                                 .method = c(anxiety = 'mean', `*` = 'sum')
)

Here, * acts as a wildcard to mean "everything else".
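A rough sketch of how the wildcard could be resolved internally, in base R. The helper name resolve_methods and its error messages are hypothetical, not anything that exists in psyntur; it only shows the expansion of a .method specification into one method per set.

```r
# Hypothetical helper: expand a `.method` specification into one method per set.
resolve_methods <- function(set_names, .method) {
  # A single unnamed value applies to every set.
  if (is.null(names(.method))) {
    stopifnot(length(.method) == 1)
    return(setNames(rep(.method, length(set_names)), set_names))
  }

  # Names in `.method` must be either a set name or the wildcard.
  unknown <- setdiff(names(.method), c(set_names, "*"))
  if (length(unknown) > 0) {
    stop("Unknown set(s) in .method: ", paste(unknown, collapse = ", "))
  }

  # Start every set at the wildcard value (or NA if there is no wildcard),
  # then overwrite the sets that were named explicitly.
  fallback <- if ("*" %in% names(.method)) .method[["*"]] else NA_character_
  resolved <- setNames(rep(fallback, length(set_names)), set_names)
  explicit <- intersect(set_names, names(.method))
  resolved[explicit] <- .method[explicit]

  if (anyNA(resolved)) {
    stop("No method given for: ",
         paste(names(resolved)[is.na(resolved)], collapse = ", "))
  }
  resolved
}

sets <- c("anxiety", "depression", "efficacy", "sociability", "stress")
resolved <- resolve_methods(sets, c(anxiety = "mean", `*` = "sum"))
```

With this, anxiety resolves to 'mean' and the other four sets resolve to 'sum', while a plain .method = 'sum' is simply recycled.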

To deal with the second of the above issues, i.e. arbitrary functions, internally in total_scores, we could use c_across (it was not actually required so far because only means and sums, more or less, were being used, and once the columns have been selected, you can just use rowSums or rowMeans). Then any function can be used. So we could do something like

psymetr_df_total <- total_scores(psymetr_df_fix,
                                 anxiety = starts_with('anxiety_'),
                                 depression = starts_with('depression_'),
                                 efficacy = starts_with('efficacy_'),
                                 sociability = starts_with('sociability_'),
                                 stress = starts_with('stress_'),
                                 .method = c(anxiety = sum, depression = mean, efficacy = median,
                                             sociability = my_mad_function, stress = ~sum(.)/length(.))
)

where the values of .method are now functions, though we could still allow for .method = 'mean' etc. so that existing code remains backwards compatible.
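To make that concrete, here is a minimal sketch of what such a function might look like. It is not the psyntur implementation: the name aggregate_scores_sketch is hypothetical, it assumes rlang::as_function to accept functions, formulas, and strings alike, and it applies each function over rows with base apply() as a stand-in for c_across. The toy data are made up.

```r
library(dplyr)

# Hypothetical sketch only; not the psyntur implementation.
aggregate_scores_sketch <- function(.data, ..., .method) {
  selectors <- rlang::enquos(...)
  set_names <- names(selectors)

  # A single function, formula, or string is recycled across all sets.
  if (is.function(.method) || rlang::is_formula(.method) || rlang::is_string(.method)) {
    methods <- setNames(rep(list(.method), length(set_names)), set_names)
  } else {
    methods <- as.list(.method)
  }
  missing_sets <- setdiff(set_names, names(methods))
  if (length(missing_sets) > 0) {
    stop("No method given for: ", paste(missing_sets, collapse = ", "))
  }

  scores <- lapply(set_names, function(nm) {
    # Resolve the tidyselect expression to column positions.
    cols <- tidyselect::eval_select(selectors[[nm]], .data)
    # Turn a function, formula, or function name into a function.
    f <- rlang::as_function(methods[[nm]])
    # Apply it to each row of the selected items.
    unname(apply(as.matrix(.data[unname(cols)]), 1, f))
  })
  as_tibble(setNames(scores, set_names))
}

# Toy data with made-up values.
toy_df <- tibble(
  anxiety_1 = c(1, 2, 3), anxiety_2 = c(3, 2, 1),
  depression_1 = c(2, 4, 6), depression_2 = c(4, 4, 0),
  stress_1 = c(1, 5, 2), stress_2 = c(4, 5, 0)
)

toy_scores <- aggregate_scores_sketch(
  toy_df,
  anxiety = starts_with("anxiety_"),
  depression = starts_with("depression_"),
  stress = starts_with("stress_"),
  .method = list(anxiety = sum, depression = "mean", stress = ~ sum(.) / length(.))
)
```

The example passes the methods in a list() to keep the mix of function, string, and formula explicit; the string 'mean' still works, which is the backwards-compatible path.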

Mark-Torrance commented 2 years ago

You could, perhaps, kill two birds with one stone here. The name of the function only works if your aggregate function is a simple sum (I think having mean as the default is misleading). So what about a generic aggregate_scores function, along the lines that you've proposed, but then a family of shortcuts - total_scores, sumlike_scores, mean_scores that don't take a method argument but do what they say.

There's an R / tidyverse tradition of doing this sort of thing, and I think it makes sense from a teaching perspective. If your students just need totals, just teach the shortcut. Introduce others, or the generic function only when needed, and once total is understood.

Also (I don't want to start new issues for these unless you think it's worthwhile): Thinking the same way, you could also make starts_with the default selector. Students have control over their variable names. So again from a teaching point of view this engenders good practice. Even if they mess up variable naming, they are better off going through and relabelling than using different selection functions (which, realistically, you're not going to want to teach). So this would mean that your summing could be achieved with just. And also make .append and .drop true by default. Then, for 90% of introductory course use-cases students could just do...

total_scores(anxiety, depression, efficacy, sociability, stress)

I think this is consistent with the psyntur philosophy?
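A rough sketch of the prefix-as-default idea, assuming each bare name stands for "all columns starting with that name plus an underscore". The function name total_scores_by_prefix is hypothetical, the data frame is passed explicitly as the first argument, and the toy values are made up.

```r
# Hypothetical sketch: bare set names are treated as column-name prefixes.
total_scores_by_prefix <- function(.data, ...) {
  # Capture the bare names (anxiety, stress, ...) as strings.
  set_names <- unname(vapply(rlang::ensyms(...), rlang::as_string, character(1)))

  totals <- lapply(set_names, function(nm) {
    is_item <- startsWith(names(.data), paste0(nm, "_"))
    if (!any(is_item)) stop("No columns start with '", nm, "_'")
    # Sum the matching items within each row.
    unname(rowSums(.data[is_item]))
  })
  names(totals) <- set_names
  as.data.frame(totals)
}

# Toy data with made-up values.
toy_df <- data.frame(
  anxiety_1 = c(1, 2), anxiety_2 = c(3, 4),
  stress_1 = c(0, 1), stress_2 = c(5, 5)
)

prefix_totals <- total_scores_by_prefix(toy_df, anxiety, stress)
```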

mark-andrews commented 2 years ago

> Also (I don't want to start new issues for these unless you think it's worthwhile): Thinking the same way, you could also make starts_with the default selector.

I do think it is worthwhile to have this as another issue. I want commits to be linked to issues as much as possible, and commits should be atomic so issues should be atomic. So new Issue is here: #40

> So what about a generic aggregate_scores function, along the lines that you've proposed, but then a family of shortcuts - total_scores, sumlike_scores, mean_scores that don't take a method argument but do what they say.

Good idea. And yes, there is a tidyverse tradition of doing this. This is especially the case with purrr I think.

> The name of the function only works if your aggregate function is a simple sum (I think having mean as the default is misleading).

Yes, the name is currently not right. A more appropriate name for it, or at least for what it pretends to be, is something like aggregate_scores (with "aggregate" being the verb and so having the proper pronunciation as such), where the method of aggregation is arbitrary. Or maybe aggregate_items would be better than aggregate_scores.

In that case, there could be a general purpose function named aggregate_scores, and then special cases sum_scores (or total_scores), mean_scores, etc.
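As a sketch of that layout, the shortcuts can be one-line wrappers that forward their selections to the general function with the method fixed. Everything here is hypothetical: the names carry a _sketch suffix, the general function takes a single function .f for brevity, and it uses base apply() over rows rather than whatever psyntur would use internally.

```r
library(dplyr)

# Hypothetical general-purpose function: one function `.f` for all sets.
aggregate_items_sketch <- function(.data, ..., .f) {
  selectors <- rlang::enquos(...)
  set_names <- names(selectors)
  scores <- lapply(set_names, function(nm) {
    cols <- tidyselect::eval_select(selectors[[nm]], .data)
    unname(apply(as.matrix(.data[unname(cols)]), 1, .f))
  })
  as_tibble(setNames(scores, set_names))
}

# Special cases: no method argument, they do what they say.
sum_scores_sketch <- function(.data, ...) {
  aggregate_items_sketch(.data, ..., .f = sum)
}
mean_scores_sketch <- function(.data, ...) {
  aggregate_items_sketch(.data, ..., .f = mean)
}

# Toy data with made-up values.
toy_df <- tibble(anxiety_1 = c(1, 2, 3), anxiety_2 = c(3, 2, 1))

toy_sums <- sum_scores_sketch(toy_df, anxiety = starts_with("anxiety_"))
toy_means <- mean_scores_sketch(toy_df, anxiety = starts_with("anxiety_"))
```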

> And also make .append and .drop true by default.

We'll keep this as related to #38 and when that change is applied, I will close #38.

mark-andrews / psyntur

Multiple arbitrary aggregation methods in `total_scores` #39