singmann / afex

Analysis of Factorial EXperiments (R package)

Argument order in aov_ez #2

Open dgromer opened 9 years ago

dgromer commented 9 years ago

Is there a specific reason why id is the first argument in aov_ez, and data is the third?

It would make more sense to me if data was the first argument, because then it would fit nicely into pipelines like

data %>%
  dplyr::filter(some_filtering_here) %>%
  aov_ez("id", "dv", further_arguments_here)

instead of now

data %>%
  dplyr::filter(some_filtering_here) %>%
  aov_ez("id", "dv", ., further_arguments_here)

singmann commented 9 years ago

I think the reason for having both id and dv before data was to clearly separate those required arguments from the two more or less optional arguments, between and within. But I see that even without dplyr it could make sense to have data as the first argument, e.g., when using lapply. On the other hand, that design would make it easy to run ANOVAs on many subsets of the data, which in principle would only exacerbate the existing problem of Type I error accumulation (which exists for any multifactor ANOVA).

I have also just recently made some rather substantial changes to the interface, so I am not sure now is the time for more, but it is an idea I will keep in mind. Perhaps adding an alternative version with a changed argument order, e.g., aov_ez2, could work for the time being.
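
For illustration, such a wrapper could look roughly like the sketch below. It is hypothetical and not part of afex: the name aov_ez2 is taken from the suggestion above, and the sketch assumes that aov_ez() accepts id, dv, and data as named arguments, as in the signature quoted further down in this thread.

```r
# Hypothetical wrapper (not part of afex): like aov_ez(), but data comes first.
aov_ez2 <- function(data, id, dv, ...) {
  # Forward everything by name so the original argument order no longer matters.
  afex::aov_ez(id = id, dv = dv, data = data, ...)
}
```

With this in place, a call such as aov_ez2(my_data, "id", "dv", between = "group") would pass its arguments through to aov_ez() unchanged (my_data and the column names here are placeholders).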

dgromer commented 9 years ago

Having data as the first argument seems more intuitive to me, since then all following arguments clearly refer to this data frame. Right now, the first two arguments are somewhat out of context in my opinion. And aov_ez would be more similar to ez::ezANOVA ;)

However, I wouldn't include the aov_ez2 wrapper, because it could make things more complicated for people starting with the package.

singmann commented 9 years ago

I am somewhat inclined to make this change. The only problem is, this will really break a lot of existing code using this function, so it would be quite a big change.

I have some plans for making some rather drastic changes (e.g., harmonizing all function and argument names to use _ instead of .) for version 1.0. And if I decide to do so, I will include this change as well (I will keep it open to remind me).

mattansb commented 3 years ago

@singmann This is somewhat related: when there are only between-subjects factors (and only one observation per subject), there really is no reason to require an input for id. I know it is not good practice to allow the first argument to be missing while requiring others, but it is possible to have:

aov_ez <- function (id, dv, data, 
                    between = NULL, within = NULL, covariate = NULL, 
                    observed = NULL,
                    fun_aggregate = NULL, transformation, 
                    type = afex_options("type"), 
                    factorize = afex_options("factorize"),
                    check_contrasts = afex_options("check_contrasts"), 
                    return = afex_options("return_aov"), 
                    anova_table = list(), 
                    include_aov = afex_options("include_aov"), 
                    ..., 
                    print.formula = FALSE) {

  if (missing(id)) {
    # warning?
    data$.id_var <- seq_len(nrow(data))
    id <- ".id_var"
  }

  ...
}

singmann commented 3 years ago

First of all, great to get back to this issue after a long time. Funny to read my thoughts from 5 years back, even though I decided against implementing that.

Anyway, to get to your point @mattansb, I know that sometimes one just has between-subjects data and doesn't really need the participant identifier. However, I feel that, from a conceptual and pedagogical perspective, always requiring the user to specify the participant identifier is good. A data set should always have such a column to ensure that nothing goes wrong during data manipulation/preparation. From my teaching experience, it is not too difficult to explain that data always needs this column. However, data without a participant identifier can lead to problems.

What this means is that I am unlikely to accept changes that will enable this behaviour. In any case, if I were to be convinced, it would have to be added to aov_car() as well, as this is the main function.

mattansb commented 3 years ago

Fair enough (:

In any case, regarding @dgromer's original issue: using the native pipe (R >= 4.1.0), one can still pipe even without the . placeholder, by naming the id and dv arguments:

data(obk.long, package = "afex")

obk.long |> 
  dplyr::filter(gender == "F") |> 
  afex::aov_ez(id = "id", dv = "value", 
               between = "treatment", 
               within = c("phase", "hour"))
#> Anova Table (Type 3 tests)
#> 
#> Response: value
#>                 Effect     df   MSE        F  ges p.value
#> 1            treatment   2, 5 11.83     2.56 .240    .171
#> 2                phase  2, 10  5.08   4.64 * .197    .038
#> 3      treatment:phase  4, 10  5.08     2.40 .202    .119
#> 4                 hour  4, 20  2.16 7.66 *** .256   <.001
#> 5       treatment:hour  8, 20  2.16     0.28 .025    .963
#> 6           phase:hour  8, 40  0.97     1.21 .047    .316
#> 7 treatment:phase:hour 16, 40  0.97     0.60 .046    .864
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
#> 
#> Sphericity correction method: GG
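
Why this works can be seen in a small base-R sketch. The function toy_ez below is a made-up stand-in with the same leading argument order as aov_ez (id, dv, data); it is not afex code. R matches named arguments first, so the piped data frame lands in the only positional slot left, which is data.

```r
# Toy stand-in with the same leading argument order as aov_ez(); not afex code.
toy_ez <- function(id, dv, data, ...) {
  sprintf("id = %s, dv = %s, rows = %d", id, dv, nrow(data))
}

df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.0))

# The native pipe rewrites this call to toy_ez(df, id = "id", dv = "value").
# id and dv are matched by name, so the unnamed df is matched to data.
res <- df |> toy_ez(id = "id", dv = "value")
print(res)
```

If id and dv were passed positionally instead, the piped data frame would be matched to id, which is the problem raised in the original post.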

singmann commented 3 years ago

The native pipe call works great. Thanks for showing that.

Can you elaborate a bit on the real data analysis situations in which omitting the id variable is really a tremendous benefit? I know that some example data does not have it, but I feel like real data basically always has it, so it does not seem like an actual problem to me.

mattansb commented 3 years ago

Nah, you were right - best practice would be to have an ID column.