Open dgromer opened 9 years ago
I think the reason for having both `id` and `dv` before `data` was to clearly separate those required arguments from the two more or less optional arguments `between` and `within`. But I see that even without dplyr it could make sense to have `data` as the first argument, e.g., when using `lapply`. On the other hand, this design would make it easy to run ANOVAs on many subsets of the data, which would in principle only exacerbate the existing problem of Type I error accumulation (which exists for any multifactor ANOVA).
I have also just recently made some rather strong changes to the interface, so I am not sure now is the time for the next ones, but it is an idea I will keep in mind. Perhaps only adding an alternative version with a changed ordering of arguments, e.g., `aov_ez2`, could work for the time being.
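Such an alternative would only need to reorder and forward the arguments; a minimal sketch, assuming the hypothetical name `aov_ez2` floated above (this wrapper is not part of afex):

```r
# Hypothetical sketch, not part of afex: a thin wrapper around
# afex::aov_ez() that takes `data` first and forwards everything else.
aov_ez2 <- function(data, id, dv, ...) {
  afex::aov_ez(id = id, dv = dv, data = data, ...)
}

# With `data` first, the function slots directly into a pipeline, e.g.:
# obk.long |> aov_ez2("id", "value", between = "treatment")
```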
Having `data` as the first argument seems more intuitive to me, since then all following arguments clearly refer to this data frame. Right now, the first two arguments are somewhat out of context, in my opinion. And `aov_ez` would be more similar to `ez::ezANOVA` ;)
However, I wouldn't include the `aov_ez2` wrapper, because it could make things more complicated for people starting with the package.
I am somewhat inclined to make this change. The only problem is that it would really break a lot of existing code using this function, so it would be quite a big change.
I have some plans for making rather drastic changes for version 1.0 (e.g., harmonizing all function and argument names to use `_` instead of `.`). If I decide to do so, I will include this change as well (I will keep the issue open to remind me).
@singmann This is somewhat related: when there are only between-subjects factors (and only one observation per subject), there is really no reason to require an input for `id`.
I know it is not good practice to allow the first argument to be missing while requiring others, but it is possible to have:
```r
aov_ez <- function(id, dv, data,
                   between = NULL, within = NULL, covariate = NULL,
                   observed = NULL,
                   fun_aggregate = NULL, transformation,
                   type = afex_options("type"),
                   factorize = afex_options("factorize"),
                   check_contrasts = afex_options("check_contrasts"),
                   return = afex_options("return_aov"),
                   anova_table = list(),
                   include_aov = afex_options("include_aov"),
                   ...,
                   print.formula = FALSE) {
  if (missing(id)) {
    # warning?
    data$.id_var <- seq_len(nrow(data))
    id <- ".id_var"
  }
  ...
}
```
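The fallback itself is plain base R; a standalone demonstration of what the `missing(id)` branch would do (the data frame below is invented for illustration):

```r
# Stand-in for the missing(id) branch above: give each row its own
# identifier, which is only valid when there is exactly one observation
# per subject (illustrative data, not from the thread).
d <- data.frame(value = c(1.2, 3.4, 2.1), treatment = c("A", "B", "A"))
d$.id_var <- seq_len(nrow(d))
d$.id_var
#> [1] 1 2 3
```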
First of all, great to get back to this issue after a long time. Funny to read my thoughts from 5 years back, even though I decided against implementing that.
Anyway, to get to your point @mattansb: I know that sometimes one just has between-subjects data and doesn't really need the participant identifier. However, from a conceptual and pedagogical perspective, I feel that always requiring the user to specify the participant identifier is good. A data set should always have such a column to ensure that nothing goes wrong during data manipulation/preparation. From my teaching experience, it is not too difficult to explain that data always needs this column, whereas data without a participant identifier can lead to problems.
What this means is that I am unlikely to accept changes that would enable this behaviour. In any case, if I were to be convinced, it would also have to be added to `aov_car()`, as this is the main function.
Fair enough (:
In any case, regarding @dgromer's original issue: using the native pipe (R >= 4.1.0), one can still pipe even without the magrittr `.` placeholder, with some creativity (since `id` and `dv` are matched by name, the piped data frame falls through to `data`):

```r
data(obk.long, package = "afex")

obk.long |>
  dplyr::filter(gender == "F") |>
  afex::aov_ez(id = "id", dv = "value",
               between = "treatment",
               within = c("phase", "hour"))
#> Anova Table (Type 3 tests)
#>
#> Response: value
#>                  Effect     df   MSE        F  ges p.value
#> 1             treatment   2, 5 11.83     2.56 .240    .171
#> 2                 phase  2, 10  5.08   4.64 * .197    .038
#> 3       treatment:phase  4, 10  5.08     2.40 .202    .119
#> 4                  hour  4, 20  2.16 7.66 *** .256   <.001
#> 5        treatment:hour  8, 20  2.16     0.28 .025    .963
#> 6            phase:hour  8, 40  0.97     1.21 .047    .316
#> 7  treatment:phase:hour 16, 40  0.97     0.60 .046    .864
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
#>
#> Sphericity correction method: GG
```
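For comparison, the magrittr pipe can achieve the same by routing the data frame explicitly through the `.` placeholder (a sketch assuming dplyr and afex are installed; when `.` appears as a named argument, magrittr does not additionally prepend the left-hand side):

```r
library(dplyr)  # provides the magrittr pipe %>%

obk.long %>%
  filter(gender == "F") %>%
  afex::aov_ez(id = "id", dv = "value", data = .,
               between = "treatment",
               within = c("phase", "hour"))
```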
The native pipe call works great. Thanks for showing that.
Can you elaborate a bit on which real data-analysis situations make omitting the id variable a tremendous benefit? I know that some example data does not have it, but I feel like real data basically always has it, so it does not seem like an actual problem to me.
Nah, you were right - best practice would be to have an ID column.
Is there a specific reason why `id` is the first argument in `aov_ez`, and `data` is the third? It would make more sense to me if `data` were the first argument, because then it would fit nicely into pipelines like

instead of now