wobbrock opened 1 week ago
We thank Prof. Higgins for his commentary on our article. We clarify that the interaction interpretation issues (extensively discussed in our paper) are not the cause of ART’s clear failure in the example shown in Figure 1, where ART falsely detects (with very high confidence) both a main effect and an interaction effect that do not exist.
We also explain why the simulation model used by Prof. Higgins and his students in their early studies of ART is overly simplistic: it fails to reproduce distributions observed in real data and thereby masks ART's fundamental flaws. "Despite decades of its study and analysis by statisticians" (to quote Prof. Wobbrock's comment), ART's critical issues, as demonstrated by our experiments, have been largely overlooked. ART's alignment procedure assumes a very specific model structure, so when the generated data deviate from this model, the method fails. We encourage Prof. Higgins and Prof. Wobbrock (@wobbrock) to conduct their own experiments to verify our findings using their more recent simulation methodology, as described by Elkin et al. (2021). We suggest they extend their experiments to include strong non-null effects on secondary factors, and to account for ordinal data and discrete distributions.
We further discuss the interaction interpretation issues raised by Prof. Higgins, although our article already discusses them in depth. But we emphasize that those are irrelevant to the specific example in Figure 1. We conclude with our suggestions for revising our article to incorporate insights from this discussion.
In the article, we state: "Time performances have been randomly sampled from a population in which: (1) Difficulty has a large effect; (2) Technique has no effect; and (3) there is no interaction effect between the two factors." Specifically, we used the simulation methodology described in Section 4, setting $a_1 = 2$, $a_2 = 0$, and $a_{12} = 0$ (see Equation 2, where $a_1$, $a_2$, and $a_{12}$ represent the magnitudes of effect for Difficulty, Technique, and their interaction, respectively).
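To make this setup concrete, here is a minimal sketch of such a data-generation process. It is only our illustration of the idea behind Equation 2, not the article's actual code: the level codings, error scale, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative effect magnitudes, as in the Figure 1 example:
# a large Difficulty effect, no Technique effect, no interaction.
a1, a2, a12 = 2.0, 0.0, 0.0

def sample_times(difficulty, technique, n=20):
    """Draw n observations for one cell; factor levels are coded in [-0.5, 0.5].

    Effects act additively in a latent normal space and the response is
    exponentiated, yielding log-normal (positive, right-skewed) times.
    """
    latent = (a1 * difficulty + a2 * technique
              + a12 * difficulty * technique
              + rng.normal(0.0, 1.0, size=n))
    return np.exp(latent)
```

For example, `sample_times(difficulty=0.5, technique=-0.5)` draws one cell of the design; because `a2` and `a12` are zero, the population distribution is identical across techniques within a difficulty level.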
The fact that there is no main effect of Technique and no interaction effect in the population is crucial for understanding why Prof. Higgins' argumentation is misleading. We address both main and interaction effects.
Prof. Higgins does not comment on the fact that ART detects a main effect of Technique with an extremely low p-value. The discrepancy with the results of the other methods is huge and cannot be attributed to non-linear data transformations and interpretation issues. As we explain in the paper, monotonic transformations do not alter the interpretation of the null hypothesis in main effects. Unfortunately, as we demonstrate in Figure 2, ART is not a monotonic transformation, since it does not preserve the original order of data values.
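The non-monotonicity of alignment is easy to demonstrate on toy data. The function below is our reconstruction of the standard alignment step for a main effect (residual plus estimated effect of interest); the data are chosen purely for illustration:

```python
import numpy as np

def align_for_A(y, a, b):
    """Align responses for the main effect of factor A in a two-factor design.

    Aligned response = (Y - cell mean) + (level-of-A mean - grand mean).
    This is our sketch of the usual ART alignment formula.
    """
    y, a, b = map(np.asarray, (y, a, b))
    grand = y.mean()
    cell = np.array([y[(a == ai) & (b == bi)].mean() for ai, bi in zip(a, b)])
    row = np.array([y[a == ai].mean() for ai in a])
    return (y - cell) + (row - grand)

# Toy 2x2 data: two cells of A's first level with very different means.
y = np.array([10.0, 12.0, 1.0, 3.0, 5.0, 6.0, 7.0, 8.0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = np.array([0, 0, 1, 1, 0, 0, 1, 1])
aligned = align_for_A(y, a, b)

# y[3] = 3 is smaller than y[0] = 10, yet their aligned values flip order:
# alignment is data-dependent and does not preserve the original ordering.
assert y[3] < y[0] and aligned[3] > aligned[0]
```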
Isn't it reasonable to say here that ART clearly fails? Notice that this is not an isolated case. If we repeatedly generate data from such populations for this experimental design, we observe the Type I error rates shown in Figure 10 (Log-normal). Sadly, ART's failure is systematic.
Prof. Higgins argues that the discrepancy between the p-values obtained with ART and the three other transformation methods (LOG, RNK, and INT) is due to the non-linear transformations that those methods apply. He disregards, however, the fact that the data for Figure 1 were sampled from a population with a main effect on a single factor (Difficulty) and no interaction. It is easy to show that in this case, a monotonic (even non-linear) transformation does not affect the definition of the null hypothesis for the interaction: there will be an interaction effect in the transformed data if and only if there is one in the non-transformed data. In other words, the "definition of interaction" is not an issue here.
Interaction interpretation issues would only emerge if a strong effect appeared on both factors. Contrary to what Prof. Higgins states ("... they misunderstood what interaction is fundamentally about"), our paper provides an extensive analysis of interaction interpretation issues. But those issues are irrelevant to the example of Figure 1, and thus cannot explain the discrepancy between the results of ART and the results of the other transformation methods. We had carefully chosen our illustrative example to avoid such misunderstandings.
Prof. Higgins believes that "there IS interaction in the data", but this is not the case, simply because there was no interaction effect in the original population. The observed patterns are random fluctuations. If we repeat the data generation process multiple times, we observe that there is no winning technique, regardless of which difficulty level we focus on. The lower time observed for Technique C (in red) under Level4 is a random result that can be easily explained by the high variability of the data points. The following figure presents means from six different samples from the same population, where we clearly see that the interaction trends are completely random:
If we now generate a large number of samples and use the various methods to test the interaction, we will observe the Type I error rate trends shown in our Figure 12 (Log-normal). ART is the only method that inflates error rates. There is no ambiguity on how we define interactions in this case. Once again, ART simply fails.
Prof. Higgins' clarifications help explain why older simulation studies failed to reveal ART's problems with heavy-tailed distributions. As Higgins explains, his simulation studies used the model:
$Y = \mu + a_i + b_j + ab_{ij} + error$
where non-normal distributions were simply applied to the error term. Mansouri and Chang (1995) appear to have used the same approach. Their proof (see their Section 2) assumes this very specific model structure (as well as continuous distributions). Unfortunately, this approach produces naive distributions for the response variable $Y$ that bear little resemblance to the actual data generation processes commonly observed in behavioral sciences and HCI.
Suppose we studied a selection task, measuring the time participants needed to acquire a target using four different input techniques. Higgins' simulation model would generate distribution shapes for $Y$ (time in seconds here) similar to those shown below:
We observe that the distribution of slower techniques is simply shifted to the right, while their shape conveniently remains identical. We agree that ART will behave correctly in this scenario. However, how realistic and interesting are these distributions?
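The shape invariance of this Higgins-style model can be made explicit in a few lines. The sketch below is our rendering of the model described above, with hypothetical technique offsets; non-normality enters only through the error term, so each technique sees the same distribution merely shifted to the right:

```python
import numpy as np

rng = np.random.default_rng(0)

# Y = mu + a_i + error, with a log-normal error term.
error = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
offsets = [0.0, 1.0, 2.0, 3.0]          # hypothetical technique effects
samples = [mu + error for mu in offsets]

# Only the location changes: variances (and all higher central moments)
# are identical across techniques.
variances = [s.var() for s in samples]
means = [s.mean() for s in samples]
```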
Heavy-tailed distributions, such as the log-normal distribution here, commonly arise in nature when measurements cannot be negative or fall below a certain threshold (Limpert et al., 2001), e.g., the minimum time needed to react to a visual stimulus. In most experimental research, however, this threshold does not shift across conditions while preserving the distribution’s shape. Instead, distributions are more likely to resemble the following:
In these distributions, the mean for slower techniques also increases, but this increase is not reflected as a simple global shift in the distribution. Instead, the overall shape of the distribution changes. The model for these distributions is structured as follows:
$\log(Y - \theta) = \mu + a_i + b_j + ab_{ij} + error$
where the error term is normally distributed, and $\theta$ represents a threshold below which response values cannot occur. Although this threshold may not be zero in certain experimental scenarios (Wagenmakers et al., 2007), we simplify our analysis in the article by setting $\theta = 0$. And we show that ART inflates Type I error rates for such distributions, as it tends to confound effects.
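With $\theta = 0$, the contrast with the shift-only model is easy to see in simulation. This is an illustrative sketch with made-up effect values, not the article's code; effects now act multiplicatively on $Y$, so the spread grows with the mean:

```python
import numpy as np

rng = np.random.default_rng(1)

# log(Y) = mu + a_i + error, with normal latent error and theta = 0.
effects = [0.0, 0.7, 1.4]               # hypothetical condition effects
latent_error = rng.normal(0.0, 1.0, size=100_000)
samples = [np.exp(a + latent_error) for a in effects]

means = [s.mean() for s in samples]
sds = [s.std() for s in samples]
# The coefficient of variation sd/mean stays constant
# (sqrt(exp(sigma^2) - 1) in the population), so the standard deviation
# rises linearly with the mean instead of the distribution merely shifting.
```

This linear mean-sd relationship is exactly the pattern Wagenmakers et al. (2007) report for response times.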
Our readers may note that Elkin et al. (2021) diverged from the modeling approach described by Prof. Higgins. Their methodology (illustrated by their example in Section 3 and their data generation procedures in Section 5.1) aligns closely with our own approach. So why did Elkin et al. (2021) fail to identify the issues we raise in our article? There are two reasons:
Their experiments assessed Type I error rates only in scenarios where main effects on all factors were null. Consequently, the authors couldn’t observe how ART confounds effects under skewed distributions.
They tested only continuous distributions, missing ART's failure to handle discrete distributions, such as binomial distributions and Likert-type data. Prof. Wobbrock and Prof. Higgins comment neither on our results on ordinal and discrete distributions, nor on the limited past evaluations of ART on such data. We found no prior assessments of ART on binomial distributions, and Lüpsen's (2017) warnings about ART's failure with discrete scales of few levels have been largely overlooked by HCI researchers, who frequently use ART specifically for this type of data.
Furthermore, Elkin et al. (2021) compared ART only to the t-test. Had they included INT (or even RNK) in their evaluations, they would have found that this much simpler method has greater power than ART in the scenarios they tested.
As Myers et al. (2012) explain, "if the underlying distribution of the response variable is not normal, but is a continuous, skewed distribution, such as the lognormal, gamma, or Weibull distribution, we often find that the constant variance assumption is violated," and in such cases, "the variance is a function of the mean" (Pages 54-55). This pattern frequently occurs in studies measuring task-completion times, whether the task is visual, motor, or cognitive. As tasks become more difficult and prolonged, variance tends to increase. Similarly, slower participants generally exhibit greater variance across tasks compared to faster, more practiced users. In our article, we reference Wagenmakers et al. (2007), who conducted an analysis of nine independent experiments and demonstrated that the standard deviation of response times is proportional to the mean.
Our modeling approach, as described earlier, aligns with these observations. Even when variances are equal in the latent space, the standard deviations of observed log-normal distributions increase linearly with the mean. This raises an important question though: Does ART’s struggle with continuous skewed distributions stem solely from heteroscedasticity issues?
For example, suppose the time distributions for the input techniques in our previous example had the following shapes:
While these distributions appear similar to those in the previous figure, all variances are now equal. Would ART perform correctly under these conditions? The answer is no, as we will now discuss.
Developing a unified simulation approach for mixed-effects models is more challenging with such distributions, as it is not straightforward to adapt our data generation method to ensure equal variances in the response distributions. However, we conducted additional experiments (not included in the paper or appendix) specifically for log-normal distributions, where we directly controlled the population parameters of the observed distributions (rather than those of the latent variable) to ensure constant variances.
In a 4x3 repeated measures design, with a strong effect on the first factor, no effect on the second factor, and no interaction effect, we again observed that ART inflates Type I error rates for both the main effect and the interaction effect. The issue becomes more serious as the common variance increases. In contrast, INT and RNK do not exhibit such problems.
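One way to control the observed populations directly is to solve for the latent log-normal parameters that yield a chosen mean with a common variance. This is our reconstruction of such a procedure (the authors' exact method may differ), with illustrative target means:

```python
import numpy as np

def lognormal_params(mean, var):
    """Latent (mu, sigma) of a log-normal with the given observed mean/variance.

    Inverts mean = exp(mu + sigma^2/2) and
    var = (exp(sigma^2) - 1) * exp(2*mu + sigma^2).
    """
    sigma2 = np.log(1.0 + var / mean**2)
    return np.log(mean) - sigma2 / 2.0, np.sqrt(sigma2)

# Example: four condition means, one variance shared by all conditions.
target_means = [1.0, 2.0, 3.0, 4.0]   # hypothetical condition means
common_var = 1.5                      # common observed variance

params = [lognormal_params(m, common_var) for m in target_means]
# rng.lognormal(mu, sigma, size=n) then draws samples whose population
# mean is the target and whose population variance equals common_var.
```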
As an example, we provide a dataset generated with this process. The data look as follows:
Once again, ART is the only method to find strong evidence for a Technique effect, with $p = .00073$ (compared to $p = .23$ using INT), and some weaker evidence for an interaction effect, with $p = .014$ (compared to $p = .21$ using INT).
Thus, heteroscedasticity alone cannot explain ART's failures. ART’s alignment mechanism is constructed for a very specific model structure (see Prof. Higgins’ model). When the data generation process deviates from this structure, the method breaks down.
Prof. Higgins presents an example to illustrate how rank transformations distort interactions when both factors have effects. This example closely resembles our example in Section 3 (see Figure 7), where we extensively discuss how non-linear transformations can fundamentally alter the shape of interactions.
However, focusing solely on the issues with RNK, which we also visually demonstrate in Figure 5, is like not seeing the forest for the trees. As we state in the article, "non-linear transformations come into play in various ways in experimental designs." We reference Loftus (1978) and Wagenmakers et al. (2012), who observe that researchers are largely unaware that many reported interactions fail to provide meaningful insights into the underlying phenomena of interest. It is worth noting that much of this work originates from psychologists, who aim to model latent psychological processes rather than merely abstract numerical relationships.
Consider the following model:
$f(Y) = \mu + a_i + b_j + ab_{ij} + error$
where $f$ is a monotonic but not necessarily a linear function. We discussed earlier that ART fails for such models for reasons unrelated to interaction interpretation issues. But the question remains: How should interaction effects be defined in this context? Should they be based on the interaction patterns observed in the responses, or on whether $ab_{ij}$ is zero?
Prof. Higgins suggests that the correct interpretation is the former, but this assumption cannot be taken for granted. How meaningful is it to declare interaction effects that arise solely from parallel main effects and lack any theoretical interpretation? More sophisticated analysis methods, such as those based on generalized linear models or Bayesian modeling, aim to estimate the parameters of a model that best describes the data generation process. In this case, an interaction effect would be declared if there was sufficient evidence to show that $ab_{ij}$ is non-zero. For instance, see our analyses for ordinal data, where the goal is to make inferences based on the parameters of a continuous latent psychological variable (Liddell and Kruschke, 2018).
When $f$ is continuous, RNK and INT are not affected by its presence on the above model and will still interpret interactions based on $ab_{ij}$. However, RNK, and to a lesser degree INT, have other issues when strong parallel main effects appear. These issues are discussed and demonstrated in our results (e.g., see the Type I error rates for these methods in Figure 13). Furthermore, if $f$ is not continuous --- e.g., when it leads to ordinal scales with few items --- the above interpretation of interaction breaks down for these methods as well (see our example in Figure 7, right).
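The invariance of ranks under a continuous monotonic $f$, and its breakdown under discretization, can be checked directly. The helper below is a minimal rank transform for illustration only, not the actual RNK or INT implementations:

```python
import numpy as np

rng = np.random.default_rng(3)

def ranks(x):
    """Simple rank transform (1 = smallest); assumes no ties."""
    r = np.empty(len(x), dtype=int)
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

y = rng.normal(size=50)
# A strictly increasing f (here exp) leaves the ranks, and hence RNK and
# INT, unchanged...
assert np.array_equal(ranks(y), ranks(np.exp(y)))

# ...but a discretizing f, e.g. collapsing responses onto a 3-point
# ordinal scale, creates ties and destroys this invariance.
coarse = np.digitize(y, [-0.5, 0.5])
```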
The existing literature does not clarify how ART is expected to interpret interactions in these scenarios. Early evaluations of ART did not consider such models (see above), so the question was not even posed. However, Elkin et al. (2021), who do study such models, use log-transformed values as their baseline for log-normal distributions (see their example on Page 4). It is clear that Elkin et al. assume, as we do, that ART's conclusions should align with the parameters of the latent variable. Unfortunately, their evaluations focus on contrasts and do not account for parallel effects, avoiding interaction interpretation issues.
In summary, we believe that defining the null hypothesis for interactions as we have done is the most reasonable approach. Comparisons using an alternative method, where interactions are defined based on observed interaction patterns rather than the parameters of the data generation model, are not even feasible for such models. How would one define the ground truth in terms of the parameters of the models? And why would such an approach make sense?
We caution readers that our results for interactions in the presence of parallel main effects require careful interpretation due to these interaction interpretation issues; see, for example, our detailed discussion following the results in Figure 13. However, our findings also demonstrate that assessing interactions in the presence of strong main effects is generally problematic, not just for ART. Please refer to our recommendations in Section 7 for guidance on addressing these issues.
The current version of our article does not clearly explain why our results contradict those of earlier evaluations of ART. We acknowledge that the earlier findings on continuous skewed distributions were puzzling to us until Prof. Higgins' commentary helped clarify this discrepancy. To address these new insights, we propose the following revisions:
In Section 2 (Background), we will discuss the model assumptions of previous simulation studies and highlight their shortcomings. We will also clarify how our experimental approach differs from these studies.
In Section 3 (Interpreting Effects), we will incorporate elements of our analysis above regarding interaction interpretation issues.
Our Abstract and Conclusions currently state that "ART operates as expected only under normal distributions with equal variances." We will revise these statements to clarify that ART functions correctly only when non-normal data are generated using a highly specific model structure, which fails to handle a broad range of real-world data distributions.
We will include a new experimental section in the Appendix to present additional results on log-normal distributions with constant variance (see our earlier discussion on heteroscedasticity issues). We will discuss these additional findings in the article.
Finally, we encourage the authors of ART to conduct their own experiments to verify our results. ART is used for the analysis of hundreds of studies every year, so it is crucial to promptly inform researchers about the risks associated with this method. ART was intended to correct specific issues with RNK but introduced a new set of more serious problems.
Elkin, Lisa A., Matthew Kay, James J. Higgins, and Jacob O. Wobbrock. 2021. “An Aligned Rank Transform Procedure for Multifactor Contrast Tests.” In The 34th Annual ACM Symposium on User Interface Software and Technology, 754–68. UIST ’21. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3472749.3474784.
Higgins, James J., R. Clifford Blair, and Suleiman Tashtoush. 1990. “The Aligned Rank Transform Procedure.” In Conference on Applied Statistics in Agriculture. https://doi.org/10.4148/2475-7772.1443.
Liddell, Torrin M., and John K. Kruschke. 2018. “Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?” Journal of Experimental Social Psychology 79: 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.
Limpert, Eckhard, Werner A. Stahel, and Markus Abbt. 2001. “Log-normal Distributions across the Sciences: Keys and Clues.” BioScience 51 (5): 341–52. https://doi.org/10.1641/0006-3568(2001)051[0341:LNDATS]2.0.CO;2.
Loftus, Geoffrey R. 1978. “On Interpretation of Interactions.” Memory & Cognition 6 (3): 312–19. https://doi.org/10.3758/BF03197461.
Lüpsen, Haiko. 2017. “The Aligned Rank Transform and Discrete Variables: A Warning.” Communications in Statistics - Simulation and Computation 46 (9): 6923–36. https://doi.org/10.1080/03610918.2016.1217014.
Mansouri, H., and G.-H. Chang. 1995. “A Comparative Study of Some Rank Tests for Interaction.” Computational Statistics & Data Analysis 19 (1): 85–96. https://doi.org/10.1016/0167-9473(93)E0045-6.
Myers, Raymond H., Douglas C. Montgomery, G. Geoffrey Vining, et al. 2012. Generalized Linear Models: With Applications in Engineering and the Sciences. 2nd ed. John Wiley & Sons. https://doi.org/10.1002/9780470556986.
Wagenmakers, Eric-Jan, and Scott Brown. 2007. “On the Linear Relation Between the Mean and the Standard Deviation of a Response Time Distribution.” Psychological Review 114 (3): 830–41. https://doi.org/10.1037/0033-295X.114.3.830.
Wagenmakers, Eric-Jan, Angelos-Miltiadis Krypotos, Amy H. Criss, and Geoff Iverson. 2012. “On the Interpretation of Removable Interactions: A Survey of the Field 33 Years After Loftus.” Memory & Cognition 40 (2): 145–60. https://doi.org/10.3758/s13421-011-0158-0.
Conflicts of interest
Reviewed version: 8ada861
Review
I corresponded with one of ART's original statistician authors, James J. Higgins, to see what he'd make of this submission. He sent the attached comments, which pretty swiftly show the authors have made some fundamental errors in their approach. With Higgins' permission, I include his comments here:
Higgins - ART rebuttal.pdf
Openness/Transparency
The paper provides data and is adequate for others to reproduce analyses.
Suggested outcome
Reject: this paper cannot be fixed to the point where I would endorse it.
Requested changes
There really aren't changes that can salvage this paper because the analysis is flawed. The authors believe they have found ART to be essentially incorrect, despite decades of its study and analysis by statisticians, both empirically and theoretically. Perhaps not surprisingly, as Higgins' brief review shows, the authors misunderstand interaction in a way that causes their critique to fail.
ORCID
0000-0003-3675-5491