strengejacke / ggeffects

Estimated Marginal Means and Marginal Effects from Regression Models for ggplot2
https://strengejacke.github.io/ggeffects

Colors not showing correctly when specifying "show_data = T" in the plot method of ggpredict object #404

Open lukasla opened 11 months ago

lukasla commented 11 months ago

Hi,

I encountered a strange bug when plotting data with 4 levels (it probably also happens with more). With "show_data = T", the second and/or third color in the vector of specified colors is replaced by some other, arbitrary color. With "show_data = F", everything is as expected.

Here is an example (which statistically doesn't make sense, just to show what I mean):

library(ggeffects)
library(splines)
data(efc)

fit <- lm(barthtot ~ c12hour * c161sex + e42dep, data = efc)

pred <- ggpredict(fit, terms = c("c161sex", "c12hour [4,35,77,168]"))

# here the second and third colors do not match the input colors
plot(pred, show_data = TRUE, color = c("purple", "green", "blue", "red"))

# here the colors are as expected
plot(pred, show_data = FALSE, color = c("purple", "green", "blue", "red"))

Not a major issue, just something I spent some time on before realizing it's actually a bug and not something I did ;-)

Thanks for providing such a great tool! Lukas

strengejacke commented 11 months ago

This one is tricky, indeed. If the second variable in terms is continuous, the raw data may contain many more distinct values than the "grouped" predictions show. In your example, predictions are shown for the c12hour values 4, 35, 77 and 168. The raw data for c12hour, however, contains many more distinct values, so the dots receive a gradient color that depends on how "close" each dot (i.e. each data value) is to the requested values (4, 35, 77 and 168). Your color scale is therefore passed to ggplot2::scale_color_gradient(), and the colors look different from their original color codes. If you don't show data points, no gradient color scale is needed, and the colors match exactly. The same holds when the second term is a categorical variable: since all categories are present in the data, the colors match exactly.

library(ggeffects)
data(efc)
efc <- datawizard::to_factor(efc, c("e42dep", "c161sex"))
fit <- lm(barthtot ~ c161sex * e42dep, data = efc)
pred <- ggpredict(fit, terms = c("c161sex", "e42dep"))

plot(pred, color = c("purple", "green", "blue", "red"))

plot(pred, show_data = TRUE, jitter = TRUE, color = c("purple", "green", "blue", "red"))

Created on 2023-11-21 with reprex v2.0.2
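
As a rough illustration of the gradient effect described above, here is a minimal sketch of how interpolating between the four input colors produces blended colors that are not in the input vector. It uses base R's colorRampPalette() purely as a stand-in. This is an assumption made for illustration and not how ggeffects or ggplot2 build the scale internally; ggplot2 may interpolate in a different color space, so the exact blends can differ.

```r
# Illustration only: colorRampPalette() stands in for a gradient color scale.
# This is NOT the code path ggeffects or ggplot2 use internally.
cols <- c("purple", "green", "blue", "red")

# Seven evenly spaced steps along the gradient. Steps 1, 3, 5 and 7 fall
# exactly on the four input colors; steps 2, 4 and 6 lie between them.
ramp <- grDevices::colorRampPalette(cols)(7)

# For each step, check whether its RGB values equal one of the input colors.
is_input_color <- vapply(
  ramp,
  function(step) {
    # Subtract the step's RGB values from every input color (column-wise).
    diffs <- abs(col2rgb(cols) - c(col2rgb(step)))
    any(colSums(diffs) == 0)
  },
  logical(1)
)

print(data.frame(
  step = seq_along(ramp),
  hex = ramp,
  is_input_color = unname(is_input_color)
))
```

Only the steps that sit exactly on an anchor reproduce an input color. Every value in between gets a blend, which is consistent with raw data points between the requested c12hour values showing colors that were never specified.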