larmarange / ggstats

Extension to ggplot2 for plotting stats
https://larmarange.github.io/ggstats/
GNU General Public License v3.0

Term order #63

Closed talgatomarov closed 4 months ago

talgatomarov commented 5 months ago

Thank you for your efforts on this package.

I am using ggcoef_table to visualize coefficients from a survival::coxph model. It works great. However, I've noticed that categorical terms seem to be sorted based on their string values rather than their factor levels. For example, when I specify a factor variable with levels=c("0", "1", ">=2"), the terms are displayed in this order: "1", "0", ">=2". Is there a way to enforce the same order as in the factor levels?

I found a temporary workaround: specifying categorical_terms_pattern="level={level_rank}; {level}" puts the terms in the correct order. However, it does not seem like a clean solution.

Thank you.

larmarange commented 5 months ago

Would you have a reproducible example?

Did you check that you are using the latest versions of the ggstats and broom.helpers packages?

talgatomarov commented 5 months ago

I am using ggstats=0.6.0, survival=3.5-7, and broom.helpers=1.15.0.

I've noticed that the issue occurs only when I am using a lot of predictor variables. Below, I generated some dummy data.

library(survival)
library(ggstats)

set.seed(15)

factor_values = c("0", "1", "2-3", "4-5", ">5")

data = as.data.frame(
    list(
        event=sample(c(0,1), replace=TRUE, size=100),
        time=sample(c(5, 10, 12, 50), replace=TRUE, size=100),
        factor_var1=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var2=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var3=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var4=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var5=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var6=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var7=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var8=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var9=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values),
        factor_var10=factor(sample(factor_values, replace=TRUE, size=100), levels=factor_values)
    )
)

model = survival::coxph(
    formula=Surv(time, event) ~ .,
    data=data
)

options(repr.plot.width=12, repr.plot.height=12)
ggcoef_table(
    model, 
    exponentiate=TRUE
)

This is my output when I include 10 factor variables:

[image: example]

This is my output when I include 9 factor variables (code not shown):

[image: example2]

larmarange commented 4 months ago

OK, I understand your issue better now. The problem comes from the fact that several variables share the same levels.

To investigate further, you could call ggcoef_model() with return_data = TRUE. You will see the dataset used by ggcoef_model() and ggcoef_table() to generate the plot. By default, the variable mapped to the y axis is "label", and "var_label" is used to facet the plot by variable. "label" is transformed into a factor with forcats::fct_inorder(), so its levels follow the order of first appearance across the whole dataset, not within each variable. When some terms are used by one variable but not by another, the order is therefore sometimes not preserved.
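The mechanism described above can be sketched in base R without fitting any model. This is only an illustration under stated assumptions: the small `terms` table is hypothetical (it is not the real output of return_data = TRUE), and `factor(x, levels = unique(x))` is used as a base-R stand-in for forcats::fct_inorder(), which also orders levels by first appearance.

```r
# Hypothetical term table: two variables whose level labels overlap.
# Variable "A" only uses three of the five labels that "B" uses.
terms <- data.frame(
  var_label = c(rep("A", 3), rep("B", 5)),
  label = c("0", "2-3", ">5", "0", "1", "2-3", "4-5", ">5"),
  stringsAsFactors = FALSE
)

# Stand-in for forcats::fct_inorder(): levels follow the order of first
# appearance over the WHOLE column, not within each variable.
terms$label_fct <- factor(terms$label, levels = unique(terms$label))
global_levels <- levels(terms$label_fct)
print(global_levels)

# Order in which B's rows would be placed on a shared discrete scale
# (ignoring any axis reversal applied by the plot).
b <- terms[terms$var_label == "B", ]
b_plot_order <- as.character(b$label_fct[order(b$label_fct)])
print(b_plot_order)
```

Because "1" and "4-5" first appear only in variable B, they are appended after the three labels already seen in A, so B's facet no longer follows its own factor-level order. Inspecting the "label" and "var_label" columns of the data returned with return_data = TRUE should show whether the same thing happens for a real model.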