Changing distribution parameter does not change anything when group by is applied

gogonzo commented 2 years ago

When group by is applied distribution parameters seem not to do anything. What I understand it only controls curve representing theoretical distribution which disappears when group_by is applied.

devtools::load_all("~/nest/teal/")
devtools::load_all("~/nest/teal.devel/")
devtools::load_all("~/nest/teal.modules.general/")
library(scda)
ADSL <- synthetic_cdisc_data("latest")$adsl
#'
vars1 <- choices_selected(variable_choices(ADSL, c("ARM", "COUNTRY", "SEX")), selected = NULL)
#'
app <- init(
  data = cdisc_data(
    cdisc_dataset("ADSL", ADSL),
    code = "ADSL <- synthetic_cdisc_data(\"latest\")$adsl",
    check = FALSE
  ),
  modules = root_modules(
    tm_g_distribution(
      dist_var = data_extract_spec(
        dataname = "ADSL",
        select = select_spec(
          choices = variable_choices(ADSL, c("AGE", "BMRKR1")),
          selected = "BMRKR1",
          multiple = FALSE,
          fixed = FALSE
        )
      ),
      strata_var = data_extract_spec(
        dataname = "ADSL",
        filter = filter_spec(
          vars = vars1,
          multiple = TRUE
        )
      ),
      group_var = data_extract_spec(
        dataname = "ADSL",
        filter = filter_spec(
          vars = vars1,
          multiple = TRUE
        )
      )
    )
  )
)
shinyApp(app$ui, app$server)

denisovan31415 commented 2 years ago

it does change the x-axis range:

Screen Shot 2022-01-24 at 11 53 03 AM

denisovan31415 commented 2 years ago

@Polkas can you confirm?

Polkas commented 2 years ago

For some distributions changes in parameters are neutral for quantile-quantile plot. if you change the type for lognormal then it will be still neutral for mean updates where as when we update the sd the plot will be updated (truly).

gogonzo commented 2 years ago

@Polkas Hmmm, not sure if it should be the same. If You change the distribution which you'd like to compare to it means that the shape and shift changes - so the QQ plot will not look the same each time.

Consider simple example:

We have a fitted theoretical distribution with mean and sd same as BMRKR1 (grey)
And we have a custom theoretical distribution with sd=1 (blue)

Above is an output from th_g_distribution and in my opinion it's bad. Why? Look at the comparison between data and blue line and QQ plot. The data and blue line does not follow similar distribution shape, so QQ plot should look way different. And with this axis change it's even worse as it's neither fitted nor custom.

In my opinion the QQ plot should be either/or:

empirical vs fitted theoretical (grey) and distribution parameters should be hidden as fitted depends on a data. And custom parameters shouldn't change the scale because why would they if it's fitted.
empirical vs custom theoretical (blue) and parameters should be shown.

This is how I see QQ plot data vs custom theoretical

library(scda)
library(ggplot2)
library(ggmosaic)
ADSL <- synthetic_cdisc_data("latest")$adsl

true_params <- data.frame(mean = mean(ADSL$BMRKR1), sd = mean(ADSL$BMRKR1))
theoretical_params <- true_params

ggplot(ADSL, aes_string(sample = "BMRKR1")) +
  stat_qq(distribution = "qnorm", dparams = true_params) +
  stat_qq_line(distribution = "qnorm", dparams = theoretical_params) +
  ggplot2::labs(x = "theoretical", y = "sample") +
  ggplot2::theme_gray()

theoretical_params$sd <- 10
ggplot(ADSL, aes_string(sample = "BMRKR1")) +
  stat_qq(distribution = "qnorm", dparams = true_params) +
  stat_qq_line(distribution = "qnorm", dparams = theoretical_params) +
  ggplot2::labs(x = "theoretical", y = "sample") +
  ggplot2::theme_gray()

Polkas commented 2 years ago

I will start with defining the main objective for qqplot here, which is to identify if the plot follows certain distribution. In this case we are mostly not interested about parameters of the distribution. For most of the distributions updating the theoretical parameter will not influence the view of qqplot. On the other hand the density plot is used to inspect these parameters. In the current state we only want to show how the sample is looking when comparing to some other distribution, e.g. inspect the tails.

The input for parameters (interactive input panel) was created as we do not know true_params. The input is inherited by qqplot and which is not a main consumer. People could try to manipulate the starting values provided by MASS::fitdistr, for gaussian distribution simple mean and sd of sample. Sample estimators like mean is not a true value of the parameter, till you not have the whole population.

Your decision to use constant sample estimators for qqplot and give people space ONLY TO UPDATE THE LINE is not an obvious attitude. SO people in your case will test how some custom parameters setup is close to the sample estimates. Where as people do not know the true parameters and want to find them. And as i said qqplot is not design here for it.

Could you provide some use case scenario of your proposition?

gogonzo commented 2 years ago

I will start with defining the main objective for qqplot here, which is to identify if the plot follows certain distribution. In this case we are mostly not interested about parameters of the distribution.

If not interested then parameters when switching to QQ plot are not needed and also not needed at all when group by is applied.

For most of the distributions updating the theoretical parameter will not influence the view of qqplot.

Not true, when you change the parameters of the theoretical distribution QQ plot changes.

The input for parameters (interactive input panel) was created as we do not know true_params. The input is inherited by qqplot and which is not a main consumer.

Could you provide some use case scenario of your proposition?

I gave two propositions and I think 1-st is more relevant indeed. I made (2) just to show how changing distribution parameters changes the look of the QQ plot.

empirical vs fitted theoretical (grey) and distribution parameters should be hidden as fitted depends on a data. And custom parameters shouldn't change the scale because why would they if it's fitted.
empirical vs custom theoretical (blue) and parameters should be shown.

Polkas commented 2 years ago

So then precisely about "empirical vs fitted theoretical (grey) and distribution parameters should be hidden as fitted depends on a data. And custom parameters shouldn't change the scale because why would they if it's fitted."

This is not true for lognormal distribution (sd parameter) and gamma (shape parameter). Thus this input panel is still valuable although not for all scenarios.

The UI should be as stable if possible to not brings bad UX. Moreover we should assume our end users know what they do.

Something I could accept/conider will be to use disable option for some parameters inputs under specific distribution and TAB (qqplot)

gogonzo commented 2 years ago

Ok, I'm not forcing to address or consider this in UAT, we can postpone this but I'm keeping my opinion that there is something to change. My issues is:
We have parameters to change the custom theoretical arguments in QQ plot which change only x-axis scale keeping reference line the same. This is misleading because I think (and someone else might also) that data follows custom distribution which shape and position might be totally different than data. Either scale or line should be fixed.

library(scda)

ADSL <- synthetic_cdisc_data("latest")$adsl
sample <- ADSL$BMRKR1

plot(
  qnorm(p = seq(0, 1, length.out = length(sample)), 
        mean = mean(sample), 
        sd = sd(sample)),
  sort(sample)
);abline(a = 0, b = 1)

plot(
  qnorm(
    p = seq(0, 1, length.out = length(sample)),
    mean = mean(sample) + 2,
    sd = sd(sample)
  ),
  sort(sample)
)
abline(a = 0, b = 1)

plot(
  qnorm(
    p = seq(0, 1, length.out = length(sample)),
    mean = mean(sample) + 2,
    sd = sd(sample) + 1
  ),
  sort(sample)
); abline(a = 0, b = 1)

Light Dark Dark

I'm reopening the issue with discussion label

insightsengineering / teal.modules.general

Changing distribution parameter does not change anything when group by is applied #322