rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/

better support for quantile regression? #458

Closed yuryzablotski closed 7 months ago

yuryzablotski commented 7 months ago

Support for quantile regression is only possible for tau = 0.5 (median regression); however, the whole point of quantile regression is to be able to model other quantiles:

This works:

library(tidyverse)
d <- mtcars %>% mutate(cyl = factor(cyl))
library(quantreg)
qm <- rq(mpg ~ cyl, d, tau = 0.5)
library(emmeans)
emmeans(qm, pairwise ~ cyl, type = "response")

This doesn't work:

qm10 <- rq(mpg ~ cyl, d, tau = 0.1)
emmeans(qm10, pairwise ~ cyl, type = "response")

Thanks in advance!

rvlenth commented 7 months ago

The design is basically that you have to specify tau unless it is 0.5

> emmeans(qm10, pairwise ~ cyl, tau = 0.10, type = "response")
$emmeans
 cyl emmean    SE df lower.CL upper.CL
 4     21.5 0.083 29    21.33     21.7
 6     17.8 0.312 29    17.16     18.4
 8     10.4 2.134 29     6.04     14.8

Confidence level used: 0.95 

$contrasts
 contrast    estimate    SE df t.ratio p.value
 cyl4 - cyl6      3.7 0.323 29  11.456  <.0001
 cyl4 - cyl8     11.1 2.135 29   5.199  <.0001
 cyl6 - cyl8      7.4 2.156 29   3.432  0.0050

P value adjustment: tukey method for comparing a family of 3 estimates 

Maybe this isn't the best design, but it is documented.

rvlenth commented 7 months ago

I do agree it's better to change the default to tau = object$tau[1], so that will be the case in the next update:

> emmeans(qm10, "cyl")
 cyl emmean    SE df lower.CL upper.CL
 4     21.5 0.083 29    21.33     21.7
 6     17.8 0.312 29    17.16     18.4
 8     10.4 2.134 29     6.04     14.8

Confidence level used: 0.95 
yuryzablotski commented 7 months ago

That's fantastic. I did not know I could specify tau. That already helps enormously. Looking forward to every update of emmeans! Thanks a lot!

rvlenth commented 7 months ago

I decided to go the extra mile and get this to work with models where we specify more than one tau, treating it like a multivariate model.

> mod = rq(mpg ~ cyl, tau = c(.25, .5, .75), data = d)
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique

> emmeans(mod, ~ cyl | tau)
tau = 0.25:
 cyl emmean    SE df lower.CL upper.CL
 4     22.8 1.417 29     19.9     25.7
 6     18.1 0.734 29     16.6     19.6
 8     14.3 1.310 29     11.6     17.0

tau = 0.50:
 cyl emmean    SE df lower.CL upper.CL
 4     26.0 1.872 29     22.2     29.8
 6     19.7 0.895 29     17.9     21.5
 8     15.2 0.873 29     13.4     17.0

tau = 0.75:
 cyl emmean    SE df lower.CL upper.CL
 4     30.4 2.433 29     25.4     35.4
 6     21.0 0.656 29     19.7     22.3
 8     16.4 1.092 29     14.2     18.6

Confidence level used: 0.95 

An interesting issue here is that we cannot get covariances between different values of tau, making it impossible to estimate SEs of contrasts involving different tau values

> contrast(.Last.value, "consec", by = NULL)
 contrast                    estimate   SE df t.ratio p.value
 cyl6 tau0.25 - cyl4 tau0.25     -4.7 1.60 29  -2.946      NA
 cyl8 tau0.25 - cyl6 tau0.25     -3.8 1.50 29  -2.530      NA
 cyl4 tau0.5 - cyl8 tau0.25      11.7   NA 29      NA      NA
 cyl6 tau0.5 - cyl4 tau0.5       -6.3 2.08 29  -3.036      NA
 cyl8 tau0.5 - cyl6 tau0.5       -4.5 1.25 29  -3.598      NA
 cyl4 tau0.75 - cyl8 tau0.5      15.2   NA 29      NA      NA
 cyl6 tau0.75 - cyl4 tau0.75     -9.4 2.52 29  -3.730      NA
 cyl8 tau0.75 - cyl6 tau0.75     -4.6 1.27 29  -3.610      NA

P value adjustment: mvt method for 6 tests 
Warning message:
In cov2cor(vcov(object)) :
  diag(.) had 0 or NA entries; non-finite result is doubtful

Obtaining NAs for the SE is deliberate, but this reveals a different issue with the mvt adjustment. If we'd asked for a different adjustment, we'd have gotten P values where they exist. Alas, always something more to work on.
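
For instance, here is a minimal sketch using the same mod and contrasts as above; with no multiplicity adjustment, the within-tau contrasts should then report P values and only the cross-tau ones stay NA:

# Sketch: request no adjustment so P values are shown wherever SEs exist;
# contrasts that cross tau values still come back as NA
contrast(emmeans(mod, ~ cyl | tau), "consec", by = NULL, adjust = "none")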

yuryzablotski commented 7 months ago

Treating tau as an element of an interaction is an elegant solution. I think emmeans does the same for multinomial models, right? Well, the contrasts and P values are very valuable for medical science, the field I work in. Thus, getting them one day for quantile regression would be very useful. By the way, I also unfortunately wasn't able to get the interactions with quantiles other than 0.5 working. Here is the error message:

emmeans(qm90, pairwise ~ cyl|am, type = "response", tau = 0.9)

"Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, : singular matrix in 'backsolve'. First zero in diagonal [3] In addition: Warning message: In summary.rq(object, covariance = TRUE, ...) : 12 non-positive fis"

If possible, having interactions working for different quantiles would also be magnificent and very useful.

Thanks a lot for your efforts!

yuryzablotski commented 7 months ago

One more thought if I may: the contrasts between different tau values may not be as important as the contrasts between categories at one particular tau. So, if every tau could be seen as a separate model (which they are) and we would only get contrasts for predictors independently of other taus, that would actually be enough. The contrasts between tau values would be a nice bonus, when it works one day.
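
A minimal sketch of that per-tau comparison, using the multi-tau model mod fit above (contrasts are formed within each tau, so no cross-tau covariances are needed):

# Sketch: pairwise cyl comparisons computed separately within each tau
emmeans(mod, pairwise ~ cyl | tau)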

rvlenth commented 7 months ago

I doubt that will ever work, due to the design of that package, which has it fitting the model for one tau at a time.

rvlenth commented 7 months ago

Your problem with qm90 is not due to emmeans().

> summary(qm90)

Call: rq(formula = mpg ~ cyl * am, tau = 0.9, data = d)

tau: [1] 0.9

Coefficients:
            coefficients   lower bd       upper bd      
(Intercept)   2.440000e+01   2.410198e+01  1.797693e+308
cyl6         -3.000000e+00 -1.797693e+308  1.797693e+308
cyl8         -5.700000e+00 -1.797693e+308  -4.903360e+00
am            9.500000e+00 -1.797693e+308   1.039295e+01
cyl6:am      -9.900000e+00 -1.797693e+308  1.797693e+308
cyl8:am      -1.240000e+01  -1.270135e+01  1.797693e+308
Warning message:
In rq.fit.br(x, y, tau = tau, ci = TRUE, ...) : Solution may be nonunique

Note that several of the CIs are basically $(-\infty,\infty)$. This is a very ill-conditioned system. You are trying to estimate tail quantiles with a very small amount of data and too many parameters.

yuryzablotski commented 7 months ago

You are right! I tried with more data and it worked:

library(ISLR)
qm90 <- rq(wage ~ jobclass * health_ins, Wage, tau = 0.9)
emmeans(qm90, pairwise ~ jobclass|health_ins, type = "response", tau = 0.9)

That's fantastic! Thank you very much!

rvlenth commented 7 months ago

The revised code is now on GitHub. In this latest version, the tau argument is not even there anymore, as it gets tau from those used in the model. You may use at to subset them. Also note that you can pass arguments to summary.rq:

> emmeans(mod, ~ cyl|tau, se = "boot", at = list(tau = .25))
tau = 0.25:
 cyl emmean    SE df lower.CL upper.CL
 4     22.8 1.576 29     19.6     26.0
 6     18.1 0.914 29     16.2     20.0
 8     14.3 1.569 29     11.1     17.5

Confidence level used: 0.95 

Note the SEs differ from what I showed before for mod because of se = "boot"

I think I've completed this issue, so am closing.

mattmoo commented 6 months ago

I could make this a new issue, but this might be the best place to raise it. I was just forced to upgrade a VM running a complex analysis pipeline that includes a multivariable quantile regression gradient analysis at quantiles 0.01-0.99 on 6700 observations with six covariates. It failed at the emmeans/emtrends function, and it took me a while to work out why.

I think that the new tau input is pretty cool. That said, it would be good to include a warning that the tau argument is deprecated, that was not clear to me.

Of more concern (to me) is that something seems to have changed with how emtrends does its calculations. Obviously my reference grid is large, but emtrends at a single quantile used to run on a machine with 64GB RAM, and now it doesn't. I haven't dived into it, but is there a chance that it's generating a reference grid for all quantiles, regardless of what is specified in at?

rvlenth commented 6 months ago

I think that the new tau input is pretty cool. That said, it would be good to include a warning that the tau argument is deprecated, that was not clear to me.

Well, I put that in the NEWS file, and updated the documentation. That's a fairly conventional way of communicating when changes are made, especially when it is a welcome improvement (which I think it is, largely). My sense is that a relatively small fraction of users do quantile regression, so I didn't want to put it in a splashy startup message. (BTW, I fixed the formatting of the last part of that section of the models vignette.)

If I understand it, you wanted those trend estimates for tau = 0.01, 0.02, ..., 0.99, and who knows how many factor/covariate combinations $N$. Yes, as is true of all multivariate situations, it does do all 99 tau values, then pares it down if at gives a smaller selection. So I suspect it is requiring a humongous amount of storage -- especially considering that it will be storing a covariance matrix of size $99N \times 99N$ before paring it down. The workaround would be fitting smaller models with just some of the taus at a time. I know that is an inconvenience, and I'm sorry about that -- but OTOH you are asking for a whole lot of stuff at once.


Looking back at this, I realize I was wrong. The dimension of the covariance matrix will be $99p\times99p$ where $p$ is the number of regression coefficients. Dunno what I was thinking...
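
A hedged sketch of that batching workaround (the formula and data below are placeholders, not the actual analysis): fit one tau at a time, extract the trends, and bind the results.

# Sketch of the workaround: one tau per fit keeps each stored covariance
# matrix small; the model below is a placeholder for the real pipeline
library(quantreg)
library(emmeans)
taus <- seq(0.01, 0.99, by = 0.01)
trend_list <- lapply(taus, function(t) {
  m <- rq(wage ~ jobclass + age, data = ISLR::Wage, tau = t)
  cbind(tau = t, as.data.frame(emtrends(m, ~ jobclass, var = "age")))
})
trends <- do.call(rbind, trend_list)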

mattmoo commented 6 months ago

Well, I put that in the NEWS file, and updated the documentation. That's a fairly conventional way of communicating when changes are made, especially when it is a welcome improvement (which I think it is, largely). My sense is that a relatively small fraction of users do quantile regression, so I didn't want to put it in a splashy startup message. (BTW, I fixed the formatting of the last part of that section of the models vignette.)

I think that's definitely the case! However, perhaps it is possible to throw a warning if a deprecated tau argument is specified?

If I understand it, you wanted those trend estimates for tau = 0.01, 0.02, ..., 0.99, and who knows how many factor/covariate combinations $N$. Yes, as is true of all multivariate situations, it does do all 99 tau values, then pares it down if at gives a smaller selection. So I suspect it is requiring a humongous amount of storage -- especially considering that it will be storing a covariance matrix of size $99N \times 99N$ before paring it down. The workaround would be fitting smaller models with just some of the taus at a time. I know that is an inconvenience, and I'm sorry about that -- but OTOH you are asking for a whole lot of stuff at once.

I've downgraded to 1.8.8 for now, and have confirmed that the RAM utilisation is stable and far lower, never getting above 3GB. I certainly understand that it won't be a priority for you, but the new implementation is far more memory intensive. I'll fit them separately in future; when you work with quantile regression, you get used to inconvenience :)

yuryzablotski commented 6 months ago

Would the performance problem be solved if only specific taus were allowed? For example, a maximum of 20 of them: seq(0, 1, 0.05). Nobody needs 100 or more taus. And the use and popularity of quantile regression may grow in the future solely because it estimates the median instead of the mean. Thanks for the great work!

rvlenth commented 6 months ago

We have evidence here of a user who does want that many tau values. And if they chose that many of them when using rq(), then who am I to try to guess which ones in object$tau they really wanted? And BTW, seq(0, 1, 0.05) includes two invalid tau choices that are guaranteed not to be in object$tau.

If I had originally designed the support for rq objects the way it is now (starting in version 1.9.0), then nobody would have grounds to object because they'd just be facing the natural consequences of their choice to use so many tau values. It's a more complicated situation now, since there is a precedent for allowing a tau argument. One issue is that it is messy to program for a subset of taus, because they must all match values in object$tau. I'd rather not bulk up the code that way.

rvlenth commented 6 months ago

Well, I did it anyway...

> library(emmeans)
> library(quantreg)
> mod = rq(mpg ~ factor(cyl), tau = c(.25, .5, .75), data = mtcars)
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique

> # I really wish this message would tell us *which* tau is problematic...

> ref_grid(mod)  # default, tau unspecified
'emmGrid' object with variables:
    cyl = 4, 6, 8
    tau = multivariate response levels: 0.25, 0.50, 0.75

> ref_grid(mod, tau = c(.1, .25, .5))   # tau specified, including an unavailable one
'emmGrid' object with variables:
    cyl = 4, 6, 8
    tau = multivariate response levels: 0.25, 0.50

> ref_grid(mod, tau = .15)   # unavailable tau
Error in emm_basis.rq(object, ...) : No valid 'tau' values were specified

I need to update the documentation, then I will push to GitHub

mattmoo commented 6 months ago

Would the performance problem be solved if only specific taus were allowed? For example, a maximum of 20 of them: seq(0, 1, 0.05). Nobody needs 100 or more taus. And the use and popularity of quantile regression may grow in the future solely because it estimates the median instead of the mean. Thanks for the great work!

Thanks Yury. I think it's not uncommon to use quite a fine quantile net, and it enables quite robust visualisation for data where the relationship varies across the range (my output below).

[image]

Well, I did it anyway...

Thanks Russell, that's a great help!

yuryzablotski commented 6 months ago

Thanks Russell, thanks Matt! I love the tau argument very much! It works in emmeans, even with interactions, and it's very useful! Besides, it allowed me to plot predictions via the emmip function, which is fantastic, because the standard visualisation of QR results is not good.

library(tidyverse)
library(ISLR)
library(quantreg)
theme_set(theme_test())

qr_all <- rq(wage ~ jobclass + age, Wage, tau = seq(.05, .95, by = 0.05))
emmeans(qr_all, pairwise ~ jobclass, weights = "prop", by = "tau")
emmip(qr_all, tau ~ age, CIs = T, at = list(age = c(25, 45, 65)))

qr_all_int <- rq(wage ~ jobclass*education, Wage, tau = seq(.25, .75, by = 0.25))
emmeans(qr_all_int, pairwise ~ education | jobclass, type = "response", infer = T, by = "tau")
emmip(qr_all_int, tau ~ education | jobclass, type = "response", CIs = T)

Your visualisation, Matt, is fantastic! How did you do that? I programmed both prediction and estimate plots on my own via the broom::tidy() function, and have written my own function for visualizing separate predictors from a multivariable QR, but that was hardly the most optimal way. Here is the example below, and I will publish a blog article about how to program this function soon, in case one of you is interested. When my blog is out, I thought I would contact the rq people (never did before) and ask them to improve the visualisation of QR beyond plot( summary(qr_all) ). But if such visualization (like yours above or mine below) could somehow be integrated into the emmeans package (not a feature request, just thinking out loud), that would be genius.

[screenshot]

One more thing, if I may? I wasn't able to use only a few particular taus, similarly to using only particular ages. I tried a lot, but could not get "at" or "by" or "list" to extract only 3 taus out of 99. I thought that if it works for age, it should work for taus. In the end, it's not a problem to just rerun the model with only those particular taus I wanted and get all contrasts via by = "tau". So it's also not a new issue, but I thought I'd mention it.

rvlenth commented 6 months ago

@yuryzablotski Did you re-install emmeans from GitHub? Follow the instructions on the "Code" page.

yuryzablotski commented 6 months ago

Yes, I just did it again and it worked perfectly with a few taus below .9.

For instance, this code produces what it is supposed to:

emmip(qr_all, tau ~ age, CIs = T, at = list(age = c(25, 45, 65), tau = c(.1, .2, .3)))

Thanks for that!

I guess this dataset doesn't have enough data for CIs at high taus, like tau = .9, because when a list of higher tau values is specified, only the first in the list appears. Here is the example:

library(tidyverse)
library(ISLR)
library(quantreg)
library(emmeans)
qr_all <- rq(wage ~ jobclass + age + education, Wage, tau = seq(.05, .95, by = 0.05))

emmeans(qr_all, pairwise ~ jobclass|tau, weights = "prop", at = list(tau = c(.1, .9)))
emmip(qr_all, tau ~ age, CIs = T, at = list(age = c(25, 45, 65), tau = c(.1, .9)))

But since that's not a problem with small taus, I would just accept that it does not work with a small dataset and would not use those quantiles for inference.

Besides, and interestingly, a regression with a single tau has no problem with high quantiles, so I would also just use separate taus, like in this example, which works perfectly:

qr90 <- rq(wage ~ jobclass + age + education, Wage, tau = 0.9)
emmip(qr90, ~ education, CIs = T)

I have already taken tons of your time and I really appreciate your effort! I wish you a happy New Year! Kind regards, Yury

rvlenth commented 6 months ago

@yuryzablotski

When you use the tau argument itself (which is what is recommended), it works:

> emmeans(qr_all, ~jobclass|tau, weights = "prop", tau = c(.1,.9))
tau = 0.1:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial    73.4 1.31 2993     70.9     76.0
 2. Information   79.2 1.17 2993     76.9     81.5

tau = 0.9:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial   155.6 3.29 2993    149.2    162.1
 2. Information  158.6 3.22 2993    152.3    165.0

Results are averaged over the levels of: education 
Confidence level used: 0.95 
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1 non-positive fis
2: In summary.rq(xi, U = U, ...) : 10 non-positive fis

> emmip(qr_all, tau ~ age, CIs = T, tau = c(.1,.9), at = list(age = c(25,45,65)))
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1 non-positive fis
2: In summary.rq(xi, U = U, ...) : 10 non-positive fis

[image]

When I use at instead, I get:

> emmeans(qr_all, ~jobclass|tau, weights = "prop", at = list(tau = c(.1, .9) ))
NOTE: A nesting structure was detected in the fitted model:
    jobclass %in% tau, education %in% tau
tau = 0.1:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial    73.4 1.31 2993     70.9     76.0
 2. Information   79.2 1.17 2993     76.9     81.5

Results are averaged over the levels of: education 
Confidence level used: 0.95 
Warning messages:
1: In summary.rq(xi, U = U, ...) : 1 non-positive fis
2: In summary.rq(xi, U = U, ...) : 10 non-positive fis

I'm not sure exactly what is happening here, but the relevant output is in

NOTE: A nesting structure was detected in the fitted model: jobclass %in% tau, education %in% tau

which says that both jobclass and education are nested in tau -- and apparently these factors exist only for the first value of tau. This might have something to do with how at works with multivariate factors. I'll investigate. But you should use the tau argument, which I recommend anyway.

PS -- there are warning messages all over the place that come from the model. They concern me but apparently don't affect emmeans().

rvlenth commented 6 months ago

Interestingly, this nesting issue comes up only with certain tau values. (I have edited out the warnings.)

> ref_grid(qr_all, at = list(tau = c(.1,.8)))
'emmGrid' object with variables:
    jobclass = 1. Industrial, 2. Information
    age = 42.415
    education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
    tau = multivariate response levels: 0.1, 0.8

> ref_grid(qr_all, at = list(tau = c(.1,.85)))
NOTE: A nesting structure was detected in the fitted model:
    jobclass %in% tau, education %in% tau
'emmGrid' object with variables:
    jobclass = 1. Industrial, 2. Information
    age = 42.415
    education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
    tau = multivariate response levels: 0.10, 0.85
Nesting structure:  jobclass %in% tau, education %in% tau

> ref_grid(qr_all, at = list(tau = c(.1,.9)))
NOTE: A nesting structure was detected in the fitted model:
    jobclass %in% tau, education %in% tau
'emmGrid' object with variables:
    jobclass = 1. Industrial, 2. Information
    age = 42.415
    education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
    tau = multivariate response levels: 0.1, 0.9
Nesting structure:  jobclass %in% tau, education %in% tau

> ref_grid(qr_all, at = list(tau = c(.1,.95)))
'emmGrid' object with variables:
    jobclass = 1. Industrial, 2. Information
    age = 42.415
    education = 1. < HS Grad, 2. HS Grad, 3. Some College, 4. College Grad, 5. Advanced Degree
    tau = multivariate response levels: 0.10, 0.95

I think the reason is a bug in at where we require exact matching:

> alltau = seq(.05,.95,by=.05)
> alltau
 [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
> c(.1, .8, .85, .9, .95) %in% alltau
[1]  TRUE  TRUE FALSE FALSE  TRUE

... whereas the code for rq support uses fuzzy matching.
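
For illustration, a rough sketch of tolerance-based matching (not the actual emmeans internals):

# Sketch: TRUE if x[i] is within tol of some value in `table`, which
# sidesteps the floating-point mismatch that trips up exact %in% matching
fuzzy_in <- function(x, table, tol = 1e-8) {
  sapply(x, function(xi) any(abs(xi - table) < tol))
}
fuzzy_in(c(.1, .8, .85, .9, .95), seq(.05, .95, by = .05))
#> [1] TRUE TRUE TRUE TRUE TRUE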

yuryzablotski commented 6 months ago

That's fantastic, thank you very much, Russell! The tau argument works even for interactions. Yes, the warnings are from QR, and apparently emmeans() handles them well. I think "tau = c(...)" is even more elegant than "at = list(tau = c(...))", so if the fuzzy matching is not an issue for "tau = c(...)", could we just roll with it?

rvlenth commented 6 months ago

Nope, we can't ignore this, as it can affect other situations and other models where a multivariate factor has numeric reference levels and is subsetted in at. Fortunately, I think I have it now:

> emmeans(qr_all, ~jobclass|tau, weights = "prop", at = list(tau = c(.1, .9) ))
tau = 0.1:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial    73.4 1.31 2993     70.9     76.0
 2. Information   79.2 1.17 2993     76.9     81.5

tau = 0.9:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial   155.6 3.29 2993    149.2    162.1
 2. Information  158.6 3.22 2993    152.3    165.0

Results are averaged over the levels of: education 
Confidence level used: 0.95 
rvlenth commented 6 months ago

Thanks for reporting this. It is a subtle bug!

mattmoo commented 6 months ago

Your visualisation, Matt, is fantastic! How did you do that? I programmed both prediction and estimate plots on my own via the broom::tidy() function, and have written my own function for visualizing separate predictors from a multivariable QR, but that was hardly the most optimal way. Here is the example below, and I will publish a blog article about how to program this function soon, in case one of you is interested. When my blog is out, I thought I would contact the rq people (never did before) and ask them to improve the visualisation of QR beyond plot( summary(qr_all) ). But if such visualization (like yours above or mine below) could somehow be integrated into the emmeans package (not a feature request, just thinking out loud), that would be genius.

Thanks Yury, it's a custom function, and probably well beyond the scope of emmeans. The treatment of tau as a variable makes it a bit more straightforward, as there's no need to bind again afterwards. I'll attach the code, for reference.

I have mainly been using as.character() to match quantiles; I'm sure the fuzzy matching will be far more robust.

emm_obj = emmeans(qr_all, pairwise ~ jobclass|tau, weights = "prop", infer = c(TRUE, TRUE))

library(data.table)
rq_summary_dt = as.data.table(emm_obj$contrasts)
highlight_quantiles_vector = c(0.1, 0.25, 0.5)

rq_summary_plot = ggplot(data = rq_summary_dt,
                         aes(x = tau)) +

  geom_hline(yintercept = 0, colour = 'red') +
  geom_line(aes(y = estimate)) +

  # Make ribbon
  geom_point(aes(y = upper.CL),
             shape = 6,
             size = .5,
             fill = 'grey20') +
  geom_point(aes(y = lower.CL),
             shape = 2,
             size = .5,
             fill = 'grey20') +
  geom_ribbon(aes(ymin = lower.CL,
                  ymax = upper.CL),
              alpha = .2)

# Highlight specified quantiles.
if (length(highlight_quantiles_vector) > 0) {
  # Little bit of coercion to get around floating point accuracy limits
  highlight_quantiles_data_dt = rq_summary_dt[as.character(tau) %in% as.character(highlight_quantiles_vector)]

  highlight_quantiles_data_long_dt = melt(
    highlight_quantiles_data_dt,
    measure = patterns("CL$"),
    value.name = 'bound'
  )
  rq_summary_plot = rq_summary_plot +
    ggrepel::geom_label_repel(
      data = highlight_quantiles_data_long_dt[variable == 'upper.CL'],
      aes(label = tau,
          y = bound),
      nudge_x = 0.03,
      nudge_y = highlight_quantiles_data_long_dt[variable == 'upper.CL', max(bound)/3]
      # hjust = 0
    ) +
    geom_line(
      data = highlight_quantiles_data_long_dt,
      aes(y = bound,
          group = tau),
      linetype = 'twodash',
      colour = 'red'
    ) +
    geom_point(
      data = highlight_quantiles_data_long_dt,
      aes(y = bound),
      shape = 22,
      fill = 'red'
    )
}

# Estimate points.
rq_summary_plot = rq_summary_plot + 
  geom_point(aes(y = estimate), size = 0.75) 

[image]

rvlenth commented 6 months ago

[Updated comment]

Looking at this quickly, it seems like we could get much of this via

RG <- ref_grid(qr_all)
emmip(pairs(emmeans(RG, ~jobclass|tau), weights="prop"), ~tau, CI=TRUE,
    xlab = "quantile", ylab = "jobclass difference") + 
ggplot2::coord_flip()

[image]

This is actually pretty comparable to @mattmoo 's plot, which I swear I did not see when I originally wrote this. I had been comparing this to the plot a long ways above with the green ribbon, and that's why the coord_flip() is added.

yuryzablotski commented 6 months ago

Nope, we can't ignore this, as it can affect other situations and other models where a multivariate factor has numeric reference levels and is subsetted in at. Fortunately, I think I have it now:

> emmeans(qr_all, ~jobclass|tau, weights = "prop", at = list(tau = c(.1, .9) ))
tau = 0.1:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial    73.4 1.31 2993     70.9     76.0
 2. Information   79.2 1.17 2993     76.9     81.5

tau = 0.9:
 jobclass       emmean   SE   df lower.CL upper.CL
 1. Industrial   155.6 3.29 2993    149.2    162.1
 2. Information  158.6 3.22 2993    152.3    165.0

Results are averaged over the levels of: education 
Confidence level used: 0.95 

Yesterday's version of emmeans still shows the quantreg warning, but I probably need to update emmeans again. I'll do that soon.

yuryzablotski commented 6 months ago

The treatment of tau as a variable makes it a bit more straightforward, as there's no need to bind again afterwards.

Thanks for the code Matt! I like the "as.data.table" and "geom_label_repel" functions! My code is similar.

Treating "tau" as a variable makes a lot of sense to me. The only thing which still does not work for me are the contrasts between "taus". I am not sure it's possible, but it seems of practical utility to me. Interestingly, it gets the difference between taus, so it seems possible, but it throws NAs for SE and P-values.

emmeans(qr_all, pairwise ~ tau | jobclass, weights = "prop", tau = c(.1, .5, .9) )

Is there a way to fix that?

yuryzablotski commented 6 months ago

Looking at this quickly, it seems like we could get much of this via

emmip(pairs(emmeans(RG, ~jobclass|tau), weights="prop"), ~tau, CI=TRUE,
    xlab = "quantile", ylab = "jobclass difference") + 
ggplot2::coord_flip()

However, I get something quite different: [image]

Since your code for the ribbon appears to be the upper and lower confidence limits for the jobclass comparison, plotted against tau, I don't get why you have a monotone trend in those differences while mine winds around. What am I overlooking?

oops, I should have mentioned RG = ref_grid(qr_all)

That's actually really nice! I was even able to get multiple contrasts for the education variable via the "contrast" column in the results of emmeans + pairs; however, I could not get the facets to scales = "free". So I used as.data.table, recommended by Matt, and produced controllable facets. The last problem was that I could not annotate all contrasts at once, so I needed to do that for every contrast on its own, which works but is a lot of code. I still haven't figured out how to connect contrasts + faceting + annotations_per_facet, but I'll continue to try, because contrasts are (in my opinion) the most important feature of emmeans and I will always need to use them. So, visualizing them would be terrific. Anyway, here is the code:

ref_grid(qr_all) %>%
  emmeans(~ education|tau) %>%
  pairs(weights = "prop") %>%
  emmip(~ tau|contrast, CIs = T) +
  labs(x = "Quantile (tau)", y = "Education difference") +
  coord_flip()

[screenshot]

plot_contrasts <- ref_grid(qr_all) %>%
  emmeans(~ education|tau) %>%
  pairs(weights = "prop", infer = c(T, T)) %>%
  as.data.table() %>%
  rstatix::add_significance("p.value") %>%
  rename(p = p.value.signif) %>%
  select(-SE, -df, -t.ratio, -p.value) %>%
  rename(LCL = lower.CL, UCL = upper.CL)

Without annotations:

ggplot(data = plot_contrasts, aes(x = tau, y = estimate)) +
  geom_point(color = "green", size = 3) +  # color #27408b
  geom_line(color = "green", size = 1) +
  facet_wrap(~contrast, scales = "free") +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.25, fill = "green") +
  coord_flip() +
  scale_x_reverse(breaks = seq(from = 0, to = 1, by = 0.1)) +
  # Intercept
  geom_hline(yintercept = 0, alpha = .3, size = 1)

[screenshot]

With annotations but without facets:

contrats_1 <- plot_contrasts %>% filter(contrast == "1. < HS Grad - 2. HS Grad")

ggplot(data = contrats_1, aes(x = tau, y = estimate)) +
  geom_point(color = "green", size = 3) +  # color #27408b
  geom_line(color = "green", size = 1) +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.25, fill = "green") +
  coord_flip() +
  scale_x_reverse(breaks = seq(from = 0, to = 1, by = 0.1)) +
  # Intercept
  geom_hline(yintercept = 0, alpha = .3, size = 1) +
  # annotate
  annotate(geom = "text", x = contrats_1$tau, y = contrats_1$estimate, hjust = -.19,
           label = paste(round(contrats_1$estimate, 2), contrats_1$p))

[screenshot]
rvlenth commented 6 months ago

@yuryzablotski Comments on recent postings...

Yesterday's version of emmeans still shows the quantreg warning ...

The new version does not suppress the warning. I just edited out the warnings in my posting, to reduce clutter.

Treating "tau" as a variable makes a lot of sense to me. The only thing which still does not work for me are the contrasts between "taus". I am not sure it's possible, but it seems of practical utility to me. Interestingly, it gets the difference between taus, so it seems possible, but it throws NAs for SE and P-values.

This cannot be avoided, and that fact is documented. The tau values are handled one at a time by rq(), and it does not provide covariances between parameter estimates having different tau values. Changing this would require a change in the quantreg package, and that seems both unlikely and impractical, due to the potentially mammoth size of the covariance matrix.

(plot experiments)

I wonder what Tufte would say about the annotated plot with intercept. He criticized "chart junk" -- plots that were too busy -- and this may be the case. We can see whether the ribbon crosses the intercept, so the inclusion of asterisks seems completely unnecessary: the distance between the ribbon and the zero line illustrates the significance to much higher resolution than the asterisks. (I am pretty much an anti-asterisk person anyway, and have them disabled in my R startup options. I think people should actually read the P values.) The annotations introduce considerable clutter and require a lot of space. Instead of the annotations, increase the number of tick labels and maybe even have sub-ticks, then it'd be possible to discern the values to 2 digits accuracy.
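
A minimal sketch of that finer-tick idea, reusing the plot_contrasts data frame built earlier in the thread (the break settings are just illustrative):

# Sketch: drop the text annotations and use denser major/minor breaks
# instead, so values can be read to about 2 digits from the axes
library(ggplot2)
ggplot(plot_contrasts, aes(x = tau, y = estimate)) +
  geom_line() +
  geom_ribbon(aes(ymin = LCL, ymax = UCL), alpha = 0.25) +
  geom_hline(yintercept = 0) +
  facet_wrap(~ contrast) +
  coord_flip() +
  scale_x_reverse(breaks = seq(0, 1, by = 0.05),
                  minor_breaks = seq(0, 1, by = 0.01)) +
  scale_y_continuous(breaks = scales::pretty_breaks(10))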

I actually like the very first plot the best, because of the common scale. Guess that establishes me as a fossil, which of course I am (1975 PhD).

yuryzablotski commented 6 months ago

You are right regarding the clutter on the plot and the asterisks. My vet-med-university colleagues, however, like (or are able to appreciate) only very visual and very simple statistical inference. E.g., P values for them are mostly black and white :), < 0.05 or > 0.05. They don't want to read P values; they only want to know whether it's significant. I work against it, but it's hard, so at least a few categories of P values as asterisks is my first attempt to break the significant-not-significant mentality. But yeah, it still does not make sense to include asterisks.

I already assumed that quantreg does not compare the taus. But theoretically, when the ribbons for tau = 0.1 and tau = 0.9 do not cross/overlap, I would consider them significantly different.

The first plot is great! Especially the simplicity of code! By the way, where would our world be without value and knowledge of fossils? ;) Love emmeans package and thank you for it again!