metrumresearchgroup / pmplots

Plots for Pharmacometrics
https://metrumresearchgroup.github.io/pmplots

cont_cat: showing categories where N = 0 in x-axis #100

Closed: janellelennie closed this issue 2 months ago

janellelennie commented 2 months ago

Hi Kyle!

I'm using yspec and pmplots to make some EDA plots. There is a level in the yspec factors for which there is not (yet) any data in the dataset (i.e., N=0 for that level). When plotting with cont_cat, the x-axis of the boxplot is thrown off: the axis is labeled with the levels from yspec, but only the levels with data are plotted. Is there an option that we can set to "show all categorical levels even if N=0"? When I use geom_boxplot instead of cont_cat, all levels do show on the x-axis, including the level with no data.

Thanks, Janelle

kylebaron commented 2 months ago

Hi @janellelennie - Thanks for reaching out; hope everything's going well with you!

I put together some example code to try to investigate this. I think it's not what you're seeing but could you help me change it to reproduce what you have?

library(pmplots)
#> Loading required package: ggplot2
library(yspec)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Test data and spec

data <- ys_help$data()
spec <- ys_help$spec()

data1 has all levels of RF

data1 <- ys_factors(data, spec)

data2 drops severe renal impairment before adding factors

data2 <- filter(data, RF != "sev") %>% ys_factors(spec)

count(data1, RF)
#>         RF    n
#> 1   Normal 3280
#> 2     Mild  360
#> 3 Moderate  360
#> 4   Severe  360
levels(data1$RF)
#> [1] "Normal"   "Mild"     "Moderate" "Severe"

count(data2, RF)
#>         RF    n
#> 1   Normal 3280
#> 2     Mild  360
#> 3 Moderate  360
levels(data2$RF)
#> [1] "Normal"   "Mild"     "Moderate" "Severe"

cont_cat(), all levels

cont_cat(data1, x = "RF", y = "WT")

cont_cat() missing severe

cont_cat(data2, x = "RF", y = "WT")

ggplot all levels

ggplot(data = data1) + 
  geom_boxplot(aes(x = RF, y = WT))

ggplot missing severe

ggplot(data = data2) + 
  geom_boxplot(aes(x = RF, y = WT))

Created on 2024-08-02 with reprex v2.1.1

janellelennie commented 2 months ago

Thank you, all is well, I hope you are doing great!

Ah. As I went to run your code, I got snagged on ys_factors. That appears to be a newer function. Our Metworx blueprint has "validated" packages for us to use at an old snapshot, which is pmplots 0.3.5. I don't think the version has changed functionality for cont_cat, but I will check and send you a reproducible version using the yspec example. What we need to do is make the DV values missing for the severe group, and the example will then probably replicate my case below.


With the real data, here is what's happening. My yspec has 13 levels of race, only 9 present in the dataset at this time.

P.S. I will use an anonymous Y variable label for the sake of data security.


> levels(cov_dat$IARACEN)
 [1] "African American/African"              "American Indian or Alaskan Native"     "Asian - Central/South Asian"          
 [4] "Asian - East Asian"                    "Asian - Japanese"                      "Asian - South East Asian"             
 [7] "Native Hawiian or Pacific Islander"    "White - Arabic/North African Heritage" "White - Caucasian/European Heritage"  
[10] "Mixed White"                           "Mixed Asian"                           "Multiple"                             
[13] "Missing"  

> unique(cov_dat$IARACEN)
[1] White - Caucasian/European Heritage Missing                             American Indian or Alaskan Native  
[4] Multiple                            Asian - Japanese                    Asian - Central/South Asian        
[7] African American/African            Native Hawiian or Pacific Islander  Asian - East Asian                 
13 Levels: African American/African American Indian or Alaskan Native Asian - Central/South Asian ... Missing

Running pmplots, this is my output. I noticed it was odd because for "Multiple N=3" there are a lot of points, which doesn't make sense for N=3. I realized that box & whiskers should actually be the White N=164 box & whiskers and realized the levels were wrong. What is happening is that there is a single "Asian - Central/South Asian" subject, however they have a missing continuous covariate record. So there is no value to plot for that subject on the y-axis.

cont_cat(df = cov_dat, x = "IARACEN", y = "Y") +rot_x(45)
[image: cont_cat() boxplot of Y by IARACEN]

There is an NA label because of what is happening, too

I compared using ggplot and got this:

ggplot() +
  geom_boxplot(data=cov_dat, aes(x = factor(IARACEN), y = Y) ) +rot_x(45)
[image: ggplot geom_boxplot() of Y by IARACEN]

Now we can see all 9 levels which are present in the dataset, and no value for the "Asian - Central/South Asian" subject who is missing their Y concentration. In pmplots, it should theoretically say "Asian - Central/South Asian N=1", but just have no data in the boxplot. Is the pmplots function doing an na.rm=T somewhere which is dropping this missing record?
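A side note on the ggplot comparison above: calling factor() on a column that is already a factor rebuilds it and, in base R, drops levels that do not occur in the data. That is consistent with only the 9 observed levels appearing on that axis. A minimal base R sketch, where the vector `race` and its levels are made up for illustration and are not the real data:

```r
# Hypothetical factor: 4 declared levels, only 2 observed
race <- factor(c("A", "B", "B"), levels = c("A", "B", "C", "D"))

# The declared levels are all retained on the original factor
nlevels(race)

# Re-factoring keeps only the levels that occur in the data
refactored <- factor(race)
nlevels(refactored)
levels(refactored)
```

Here `nlevels(race)` is 4, while `refactored` keeps only "A" and "B".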

janellelennie commented 2 months ago

@kylebaron if you replace your data2 derivation from data2 <- filter(data, RF != "sev") %>% ys_factors(spec) to this:

data2 <- data %>% mutate(WT = ifelse(RF == "sev", NA, WT)) %>% ys_factors(spec)

You'll then get:

cont_cat(data2, x = "RF", y = "WT")

[image: cont_cat() boxplot of WT by RF with WT set to NA for the severe group]

which is mostly reproducing my issue

kylebaron commented 2 months ago

I see ... so there are still records in the data set, but they are all NA.

janellelennie commented 2 months ago

Right. Perhaps this isn't a typical use case, so if it isn't worth your time, no worries! I expect the next data cut to add many more races, and I want this EDA script to be reusable when new data arrive. The workaround would be to comment those levels out of the yspec for now.

kylebaron commented 2 months ago

Ok; I'm reproducing this now. It's definitely related to the data summary that we have to do to get the numbers in the tick labels; you can opt out of that summary and the plot shows as expected. There isn't a very public workaround for this (argument), but let me see if we can do something else.

library(pmplots)
#> Loading required package: ggplot2
library(yspec)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Test data and spec

data <- ys_help$data()
spec <- ys_help$spec()

data1 has all levels of RF

data1 <- ys_add_factors(data, spec, .suffix = "")

data2 sets WT to NA for severe renal impairment before adding factors

data2 <- 
  data %>% 
  mutate(WT = ifelse(RF=="sev", NA, WT)) %>% 
  ys_factors(spec)

cont_cat(), all levels

cont_cat(data1, x = "RF", y = "WT")

cont_cat() missing severe

cont_cat(data2, x = "RF", y = "WT")
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).

cont_cat() missing severe, don’t show number

cont_cat(data2, x = "RF", y = "WT", shown = FALSE)
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).

Created on 2024-08-02 with reprex v2.1.1
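The kind of mismatch described here can be illustrated without pmplots. The sketch below is an assumption about the general mechanism, not a reading of the pmplots source: if the per-level counts are computed after rows with a missing y value are dropped, the count vector has fewer entries than the axis has levels, so labels can end up next to the wrong box. The data frame `toy` is made up.

```r
# Toy data: the "Severe" level has rows, but every WT value is NA
toy <- data.frame(
  RF = factor(
    c("Normal", "Normal", "Mild", "Severe", "Severe"),
    levels = c("Normal", "Mild", "Severe")
  ),
  WT = c(70, 80, 65, NA, NA)
)

# Levels that would appear on the x-axis
axis_levels <- levels(toy$RF)

# Counts computed after dropping NA rows: "Severe" disappears entirely
complete <- toy[!is.na(toy$WT), ]
counts_dropped <- table(droplevels(complete$RF))

# Counts computed per level on the full data: "Severe" is kept with 0
counts_kept <- tapply(!is.na(toy$WT), toy$RF, sum)

length(axis_levels)     # number of axis positions
length(counts_dropped)  # one label short
counts_kept
```

With the first summary there are three axis positions but only two labels; the second summary yields one count per level, with 0 for "Severe".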

janellelennie commented 2 months ago

Yep, looks like it. Okay great, thank you for the detective work. Thinking about it again, I can't comment the levels out in the yaml because this race level will be used in the other EDA plots. For another workaround I could write a little function to filter out NAs for the DV before plotting
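A minimal sketch of the kind of helper mentioned above, written in base R. The function name `drop_missing_y` and the toy data are hypothetical; the thread does not confirm how cont_cat labels the result on the real data, so treat this as a starting point only.

```r
# Hypothetical helper: keep only rows where the y column is not missing.
# Factor levels on other columns are left untouched.
drop_missing_y <- function(df, y) {
  stopifnot(is.data.frame(df), y %in% names(df))
  df[!is.na(df[[y]]), , drop = FALSE]
}

# Toy data: every WT value in the "Severe" level is NA
toy <- data.frame(
  RF = factor(
    c("Normal", "Normal", "Mild", "Severe"),
    levels = c("Normal", "Mild", "Severe")
  ),
  WT = c(70, 80, 65, NA)
)

plot_data <- drop_missing_y(toy, "WT")
nrow(plot_data)        # rows with a WT value
levels(plot_data$RF)   # "Severe" is still a declared level
```

The filtered data could then be passed to the plotting call, for example `cont_cat(plot_data, x = "RF", y = "WT")`. Because the rows are removed rather than the level, the yspec definition stays intact for the other EDA plots.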

kylebaron commented 2 months ago

@janellelennie can you try this?

EDIT: see the one below.

kylebaron commented 2 months ago

Oops .. hang on .

kylebaron commented 2 months ago

This one

library(pmplots)
#> Loading required package: ggplot2
library(yspec)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Test data and spec

data <- ys_help$data()
spec <- ys_help$spec()

data1 has all levels of RF

data1 <- ys_add_factors(data, spec, .suffix = "")

data2 sets WT to NA for severe renal impairment before adding factors

data2 <- 
  data %>% 
  mutate(WT = ifelse(RF=="sev", NA, WT)) %>% 
  ys_factors(spec)

ff <- function (df, x, y) {
  .xcol <- rlang::sym(x)
  .ycol <- rlang::sym(y)
  df <- mutate(df, notna = !is.na(!!.ycol))
  df <- group_by(df, !!.xcol)
  .sum <- summarize(df, n = sum(notna), N = n_distinct(ID[notna]))
  .sum <- ungroup(.sum)
  as.data.frame(.sum)
}

assignInNamespace("box_labels", value = ff, "pmplots")

cont_cat(), all levels

cont_cat(data1, x = "RF", y = "WT")

cont_cat() missing severe

cont_cat(data2, x = "RF", y = "WT")
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).

cont_cat() missing severe, don’t show number

cont_cat(data2, x = "RF", y = "WT", shown = FALSE)
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).

Created on 2024-08-02 with reprex v2.1.1
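To see what the replacement summary returns, the function `ff` from the reprex above can be run on its own against a small data frame. This sketch copies `ff` as posted and applies it to made-up data (`toy` is hypothetical); it does not call assignInNamespace or touch pmplots.

```r
library(dplyr)

# Copied from the reprex above: per-level count of non-missing y values (n)
# and number of distinct IDs contributing a non-missing value (N)
ff <- function(df, x, y) {
  .xcol <- rlang::sym(x)
  .ycol <- rlang::sym(y)
  df <- mutate(df, notna = !is.na(!!.ycol))
  df <- group_by(df, !!.xcol)
  .sum <- summarize(df, n = sum(notna), N = n_distinct(ID[notna]))
  .sum <- ungroup(.sum)
  as.data.frame(.sum)
}

# Toy data: subject 1 has two records; "Severe" subjects have no WT value
toy <- data.frame(
  ID = c(1, 1, 2, 3, 4),
  RF = factor(
    c("Normal", "Normal", "Mild", "Severe", "Severe"),
    levels = c("Normal", "Mild", "Severe")
  ),
  WT = c(70, 71, 65, NA, NA)
)

res <- ff(toy, "RF", "WT")
res
```

The result has one row per level that has records, including "Severe" with n = 0 and N = 0, so the number of labels matches the number of axis positions.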

janellelennie commented 2 months ago

You're amazing! I added the ff and assignInNamespace pieces to my script and it works:

[image: cont_cat() boxplot after applying the workaround]

So really the interpretation of these plots is that the N's reflect the number of data points in the boxplots, not the number of subjects in that categorical covariate group. There may be subjects with missing Y values in any of the categories, and they are not included in the count for N. So I suppose it's more of a small "n" than a big "N". But the demographics tables are the anchors for the categorical covariate summaries, so this makes sense as-is since it's reflecting the data in the plot. Just thinking out loud :)

kylebaron commented 2 months ago

Yeah, I should look at the specifics on that with consistency in mind. It was soo long ago that we did this, probably should revisit in case anything needs clarification or even changing.

Thanks for the report on this and pulling in the workaround. I'll open an issue to make sure this gets fixed.

Have a great weekend, @janellelennie!

janellelennie commented 2 months ago

Sounds great, thanks for being so quick to respond!! So nice "chatting" with you 😊 Have a great weekend as well!