Hi @janellelennie - Thanks for reaching out; hope everything's going well with you!
I put together some example code to investigate this. I don't think it shows what you're seeing, but could you help me change it to reproduce what you have?
library(pmplots)
#> Loading required package: ggplot2
library(yspec)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data <- ys_help$data()
spec <- ys_help$spec()
data1 has all levels of RF
data1 <- ys_factors(data, spec)
data2 drops severe renal impairment before adding factors
data2 <- filter(data, RF != "sev") %>% ys_factors(spec)
count(data1, RF)
#> RF n
#> 1 Normal 3280
#> 2 Mild 360
#> 3 Moderate 360
#> 4 Severe 360
levels(data1$RF)
#> [1] "Normal" "Mild" "Moderate" "Severe"
count(data2, RF)
#> RF n
#> 1 Normal 3280
#> 2 Mild 360
#> 3 Moderate 360
levels(data2$RF)
#> [1] "Normal" "Mild" "Moderate" "Severe"
cont_cat(), all levels
cont_cat(data1, x = "RF", y = "WT")
cont_cat(), missing severe
cont_cat(data2, x = "RF", y = "WT")
ggplot all levels
ggplot(data = data1) +
  geom_boxplot(aes(x = RF, y = WT))
ggplot missing severe
ggplot(data = data2) +
  geom_boxplot(aes(x = RF, y = WT))
Created on 2024-08-02 with reprex v2.1.1
Thank you, all is well, I hope you are doing great!
Ah. As I went to run your code, I got snagged on ys_factors. That appears to be a newer function. Our Metworx blueprint has "validated" packages for us to use at an old snapshot, which includes pmplots 0.3.5. I don't think the functionality of cont_cat has changed between versions, but I will check and send you a reproducible version using the yspec example. What we need to do is make the DV values missing for the severe group, and the example will probably replicate my example below.
With the real data, here is what's happening. My yspec has 13 levels of race, only 9 present in the dataset at this time.
P.S. I will use an anonymous Y variable label for the sake of data security.
> levels(cov_dat$IARACEN)
[1] "African American/African" "American Indian or Alaskan Native" "Asian - Central/South Asian"
[4] "Asian - East Asian" "Asian - Japanese" "Asian - South East Asian"
[7] "Native Hawiian or Pacific Islander" "White - Arabic/North African Heritage" "White - Caucasian/European Heritage"
[10] "Mixed White" "Mixed Asian" "Multiple"
[13] "Missing"
> unique(cov_dat$IARACEN)
[1] White - Caucasian/European Heritage Missing American Indian or Alaskan Native
[4] Multiple Asian - Japanese Asian - Central/South Asian
[7] African American/African Native Hawiian or Pacific Islander Asian - East Asian
13 Levels: African American/African American Indian or Alaskan Native Asian - Central/South Asian ... Missing
Running pmplots, this is my output. It looked odd because "Multiple N=3" shows a lot of points, which doesn't make sense for N=3. I realized that this box and whiskers actually belongs to the White N=164 group, so the labels are attached to the wrong levels. What is happening is that there is a single "Asian - Central/South Asian" subject, but that subject has a missing continuous covariate record, so there is no value to plot for them on the y-axis.
cont_cat(df = cov_dat, x = "IARACEN", y = "Y") + rot_x(45)
There is an NA label because of what is happening, too
I compared using ggplot and got this:
ggplot() +
  geom_boxplot(data = cov_dat, aes(x = factor(IARACEN), y = Y)) + rot_x(45)
Now we can see all 9 levels which are present in the dataset, and no value for the "Asian - Central/South Asian" subject who is missing their Y concentration. In pmplots, it should theoretically say "Asian - Central/South Asian N=1", but just have no data in the boxplot. Is the pmplots function doing an na.rm=T somewhere which is dropping this missing record?
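As a toy illustration of the question above, here is a minimal base R sketch with made-up data. It is an assumption about how this kind of label mismatch can arise, not a reading of the pmplots source: a per-level summary that silently drops NA rows ends up with fewer rows than the factor has levels, while counting non-missing values per level keeps every level.

```r
# Toy data (made up): three levels; level "b" has only a missing y value
df <- data.frame(
  x = factor(c("a", "a", "b", "c", "c", "c"), levels = c("a", "b", "c")),
  y = c(1.2, 3.4, NA, 5.6, 7.8, 9.0)
)

# The formula method of aggregate() omits NA rows by default
# (na.action = na.omit), so level "b" vanishes from this summary
counts_dropped <- aggregate(y ~ x, data = df, FUN = length)

# Counting non-missing values per level keeps every level, with 0 for "b"
counts_kept <- tapply(!is.na(df$y), df$x, sum)

print(counts_dropped)
print(counts_kept)
```

If labels built from the shorter summary were matched to the axis by position rather than by level name, they would shift in the way described above.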
@kylebaron if you replace your data2 derivation from data2 <- filter(data, RF != "sev") %>% ys_factors(spec)
to this:
data2 <- data %>% mutate(WT = ifelse(RF == "sev", NA, WT)) %>% ys_factors(spec)
You'll then get:
cont_cat(data2, x = "RF", y = "WT")
which is mostly reproducing my issue
I see ... so there are still records in the data set, but the values are all NA.
Right. Perhaps this isn't a typical use case, so if it isn't worth your time, no worries! I expect the next data cut to add many more races, and I want this EDA script to be reusable when new data comes in. The workaround would be to comment those levels out of the yspec for now.
Ok; I'm reproducing this now. It's definitely related to the data summary that we have to do to get the numbers in the tick labels; you can opt out of that summary and the plot shows as expected. There is an argument for opting out, but it isn't a very public workaround, so let me see if we can do something else.
library(pmplots)
#> Loading required package: ggplot2
library(yspec)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data <- ys_help$data()
spec <- ys_help$spec()
data1 has all levels of RF
data1 <- ys_add_factors(data, spec, .suffix = "")
data2 sets WT to NA for severe renal impairment before adding factors
data2 <-
  data %>%
  mutate(WT = ifelse(RF == "sev", NA, WT)) %>%
  ys_factors(spec)
cont_cat(), all levels
cont_cat(data1, x = "RF", y = "WT")
cont_cat(), missing severe
cont_cat(data2, x = "RF", y = "WT")
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
cont_cat(), missing severe, don’t show number
cont_cat(data2, x = "RF", y = "WT", shown = FALSE)
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
Created on 2024-08-02 with reprex v2.1.1
Yep, looks like it. Okay great, thank you for the detective work. Thinking about it again, I can't comment the levels out in the YAML because this race level will be used in the other EDA plots. As another workaround, I could write a little function to filter out NAs for the DV before plotting.
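A minimal sketch of such a filtering helper, in base R with made-up data. The name drop_missing_y is hypothetical (it is not part of pmplots or yspec), and whether the filtered data then plots as wanted in a given pmplots version would still need to be checked.

```r
# Hypothetical helper (not part of pmplots): drop rows whose y value is NA
drop_missing_y <- function(df, y) {
  stopifnot(is.data.frame(df), y %in% names(df))
  df[!is.na(df[[y]]), , drop = FALSE]
}

# Toy data (made up): both "sev" records have a missing WT
toy <- data.frame(
  ID = 1:5,
  RF = factor(c("norm", "norm", "mild", "sev", "sev"),
              levels = c("norm", "mild", "sev")),
  WT = c(70, 82, 65, NA, NA)
)

toy_complete <- drop_missing_y(toy, "WT")

# The factor keeps all of its levels even though no "sev" rows remain
print(levels(toy_complete$RF))

# With pmplots loaded, the filtered data could then be passed on, e.g.
# cont_cat(drop_missing_y(cov_dat, "Y"), x = "IARACEN", y = "Y")
```

Filtering does not drop the unused factor level, so the result resembles the first example in this thread, where the severe rows were removed before plotting.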
@janellelennie can you try this?
EDIT: see the one below.
Oops .. hang on .
This one
library(pmplots)
#> Loading required package: ggplot2
library(yspec)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data <- ys_help$data()
spec <- ys_help$spec()
data1 has all levels of RF
data1 <- ys_add_factors(data, spec, .suffix = "")
data2 sets WT to NA for severe renal impairment before adding factors
data2 <-
  data %>%
  mutate(WT = ifelse(RF == "sev", NA, WT)) %>%
  ys_factors(spec)
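ff <- function(df, x, y) {
  # Capture the column names as symbols for tidy evaluation
  .xcol <- rlang::sym(x)
  .ycol <- rlang::sym(y)
  # Flag records with a non-missing y value
  df <- mutate(df, notna = !is.na(!!.ycol))
  df <- group_by(df, !!.xcol)
  # n: non-missing records; N: distinct IDs with a non-missing record
  .sum <- summarize(df, n = sum(notna), N = n_distinct(ID[notna]))
  .sum <- ungroup(.sum)
  as.data.frame(.sum)
}
# Replace the box_labels function in the pmplots namespace with ff
assignInNamespace("box_labels", value = ff, "pmplots")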
cont_cat(), all levels
cont_cat(data1, x = "RF", y = "WT")
cont_cat(), missing severe
cont_cat(data2, x = "RF", y = "WT")
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
cont_cat(), missing severe, don’t show number
cont_cat(data2, x = "RF", y = "WT", shown = FALSE)
#> Warning: Removed 360 rows containing non-finite outside the scale range
#> (`stat_boxplot()`).
Created on 2024-08-02 with reprex v2.1.1
You're amazing! I added the ff and assignInNamespace pieces to my script and it works:
So really the interpretation of these plots is that the N's reflect the number of data points in the boxplots, not the number of subjects in that categorical covariate group. There may be subjects with missing Y values in any of the categories, and they are not included in the count for N. So I suppose it's more of a small "n" than a big "N". But since the demographics tables are the anchors for the categorical covariate summaries, this makes sense as-is, because it's reflecting the data in the plot. Just thinking out loud :)
Yeah, I should look at the specifics on that with consistency in mind. It was so long ago that we did this; I should probably revisit it in case anything needs clarification or even changing.
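To make the small "n" versus big "N" distinction concrete, here is a base R sketch with made-up data. The column names n_points, N_plotted, and N_all are hypothetical and chosen only for illustration; they are not pmplots output.

```r
# Toy data (made up): ID 3 has one missing WT, ID 4 has only a missing WT
toy <- data.frame(
  ID = c(1, 1, 2, 3, 3, 4),
  RF = factor(c("norm", "norm", "norm", "mild", "mild", "sev"),
              levels = c("norm", "mild", "sev")),
  WT = c(70, 71, 80, 65, NA, NA)
)

has_y <- !is.na(toy$WT)

# Count distinct IDs per level; split() keeps empty levels, so they count as 0
count_ids <- function(ids, groups) {
  vapply(split(ids, groups), function(id) length(unique(id)), integer(1))
}

summary_by_level <- data.frame(
  RF = levels(toy$RF),
  # number of data points actually drawn in each box
  n_points = unname(vapply(split(has_y, toy$RF), sum, integer(1))),
  # subjects contributing at least one non-missing value
  N_plotted = unname(count_ids(toy$ID[has_y], toy$RF[has_y])),
  # all subjects in the category, as a demographics table would count them
  N_all = unname(count_ids(toy$ID, toy$RF))
)

print(summary_by_level)
```

In this toy case the "sev" level has N_all = 1 but n_points = 0, which is the gap between the plot labels and the demographics table described above.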
Thanks for the report on this and pulling in the workaround. I'll open an issue to make sure this gets fixed.
Have a great weekend, @janellelennie!
Sounds great, thanks for being so quick to respond!! So nice "chatting" with you 😊 Have a great weekend as well!
Hi Kyle!
I'm using yspec and pmplots to make some EDA plots. There is a level in the yspec factors for which there is not (yet) data in the dataset (i.e., N=0 for that level). When plotting with cont_cat, the x-axis of the boxplot is thrown off: the axis is labeled with the levels from yspec, but only the levels with data are plotted. Is there an option that we can set to "show all categorical levels even if N=0"? When I use geom_boxplot instead of cont_cat, all levels do show on the x-axis, including the level with no data.
Thanks, Janelle