Closed kkmann closed 7 months ago
For simulated example datasets, I wonder if we could include the code as specialized functions rather than saving the datasets themselves in the package.
We can do both (recommended by r-pkgs). It would be good to have a standard dataset for all examples etc. (historical data + new) and to include the script that generates it; see https://r-pkgs.org/data.html#sec-data-data-raw.
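A minimal sketch of what "both" could look like, assuming a hypothetical generator function `simulate_example_data()` (the name, arguments, AR(1) correlation, and all default values are illustrative assumptions, not anything decided in this thread). The function would live in the package, and a script under `data-raw/` would call it once to save the standard dataset.

```r
# Hypothetical helper: simulate a small two-arm longitudinal dataset.
# All names and default parameter values are illustrative assumptions.
simulate_example_data <- function(n_subjects = 50, n_visits = 4,
                                  rho = 0.6, sigma = 2, seed = 1) {
  set.seed(seed)
  visits <- seq_len(n_visits)
  # AR(1) correlation between visits within a subject
  corr <- rho^abs(outer(visits, visits, "-"))
  chol_cov <- chol(sigma^2 * corr)
  # One row of correlated residuals per subject
  errors <- matrix(rnorm(n_subjects * n_visits), nrow = n_subjects) %*% chol_cov
  treatment <- rep(c("control", "active"), length.out = n_subjects)
  # Mean response: common slope over visits plus a treatment effect growing with time
  mean_response <- outer(rep(1, n_subjects), 0.5 * visits) +
    outer(as.numeric(treatment == "active"), 0.3 * visits)
  response <- mean_response + errors
  # as.vector() stacks the matrix column by column, i.e. visit by visit
  data.frame(
    subject = factor(rep(seq_len(n_subjects), times = n_visits)),
    treatment = factor(rep(treatment, times = n_visits),
                       levels = c("control", "active")),
    visit = rep(visits, each = n_subjects),
    response = as.vector(response)
  )
}

example_data <- simulate_example_data()
head(example_data)

# In data-raw/example_data.R one could then save the dataset to the package:
# usethis::use_data(example_data, overwrite = TRUE)
```

Fixing the seed inside the generator keeps the saved dataset reproducible from the script, which is the point of the `data-raw/` convention.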
We should link this with #10 to make sure it covers our scope.
In the following, I list a few datasets that are all available from other R packages (on CRAN) and that may serve as exemplary datasets to fit MMRMs. (An MMRM probably isn't the ideal model everywhere but would at least have some justification.)
As discussed this week, I will further investigate what it would take to obtain data from TransCelerate (i.e. control group data) for our purposes.
Orthodont - a dataset that contains measurements of an orthodontic distance (mm) at four different ages (8, 10, 12, and 14 years) for 27 children.
> library(nlme)
> help(Orthodont)
> data(Orthodont)
> head(Orthodont)
Grouped Data: distance ~ age | Subject
distance age Subject Sex
1 26.0 8 M01 Male
2 25.0 10 M01 Male
3 29.0 12 M01 Male
4 31.0 14 M01 Male
5 21.5 8 M02 Male
6 22.5 10 M02 Male
> dim(Orthodont)
[1] 108 4
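As a rough illustration of an MMRM-style fit on this dataset, here is a sketch that uses `nlme::gls()` with an unstructured within-subject correlation and visit-specific variances. This is a stand-in chosen because nlme ships with R; it is not the brms.mmrm or mmrm interface, and the model formula (sex by visit interaction) and the helper column `visit_index` are assumptions made for the example.

```r
library(nlme)

# Plain data frame copy with age treated as a categorical visit
dat <- as.data.frame(Orthodont)
dat$visit <- factor(dat$age)
# Integer visit position (1 to 4), used to index the correlation matrix
dat$visit_index <- as.integer(dat$visit)

fit <- gls(
  distance ~ Sex * visit,
  data = dat,
  # Unstructured correlation between the four visits within each child
  correlation = corSymm(form = ~ visit_index | Subject),
  # Separate residual variance per visit
  weights = varIdent(form = ~ 1 | visit_index),
  method = "REML"
)

coef(fit)
```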
sleepstudy - a dataset that contains longitudinal data on the reaction times of subjects in a sleep deprivation study.
> library(lme4)
> help(sleepstudy)
> data(sleepstudy)
> head(sleepstudy)
Reaction Days Subject
1 249.5600 0 308
2 258.7047 1 308
3 250.8006 2 308
4 321.4398 3 308
5 356.8519 4 308
6 414.6901 5 308
> dim(sleepstudy)
[1] 180 3
dietox - a dataset that contains measurements of the weight of pigs at different time points (weeks) for 72 pigs.
> library(geepack)
> help(dietox)
> data(dietox)
> head(dietox)
Pig Evit Cu Litter Start Weight Feed Time
1 4601 Evit000 Cu000 1 26.5 26.50000 NA 1
2 4601 Evit000 Cu000 1 26.5 27.59999 5.200005 2
3 4601 Evit000 Cu000 1 26.5 36.50000 17.600000 3
4 4601 Evit000 Cu000 1 26.5 40.29999 28.500000 4
5 4601 Evit000 Cu000 1 26.5 49.09998 45.200001 5
6 4601 Evit000 Cu000 1 26.5 55.39999 56.900002 6
> dim(dietox)
[1] 861 8
polyps - a dataset of an RCT of Sulindac for Polyp Prevention in Familial Adenomatous Polyposis
> library(medicaldata)
> help(polyps)
> data(polyps) # not in long format
> head(polyps)
participant_id sex age baseline treatment number3m number12m
1 001 female 17 7 sulindac 6 NA
2 002 female 20 77 placebo 67 63
3 003 male 16 7 sulindac 4 2
4 004 female 18 5 placebo 5 28
5 005 male 22 23 sulindac 16 17
6 006 female 13 35 placebo 31 61
> dim(polyps)
[1] 22 7
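Since polyps is in wide format, it would need reshaping before fitting a repeated-measures model. A minimal sketch with base R `reshape()`, using only the six rows printed above (typed in by hand so the example runs without medicaldata); the long-format column names `month` and `number` are assumptions.

```r
# The six rows shown by head(polyps), re-entered by hand
polyps_head <- data.frame(
  participant_id = c("001", "002", "003", "004", "005", "006"),
  sex = c("female", "female", "male", "female", "male", "female"),
  age = c(17, 20, 16, 18, 22, 13),
  baseline = c(7, 77, 7, 5, 23, 35),
  treatment = c("sulindac", "placebo", "sulindac", "placebo", "sulindac", "placebo"),
  number3m = c(6, 67, 4, 5, 16, 31),
  number12m = c(NA, 63, 2, 28, 17, 61)
)

# One row per participant and follow-up month
polyps_long <- reshape(
  polyps_head,
  direction = "long",
  idvar = "participant_id",
  varying = c("number3m", "number12m"),
  v.names = "number",
  timevar = "month",
  times = c(3, 12)
)
polyps_long <- polyps_long[order(polyps_long$participant_id, polyps_long$month), ]
rownames(polyps_long) <- NULL
polyps_long
```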
cdystonia - a dataset from a multicenter RCT of botulinum toxin type B (BotB) in patients with cervical dystonia from nine U.S. sites
Description of dataset | The dataset was also used in this book: Statistical Methods for the Analysis of Repeated Measurements
> library(Hmisc)
> getHdata(cdystonia)
> head(cdystonia)
week site id treat age sex twstrs
1 1 0 1 1 5000U 65 F 32
1 2 2 1 1 5000U 65 F 30
1 3 4 1 1 5000U 65 F 24
1 4 8 1 1 5000U 65 F 37
1 5 12 1 1 5000U 65 F 39
1 6 16 1 1 5000U 65 F 36
> dim(cdystonia)
[1] 631 7
BtheB - a dataset from a clinical trial of an interactive multimedia program called "Beat the Blues"
See also this vignette | The dataset was also used in this book: Handbook of Statistical Analyses Using R (chapter: Analyzing Longitudinal Data I)
> library(HSAUR3)
> help(BtheB)
> data(BtheB) # in wide format
> BtheB$subject <- factor(rownames(BtheB))
> nobs <- nrow(BtheB)
> BtheB_long <- reshape(
+ BtheB, idvar = "subject",
+ varying = c("bdi.2m", "bdi.3m", "bdi.5m", "bdi.8m"),
+ direction = "long")
> BtheB_long$time <- rep(c(2, 3, 5, 8), rep(nobs, 4))
> subset(BtheB_long, subject %in% c("1", "2", "3"))
drug length treatment bdi.pre subject time bdi
1.2m No >6m TAU 29 1 2 2
2.2m Yes >6m BtheB 32 2 2 16
3.2m Yes <6m TAU 25 3 2 20
1.3m No >6m TAU 29 1 3 2
2.3m Yes >6m BtheB 32 2 3 24
3.3m Yes <6m TAU 25 3 3 NA
1.5m No >6m TAU 29 1 5 NA
2.5m Yes >6m BtheB 32 2 5 17
3.5m Yes <6m TAU 25 3 5 NA
1.8m No >6m TAU 29 1 8 NA
2.8m Yes >6m BtheB 32 2 8 20
3.8m Yes <6m TAU 25 3 8 NA
> summary(BtheB_long)
drug length treatment bdi.pre subject time bdi
No :224 <6m:196 TAU :192 Min. : 2.00 1 : 4 Min. :2.00 Min. : 0.00
Yes:176 >6m:204 BtheB:208 1st Qu.:15.00 10 : 4 1st Qu.:2.75 1st Qu.: 6.00
Median :22.00 100 : 4 Median :4.00 Median :12.50
Mean :23.33 11 : 4 Mean :4.50 Mean :14.43
3rd Qu.:30.25 12 : 4 3rd Qu.:5.75 3rd Qu.:21.00
Max. :49.00 13 : 4 Max. :8.00 Max. :53.00
(Other):376 NA's :120
> dim(BtheB_long)
[1] 400 7
The endpoint variable is bdi (there is a considerable fraction of missing data: 120 of the 400 values, i.e. 30%, according to the summary above).
Simulation of spirometric measurements
I was looking for publicly available data from (pharmaceutical) clinical trials, specifically in pulmonology, since repeated spirometric measurements (continuous variables) are often used as endpoints in this field and are typically analyzed using MMRMs. I found this paper by Santermans et al. (2019). They developed a simulation approach for the endpoint "forced vital capacity" (FVC) based on clinical trial summaries and home spirometry data, with R code given in a supplement. At first glance it looks like a promising and flexible way to generate realistic data for testing purposes and exemplary analyses, e.g. for vignettes. I'll try to get this running.
What a treasure trove you found, @chstock! Thank you so much for finding these datasets! I think this is a promising start for a collection of analyses comparing brms.mmrm to mmrm in terms of fixed effect parameter estimates, covariance/correlation estimates, and associated uncertainties.
I think we may have talked about whether these analyses should be vignettes, unit tests, or part of an external test suite. I am not sure I recall what we decided.
I think this part of the effort would be a good opportunity for someone on our team, either you or @kkmann, @yonicd, @andrew-bean, @danleibovitz, etc., to test drive the package and see what you think.
These look great! One other possibility is the artificial FEV-1 dataset that is built into the mmrm package. https://openpharma.github.io/mmrm/latest-tag/reference/fev_data.html
Thanks, @wlandau! I agree, this should be something to start with. I am happy to work on this further, also on the brms.mmrm and mmrm comparison. If anyone else could start it very soon, please don't hesitate. I'd like to explore the spirometry data simulation first, because the data we may get there could be very close to phase II or III data in pharma (and there is even some connection to my work at BI).
I used the code from Santermans et al. (2019) pretty much without any modifications and obtained the desired dataset (code and output). Of note, I think the parameters are realistic and well justified in the paper. There is some interesting functionality w.r.t. missing data mechanisms. I still think it could be a nice basis for example data, or even for a simulation.
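To make the point about missing data mechanisms concrete, here is a small sketch of monotone dropout under MCAR versus MAR. This is not the code from Santermans et al.; `add_dropout()`, its arguments, the dropout rule, and the toy FVC-like values are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not from Santermans et al.): impose monotone dropout on a
# complete long-format dataset with columns subject, visit, and response.
add_dropout <- function(data, mechanism = c("MCAR", "MAR"),
                        base_rate = 0.1, seed = 1) {
  mechanism <- match.arg(mechanism)
  set.seed(seed)
  data <- data[order(data$subject, data$visit), ]
  center <- mean(data$response)
  spread <- sd(data$response)
  pieces <- lapply(split(data, data$subject), function(d) {
    dropped <- FALSE
    # The first visit is always observed; dropout can start at visit 2
    for (i in seq_len(nrow(d))[-1]) {
      if (!dropped) {
        p <- base_rate
        if (mechanism == "MAR") {
          # Illustrative rule: dropout is more likely after a low observed value
          z <- (d$response[i - 1] - center) / spread
          p <- plogis(qlogis(base_rate) - z)
        }
        dropped <- runif(1) < p
      }
      # Once a subject has dropped out, all later visits are missing
      if (dropped) d$response[i] <- NA
    }
    d
  })
  do.call(rbind, pieces)
}

# Toy complete data: 100 subjects, 5 visits, arbitrary FVC-like values
set.seed(42)
complete <- data.frame(
  subject = rep(1:100, each = 5),
  visit = rep(1:5, times = 100),
  response = rnorm(500, mean = 3, sd = 0.5)
)

with_mcar <- add_dropout(complete, "MCAR")
with_mar <- add_dropout(complete, "MAR")

# Proportion of missing responses per visit under each mechanism
tapply(is.na(with_mcar$response), with_mcar$visit, mean)
tapply(is.na(with_mar$response), with_mar$visit, mean)
```

Under MCAR the dropout probability is constant, whereas under MAR it depends only on the previously observed response, which is the distinction that matters when checking how an MMRM handles incomplete data.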
I received some initial feedback regarding TransCelerate/Vivli and will be able to report in our next meeting.
Sounds great, thanks for investigating.
Thanks everyone! It seems like we have plenty of datasets to choose from, and the clinidata. Happy to reopen if more datasets are needed for #12.
One dataset enabling
data set containing studies with same/similar intervention and different ones/only control.
simulated + save to pkg.