openpharma / mmrm

Mixed Models for Repeated Measures (MMRM) in R.
https://openpharma.github.io/mmrm/
Other
132 stars 23 forks source link

Numerical reproducibility issue for mmrm fit #462

Closed tobiasmuetze closed 1 month ago

tobiasmuetze commented 2 months ago

Summary

Different runs of the minimal example from the Package introduction vignette on the same machine can yield different results. The numerical differences between the results are small but could affect downstream calculations. Running the example twice in the same session will give the same results, but different sessions can yield different results. This issue occurs across different operating systems (Windows, Linux, macOS) and hardware. This might be related to issue #246. This is joint work with @luwidmer and @bailliem.

Your brief description of the problem


library(mmrm)
options(digits = 22)

mmrmFit <- mmrm(
  formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data
)
coef(mmrmFit)

# Results  [1] 30.777475482892566    1.5304997707921038    5.643565348078882     0.3260619220403493   
 [5] 3.774230043407513     4.839588454657073     10.342112883246966    15.053898262689723   
 [9] -0.041926250243732976 -0.6936853700125063   0.6242270324680373   

# Results from different session

 [1] 30.77747548289308    1.53049977079208     5.643565348078892    0.32606192204044154  3.7742300434068063  
 [6] 4.83958845465672     10.34211288324622    15.053898262689263   -0.04192625024332983 -0.6936853700115992 
[11] 0.624227032468643 

R session info

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)                                                                                           
Running under: OpenShift Enterprise                                                                                              

Matrix products: default                                                                                                         
BLAS/LAPACK: /CHBS/apps/EB/software/imkl/2022.1.0-gompi-2022a/mkl/2022.1.0/lib/intel64/libmkl_gf_lp64.so.2;  LAPACK version 3.9.0

locale:                                                                                                                          
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C                                                                                     
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8                                                                           
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8                                                                          
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                                                                                        
[9] LC_ADDRESS=C               LC_TELEPHONE=C                                                                                   
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C                                                                              

time zone: Europe/Prague                                                                                                         
tzcode source: system (glibc)                                                                                                    

attached base packages:                                                                                                          
[1] stats     graphics  grDevices utils     datasets  methods   base                                                             

other attached packages:                                                                                                         
[1] mmrm_0.3.11                                                                                                                  

loaded via a namespace (and not attached):                                                                                       
[1] crayon_1.5.2     vctrs_0.6.5      nlme_3.1-165     cli_3.6.2                                                                
[5] rlang_1.1.4      stringi_1.8.4    generics_0.1.3   bit_4.0.5                                                                
[9] glue_1.7.0       backports_1.5.0  TMB_1.9.11       rprojroot_2.0.4                                                          
[13] hms_1.1.3        fansi_1.0.6      grid_4.3.1       tibble_3.2.1                                                             
[17] tzdb_0.4.0       lifecycle_1.0.4  stringr_1.5.1    compiler_4.3.1                                                           
[21] Rcpp_1.0.12      pkgconfig_2.0.3  here_1.0.1       lattice_0.22-6                                                           
[25] R6_2.5.1         tidyselect_1.2.1 readr_2.1.5      utf8_1.2.4                                                               
[29] vroom_1.6.5      pillar_1.9.0     Rdpack_2.6       parallel_4.3.1                                                           
[33] rbibutils_2.2.16 magrittr_2.0.3   checkmate_2.3.1  Matrix_1.6-5                                                             
[37] bit64_4.0.5      tools_4.3.1       

OS / Environment

The outputs above were obtained on:

We have observed the same issue in different configurations (Windows 11 x64 on Ryzen; macOS Sonoma 14.6.1 on aarch64-apple-darwin20). We've also observed the issue also using previous package version (mmrm v0.2.2).

luwidmer commented 2 months ago

I also ran the following example using R -d valgrind, and interestingly, this makes it reproducible. However, the result depends on the size of the free list, which probably should not be the case (note that I computed digest(coef(mmrmFit)) here). Can also confirm the irreproducible results under Rscript on R 4.4.1 / Ubuntu 22.04 / mmrm 0.3.12, R 4.3.1 / Windows 11 / 0.3.12, R 4.4.1 / macOS / mmrm 0.3.12

Example under Rscript:

user@user-VirtualBox:~/Desktop/mmrm$ Rscript mmrm-test.R 
          (Intercept)                BASVAL                VISIT5 
           3.38611063           -0.27951009           -2.00084657 
               VISIT6                VISIT7        THERAPYPLACEBO 
          -2.70645552           -5.18326700           -0.09180645 
        BASVAL:VISIT5         BASVAL:VISIT6         BASVAL:VISIT7 
          -0.03439031           -0.11506873           -0.04678931 
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO 
           1.49501234            2.31644127            2.89357908 
[1] "2d29335948cecae8a49034d195d670c2"
user@user-VirtualBox:~/Desktop/mmrm$ Rscript mmrm-test.R 
          (Intercept)                BASVAL                VISIT5 
           3.38611063           -0.27951009           -2.00084657 
               VISIT6                VISIT7        THERAPYPLACEBO 
          -2.70645552           -5.18326700           -0.09180645 
        BASVAL:VISIT5         BASVAL:VISIT6         BASVAL:VISIT7 
          -0.03439031           -0.11506873           -0.04678931 
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO 
           1.49501234            2.31644127            2.89357908 
[1] "5acb56f26319a79ce756800c4bff6071"

Example under R -d "valgrind --malloc-fill=0x0 --free-fill=0x0"

user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0" --vanilla < mmrm-test.R 
==5173== Memcheck, a memory error detector
==5173== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5173== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5173== Command: /usr/lib/R/bin/exec/R --vanilla
==5173== 

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+                 reml=TRUE)
> coef(mmrmFit)
          (Intercept)                BASVAL                VISIT5 
           3.38611063           -0.27951009           -2.00084657 
               VISIT6                VISIT7        THERAPYPLACEBO 
          -2.70645552           -5.18326700           -0.09180645 
        BASVAL:VISIT5         BASVAL:VISIT6         BASVAL:VISIT7 
          -0.03439031           -0.11506873           -0.04678931 
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO 
           1.49501234            2.31644127            2.89357908 
> print(digest(coef(mmrmFit)))
[1] "e56c10b6917ed660479cd1e6edc37d0f"
> 
==5173== 
==5173== HEAP SUMMARY:
==5173==     in use at exit: 218,650,830 bytes in 47,487 blocks
==5173==   total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5173== 
==5173== LEAK SUMMARY:
==5173==    definitely lost: 0 bytes in 0 blocks
==5173==    indirectly lost: 0 bytes in 0 blocks
==5173==      possibly lost: 0 bytes in 0 blocks
==5173==    still reachable: 218,650,830 bytes in 47,487 blocks
==5173==                       of which reachable via heuristic:
==5173==                         newarray           : 4,264 bytes in 1 blocks
==5173==         suppressed: 0 bytes in 0 blocks
==5173== Rerun with --leak-check=full to see details of leaked memory
==5173== 
==5173== For lists of detected and suppressed errors, rerun with: -s
==5173== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0" --vanilla < mmrm-test.R 
==5190== Memcheck, a memory error detector
==5190== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5190== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5190== Command: /usr/lib/R/bin/exec/R --vanilla
==5190== 

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+                 reml=TRUE)
> coef(mmrmFit)
          (Intercept)                BASVAL                VISIT5 
           3.38611063           -0.27951009           -2.00084657 
               VISIT6                VISIT7        THERAPYPLACEBO 
          -2.70645552           -5.18326700           -0.09180645 
        BASVAL:VISIT5         BASVAL:VISIT6         BASVAL:VISIT7 
          -0.03439031           -0.11506873           -0.04678931 
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO 
           1.49501234            2.31644127            2.89357908 
> print(digest(coef(mmrmFit)))
[1] "e56c10b6917ed660479cd1e6edc37d0f"
> 
==5190== 
==5190== HEAP SUMMARY:
==5190==     in use at exit: 218,650,830 bytes in 47,487 blocks
==5190==   total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5190== 
==5190== LEAK SUMMARY:
==5190==    definitely lost: 0 bytes in 0 blocks
==5190==    indirectly lost: 0 bytes in 0 blocks
==5190==      possibly lost: 0 bytes in 0 blocks
==5190==    still reachable: 218,650,830 bytes in 47,487 blocks
==5190==                       of which reachable via heuristic:
==5190==                         newarray           : 4,264 bytes in 1 blocks
==5190==         suppressed: 0 bytes in 0 blocks
==5190== Rerun with --leak-check=full to see details of leaked memory
==5190== 
==5190== For lists of detected and suppressed errors, rerun with: -s
==5190== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Example under R -d "valgrind --malloc-fill=0x0 --free-fill=0x0 --freelist-vol=1024"

user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0 --freelist-vol=1024" --vanilla < mmrm-test.R 
==5290== Memcheck, a memory error detector
==5290== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5290== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5290== Command: /usr/lib/R/bin/exec/R --vanilla
==5290== 
==5290== Warning: --freelist-big-blocks value 1000000 has no effect
==5290== as it is >= to --freelist-vol value 1024

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+                 reml=TRUE)
> coef(mmrmFit)
          (Intercept)                BASVAL                VISIT5 
           3.38611063           -0.27951009           -2.00084657 
               VISIT6                VISIT7        THERAPYPLACEBO 
          -2.70645552           -5.18326700           -0.09180645 
        BASVAL:VISIT5         BASVAL:VISIT6         BASVAL:VISIT7 
          -0.03439031           -0.11506873           -0.04678931 
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO 
           1.49501234            2.31644127            2.89357908 
> print(digest(coef(mmrmFit)))
[1] "4688b0fad363244c2e9fd97d173fe863"
> 
==5290== 
==5290== HEAP SUMMARY:
==5290==     in use at exit: 218,650,830 bytes in 47,487 blocks
==5290==   total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5290== 
==5290== LEAK SUMMARY:
==5290==    definitely lost: 0 bytes in 0 blocks
==5290==    indirectly lost: 0 bytes in 0 blocks
==5290==      possibly lost: 0 bytes in 0 blocks
==5290==    still reachable: 218,650,830 bytes in 47,487 blocks
==5290==                       of which reachable via heuristic:
==5290==                         newarray           : 4,264 bytes in 1 blocks
==5290==         suppressed: 0 bytes in 0 blocks
==5290== Rerun with --leak-check=full to see details of leaked memory
==5290== 
==5290== For lists of detected and suppressed errors, rerun with: -s
==5290== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0 --freelist-vol=1024" --vanilla < mmrm-test.R 
==5305== Memcheck, a memory error detector
==5305== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5305== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5305== Command: /usr/lib/R/bin/exec/R --vanilla
==5305== 
==5305== Warning: --freelist-big-blocks value 1000000 has no effect
==5305== as it is >= to --freelist-vol value 1024

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+                 reml=TRUE)
> coef(mmrmFit)
          (Intercept)                BASVAL                VISIT5 
           3.38611063           -0.27951009           -2.00084657 
               VISIT6                VISIT7        THERAPYPLACEBO 
          -2.70645552           -5.18326700           -0.09180645 
        BASVAL:VISIT5         BASVAL:VISIT6         BASVAL:VISIT7 
          -0.03439031           -0.11506873           -0.04678931 
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO 
           1.49501234            2.31644127            2.89357908 
> print(digest(coef(mmrmFit)))
[1] "4688b0fad363244c2e9fd97d173fe863"
> 
==5305== 
==5305== HEAP SUMMARY:
==5305==     in use at exit: 218,650,830 bytes in 47,487 blocks
==5305==   total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5305== 
==5305== LEAK SUMMARY:
==5305==    definitely lost: 0 bytes in 0 blocks
==5305==    indirectly lost: 0 bytes in 0 blocks
==5305==      possibly lost: 0 bytes in 0 blocks
==5305==    still reachable: 218,650,830 bytes in 47,487 blocks
==5305==                       of which reachable via heuristic:
==5305==                         newarray           : 4,264 bytes in 1 blocks
==5305==         suppressed: 0 bytes in 0 blocks
==5305== Rerun with --leak-check=full to see details of leaked memory
==5305== 
==5305== For lists of detected and suppressed errors, rerun with: -s
==5305== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
danielinteractive commented 2 months ago

Thanks a lot @tobiasmuetze , @luwidmer and @baillem!

If you have any ideas how we could debug this further and find the root cause (and then ultimately fix it but that is a long way to go I think) that would be great!

bailliem commented 2 months ago

Thanks a lot @tobiasmuetze , @luwidmer and @baillem!

If you have any ideas how we could debug this further and find the root cause (and then ultimately fix it but that is a long way to go I think) that would be great!

Hi @danielinteractive , happy to set up a call to walk you and / or the {mmrm} developers through some of our testing, etc. A lot of the initial debugging was based on our internal cluster so has been a challenge to give you the full details. But happy to screen share and walk you through it. This might help with tracking it down further and potential solutions.

Would that work? If so, who would want to join the meeting?

danielinteractive commented 2 months ago

Thanks a lot @bailliem, sounds great, I think for the beginning I can join the meeting, maybe @clarkliming also has time - you can send me an invite to daniel@rconis.com

luwidmer commented 2 months ago

@danielinteractive @tobiasmuetze @bailliem I did some more digging using this docker image: https://github.com/wch/r-debug and I think I may have found the offending line (unless this is a false positive?) https://github.com/openpharma/mmrm/blob/2de9c871c190dfbec1358b8ef6709c234b2cdad1/src/mmrm.cpp#L104

With valgrind instrumentation in R, I get:

root@6aca8b552b91:/# RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite" --vanilla < test.R
==3954== Memcheck, a memory error detector
==3954== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3954== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==3954== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R --vanilla
==3954== 

R Under development (unstable) (2024-08-16 r87025) -- "Unsuffered Consequences"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(mmrm)
> library(digest)
> options(digits = 22)
> 
> mmrmFit <- mmrm(
+   formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
+   data = fev_data
+ )
==3954== Conditional jump or move depends on uninitialised value(s)
==3954==    at 0x2088BACF: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl_transposed<true, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:598)
==3954==    by 0x20887C56: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:569)
==3954==    by 0x20883B9E: Eigen::internal::Assignment<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug>, Eigen::internal::Dense2Dense, void>::run(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (Solve.h:147)
==3954==    by 0x2088036A: void Eigen::internal::call_assignment_no_alias<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (AssignEvaluator.h:890)
==3954==    by 0x2087B5CB: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>& Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_set_noalias<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:797)
==3954==    by 0x20875D24: void Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_init1<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:883)
==3954==    by 0x20870FDE: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>::Matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&) (Matrix.h:332)
==3954==    by 0x2086D3E0: tmbutils::matrix<TMBad::global::ad_aug>::matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >) (vector.hpp:106)
==3954==    by 0x20869038: objective_function<TMBad::global::ad_aug>::operator()() (mmrm.cpp:104)
==3954==    by 0x207174B9: objective_function<TMBad::global::ad_aug>::evalUserTemplate() (tmb_core.hpp:913)
==3954==    by 0x20978AD6: MakeADFunObject_(SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, int, SEXPREC*&) (tmb_core.hpp:1273)
==3954==    by 0x20978F16: MakeADFunObject (tmb_core.hpp:1411)
==3954==  Uninitialised value was created by a stack allocation
==3954==    at 0x206FC590: ??? (in /usr/local/RDvalgrind/lib/R/site-library/mmrm/libs/mmrm.so)
==3954== 
> coef(mmrmFit)
                  (Intercept) RACEBlack or African American 
   30.77747548289312362612691     1.53049977079201293683752 
                    RACEWhite                     SEXFemale 
    5.64356534807878595927377     0.32606192204036654747057 
                     ARMCDTRT                    AVISITVIS2 
    3.77423004340706746972955     4.83958845465665277174594 
                   AVISITVIS3                    AVISITVIS4 
   10.34211288324631361490447    15.05389826268926611874122 
          ARMCDTRT:AVISITVIS2           ARMCDTRT:AVISITVIS3 
   -0.04192625024339907691129    -0.69368537001187691171111 
          ARMCDTRT:AVISITVIS4 
    0.62422703246844468694121 
> digest(coef(mmrmFit))
[1] "0aadcaf025259096c2a369411fd935bb"
> 
==3954== 
==3954== HEAP SUMMARY:
==3954==     in use at exit: 215,943,732 bytes in 43,628 blocks
==3954==   total heap usage: 144,872 allocs, 101,244 frees, 371,752,067 bytes allocated
==3954== 
==3954== LEAK SUMMARY:
==3954==    definitely lost: 0 bytes in 0 blocks
==3954==    indirectly lost: 0 bytes in 0 blocks
==3954==      possibly lost: 0 bytes in 0 blocks
==3954==    still reachable: 215,943,732 bytes in 43,628 blocks
==3954==                       of which reachable via heuristic:
==3954==                         newarray           : 4,264 bytes in 1 blocks
==3954==         suppressed: 0 bytes in 0 blocks
==3954== Reachable blocks (those to which a pointer was found) are not shown.
==3954== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3954== 
==3954== For lists of detected and suppressed errors, rerun with: -s
==3954== ERROR SUMMARY: 22 errors from 1 contexts (suppressed: 0 from 0)
danielinteractive commented 2 months ago

That is great progress, thanks @luwidmer !

Now I just need to figure out if there is a "more" correct way to use Eigen here (e.g. looking at https://eigen.tuxfamily.org/dox-devel/classEigen_1_1LDLT.html#a8747c1a709736527fd76785459fc827e I am not so sure), or to understand whether this is a problem inside Eigen.

Looking at https://eigen.tuxfamily.org/dox/LDLT_8h_source.html and the call stack from valgrind, my first guess would be that the problem is in the line if(abs(vecD(i)) > tolerance) (here on line 595, but seems in the code version used in the valgrind run everything is shifted by 3 lines) because either (I think likely) vecD has not been initialized by the construction in const typename Diagonal<const MatrixType>::RealReturnType vecD(vectorD()); or (less likely I would think) because tolerance has not been initialized by RealScalar tolerance = (std::numeric_limits<RealScalar>::min)(); (matching the valgrind reported problem "Conditional jump or move depends on uninitialised value(s)")

@ggael Apologies for the question out of nowhere, but any idea if something could be wrong in Eigen here? Or whether I am using Eigen wrongly such that it uses uninitialized values?

luwidmer commented 2 months ago

@danielinteractive I added some debug print statements in https://github.com/luwidmer/mmrm/tree/issue-mmrm-valgrind (in src/mmrm.cpp) before that line, outputting XtWX and XtWY. It seems the standard calls are fine, it's just the one with TMB in the middle (I presume for the gradients?) that has the valgrind uninitialized memory issue. So probably TMB in conjunction with Eigen, not Eigen itself? See below for an example output:

root@6aca8b552b91:/# RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite"
==5702== Memcheck, a memory error detector
==5702== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5702== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5702== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R
==5702== 

R Under development (unstable) (2024-08-16 r87025) -- "Unsuffered Consequences"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(mmrm)
options(digits = 22)

mmrmFit <- mmrm(
  formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data
)
Value of XtWX : 537   0   0   0   0   0   0   0   0   0   0
196 196   0   0   0   0   0   0   0   0   0
141   0 141   0   0   0   0   0   0   0   0
288  94  73 288   0   0   0   0   0   0   0
262  76  87 146 262   0   0   0   0   0   0
140  50  36  75  71 140   0   0   0   0   0
129  42  36  71  58   0 129   0   0   0   0
134  53  37  70  67   0   0 134   0   0   0
 71  20  22  41  71  71   0   0  71   0   0
 58  13  21  35  58   0  58   0   0  58   0
 67  23  23  36  67   0   0  67   0   0  67
Value of XtWY : 22712.9
 8019.4
6608.92
12214.5
  11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Value of XtWX : 537   0   0   0   0   0   0   0   0   0   0
196 196   0   0   0   0   0   0   0   0   0
141   0 141   0   0   0   0   0   0   0   0
288  94  73 288   0   0   0   0   0   0   0
262  76  87 146 262   0   0   0   0   0   0
140  50  36  75  71 140   0   0   0   0   0
129  42  36  71  58   0 129   0   0   0   0
134  53  37  70  67   0   0 134   0   0   0
 71  20  22  41  71  71   0   0  71   0   0
 58  13  21  35  58   0  58   0   0  58   0
 67  23  23  36  67   0   0  67   0   0  67
Value of XtWY : 22712.9
 8019.4
6608.92
12214.5
  11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Value of XtWX : 537   0   0   0   0   0   0   0   0   0   0
196 196   0   0   0   0   0   0   0   0   0
141   0 141   0   0   0   0   0   0   0   0
288  94  73 288   0   0   0   0   0   0   0
262  76  87 146 262   0   0   0   0   0   0
140  50  36  75  71 140   0   0   0   0   0
129  42  36  71  58   0 129   0   0   0   0
134  53  37  70  67   0   0 134   0   0   0
 71  20  22  41  71  71   0   0  71   0   0
 58  13  21  35  58   0  58   0   0  58   0
 67  23  23  36  67   0   0  67   0   0  67
Value of XtWY : 22712.9
 8019.4
6608.92
12214.5
  11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Value of XtWX :                                         {value=537, index=39817, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=196, index=39818, tape=0x12f85ef0}                                         {value=196, index=39821, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=141, index=39289, tape=0x12f85ef0}                                         {const=0}                                         {value=141, index=39296, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=288, index=39290, tape=0x12f85ef0}                                         {value=94, index=37962, tape=0x12f85ef0}                                         {value=73, index=39297, tape=0x12f85ef0}                                         {value=288, index=39303, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=262, index=39646, tape=0x12f85ef0}                                         {value=76, index=39652, tape=0x12f85ef0}                                         {value=87, index=39298, tape=0x12f85ef0}                                         {value=146, index=39304, tape=0x12f85ef0}                                         {value=262, index=39657, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=140, index=39819, tape=0x12f85ef0}                                         {value=50, index=39822, tape=0x12f85ef0}                                         {value=36, index=39299, tape=0x12f85ef0}                                         {value=75, index=39305, tape=0x12f85ef0}                                         {value=71, index=39658, tape=0x12f85ef0}                                         {value=140, index=39824, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=129, index=39293, tape=0x12f85ef0}                                         {value=42, index=39059, tape=0x12f85ef0}                                         {value=36, index=39300, tape=0x12f85ef0}                                         {value=71, index=39306, tape=0x12f85ef0}                                         {value=58, index=39311, tape=0x12f85ef0}                                         {value=-0, index=39315, tape=0x12f85ef0}                                         {value=129, index=39318, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=134, index=39820, tape=0x12f85ef0}                                         {value=53, index=39823, tape=0x12f85ef0}                                         {value=37, index=38604, tape=0x12f85ef0}                                         {value=70, index=38612, tape=0x12f85ef0}                                         {value=67, index=39659, tape=0x12f85ef0}                                         {value=0, index=39825, tape=0x12f85ef0}                                         {value=-0, index=39065, tape=0x12f85ef0}                                         {value=134, index=39826, tape=0x12f85ef0}                                         {const=0}                                         {const=0}                                         {const=0}
                                        {value=71, index=39649, tape=0x12f85ef0}                                         {value=20, index=39655, tape=0x12f85ef0}                                         {value=22, index=39301, tape=0x12f85ef0}                                         {value=41, index=39307, tape=0x12f85ef0}                                         {value=71, index=39660, tape=0x12f85ef0}                                         {value=71, index=39664, tape=0x12f85ef0}                                         {value=-0, index=39319, tape=0x12f85ef0}                                         {value=0, index=39667, tape=0x12f85ef0}                                         {value=71, index=39669, tape=0x12f85ef0}                                         {const=0}                                         {const=0}
                                        {value=58, index=39295, tape=0x12f85ef0}                                         {value=13, index=37968, tape=0x12f85ef0}                                         {value=21, index=39302, tape=0x12f85ef0}                                         {value=35, index=39308, tape=0x12f85ef0}                                         {value=58, index=39313, tape=0x12f85ef0}                                         {value=-0, index=39317, tape=0x12f85ef0}                                         {value=58, index=39320, tape=0x12f85ef0}                                         {value=-0, index=38636, tape=0x12f85ef0}                                         {value=-0, index=39322, tape=0x12f85ef0}                                         {value=58, index=39323, tape=0x12f85ef0}                                         {const=0}
                                        {value=67, index=39650, tape=0x12f85ef0}                                         {value=23, index=39656, tape=0x12f85ef0}                                         {value=23, index=38607, tape=0x12f85ef0}                                         {value=36, index=38615, tape=0x12f85ef0}                                         {value=67, index=39661, tape=0x12f85ef0}                                         {value=0, index=39665, tape=0x12f85ef0}                                         {value=-0, index=38633, tape=0x12f85ef0}                                         {value=67, index=39668, tape=0x12f85ef0}                                         {value=0, index=39670, tape=0x12f85ef0}                                         {value=-0, index=38642, tape=0x12f85ef0}                                         {value=67, index=39671, tape=0x12f85ef0}
Value of XtWY :                                             {value=22712.9, index=39841, tape=0x12f85ef0}
                                            {value=8019.4, index=39842, tape=0x12f85ef0}
                                            {value=6608.92, index=39345, tape=0x12f85ef0}
                                            {value=12214.5, index=39346, tape=0x12f85ef0}
                                            {value=11646, index=39697, tape=0x12f85ef0}
                                            {value=5567.14, index=39843, tape=0x12f85ef0}
                                            {value=5784.75, index=39349, tape=0x12f85ef0}
                                            {value=6723.86, index=39844, tape=0x12f85ef0}
                                            {value=2976.85, index=39700, tape=0x12f85ef0}
                                            {value=2718.02, index=39351, tape=0x12f85ef0}
                                            {value=3523.72, index=39701, tape=0x12f85ef0}
==5702== Conditional jump or move depends on uninitialised value(s)
==5702==    at 0x2048D5EF: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl_transposed<true, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:598)
==5702==    by 0x20489776: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:569)
==5702==    by 0x204856BE: Eigen::internal::Assignment<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug>, Eigen::internal::Dense2Dense, void>::run(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (Solve.h:147)
==5702==    by 0x20481E8A: void Eigen::internal::call_assignment_no_alias<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (AssignEvaluator.h:890)
==5702==    by 0x2047D0DB: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>& Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_set_noalias<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:797)
==5702==    by 0x20477826: void Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_init1<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:883)
==5702==    by 0x20472AC2: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>::Matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&) (Matrix.h:332)
==5702==    by 0x2046E104: tmbutils::matrix<TMBad::global::ad_aug>::matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >) (vector.hpp:106)
==5702==    by 0x20469548: objective_function<TMBad::global::ad_aug>::operator()() (mmrm.cpp:106)
==5702==    by 0x20317619: objective_function<TMBad::global::ad_aug>::evalUserTemplate() (tmb_core.hpp:913)
==5702==    by 0x2057A5D8: MakeADFunObject_(SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, int, SEXPREC*&) (tmb_core.hpp:1273)
==5702==    by 0x2057AA18: MakeADFunObject (tmb_core.hpp:1411)
==5702==  Uninitialised value was created by a stack allocation
==5702==    at 0x202FC6E0: ??? (in /usr/local/RDvalgrind/lib/R/site-library/mmrm/libs/mmrm.so)
==5702== 
Value of XtWX :    14.4658          0          0          0          0          0          0          0          0          0          0
   5.05963    5.05963          0          0          0          0          0          0          0          0          0
   3.93859          0    3.93859          0          0          0          0          0          0          0          0
   7.82615    2.44471    2.07845    7.82615          0          0          0          0          0          0          0
   6.80339    1.80354     2.3626    3.91608    6.80339          0          0          0          0          0          0
   4.03578    1.45565      1.046    2.17689    2.04173    6.20055          0          0          0          0          0
   7.88694    2.57777    2.21301    4.33629    3.52782  -0.442188    8.95922          0          0          0          0
   1.00498   0.387613   0.286653    0.52976   0.494276  -0.183797  0.0262879     1.4578          0          0          0
   2.04173   0.585012   0.639344    1.18372    2.04173    3.15502  -0.219541 -0.0931101    3.15502          0          0
   3.52782   0.794072    1.28587    2.13015    3.52782  -0.219541    4.03424   0.015466  -0.219541    4.03424          0
  0.494276   0.169217   0.172796   0.268239   0.494276 -0.0931101   0.015466   0.730506 -0.0931101   0.015466   0.730506
Value of XtWY : 617.424
 210.14
184.496
335.669
304.022
163.388
362.037
56.9573
86.9934
169.172
29.5552
Value of XtWX : 537   0   0   0   0   0   0   0   0   0   0
196 196   0   0   0   0   0   0   0   0   0
141   0 141   0   0   0   0   0   0   0   0
288  94  73 288   0   0   0   0   0   0   0
262  76  87 146 262   0   0   0   0   0   0
140  50  36  75  71 140   0   0   0   0   0
129  42  36  71  58   0 129   0   0   0   0
134  53  37  70  67   0   0 134   0   0   0
 71  20  22  41  71  71   0   0  71   0   0
 58  13  21  35  58   0  58   0   0  58   0
 67  23  23  36  67   0   0  67   0   0  67
Value of XtWY : 22712.9
 8019.4
6608.92
12214.5
  11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
danielinteractive commented 2 months ago

Thanks a lot @luwidmer ! Interesting. I think we might need to step through the code execution interactively (R and maybe also gdb debuggers) to find the problematic point where something is not yet initialized. I am downloading Docker and the image now as well and we can talk tomorrow about the details. Thanks again!

danielinteractive commented 2 months ago

I now confirmed (after about 5 hours of work) that indeed RealScalar tolerance = (std::numeric_limits<RealScalar>::min)(); is the problem, somehow in an interplay between the RealScalar type used by TMB (which is TMBad::global::ad_aug) and the standard template library this causes an uninitialized value. Hardcoding this to not be used as I did in https://github.com/danielinteractive/RcppEigen/tree/mem-debug then the memory issue disappears. That is not the solution of course, therefore I will need more time to

(fyi @kaskr @gowerc)

luwidmer commented 2 months ago

Edited to due user error (my bad!).

@danielinteractive @kaskr I just installed the master branch (05f36a7) from kaskr/adcomp (cloned that branch of the repo, changed the version number and the Makefile to use RDvalgrind as the command for R, then make install), and recompiled mmrm (install.packages from your favorite mirror), and now the use of uninitialized values is indeed fixed:

root@6aca8b552b91:/# RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite" --vanilla < test.R
==15269== Memcheck, a memory error detector
==15269== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==15269== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==15269== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R --vanilla
==15269== 

R Under development (unstable) (2024-08-16 r87025) -- "Unsuffered Consequences"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(mmrm)
> library(digest)
> options(digits = 22)
> 
> mmrmFit <- mmrm(
+   formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
+   data = fev_data
+ )
> coef(mmrmFit)
                  (Intercept) RACEBlack or African American 
   30.77747548289312362612691     1.53049977079201293683752 
                    RACEWhite                     SEXFemale 
    5.64356534807878595927377     0.32606192204036654747057 
                     ARMCDTRT                    AVISITVIS2 
    3.77423004340706746972955     4.83958845465665277174594 
                   AVISITVIS3                    AVISITVIS4 
   10.34211288324631361490447    15.05389826268926611874122 
          ARMCDTRT:AVISITVIS2           ARMCDTRT:AVISITVIS3 
   -0.04192625024339907691129    -0.69368537001187691171111 
          ARMCDTRT:AVISITVIS4 
    0.62422703246844468694121 
> digest(coef(mmrmFit))
[1] "0aadcaf025259096c2a369411fd935bb"
> 
==15269== 
==15269== HEAP SUMMARY:
==15269==     in use at exit: 215,981,441 bytes in 43,641 blocks
==15269==   total heap usage: 145,056 allocs, 101,415 frees, 372,654,234 bytes allocated
==15269== 
==15269== LEAK SUMMARY:
==15269==    definitely lost: 0 bytes in 0 blocks
==15269==    indirectly lost: 0 bytes in 0 blocks
==15269==      possibly lost: 0 bytes in 0 blocks
==15269==    still reachable: 215,981,441 bytes in 43,641 blocks
==15269==                       of which reachable via heuristic:
==15269==                         newarray           : 4,264 bytes in 1 blocks
==15269==         suppressed: 0 bytes in 0 blocks
==15269== Reachable blocks (those to which a pointer was found) are not shown.
==15269== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==15269== 
==15269== For lists of detected and suppressed errors, rerun with: -s
==15269== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
danielinteractive commented 2 months ago

@luwidmer , yep, I just confirmed here on my end too that the current master branch of TMB fixes the problem regarding the memory issue. Can you maybe double check with one or two spot check runs that this also fixes the reproducibility issue?

luwidmer commented 2 months ago

@danielinteractive will fire it up on the compute cluster

luwidmer commented 2 months ago

Just running the example above in a loop (running Rscript) on my Mac before:

Output ``` > source("looptest.R") [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bed32859801b034984cc36978d0114e3" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "5dc9f3459f51ce0e634c0d795a8f264d" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "bbd8cdc636f5f168679847560dcc2d14" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" ```

And after:

Output ``` > source("looptest.R") [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" [1] "903104ec4747c253d773e48e6ab96b32" ```
luwidmer commented 2 months ago

@danielinteractive I'm afraid I have some bad news. I re-ran this example in a loop:

i <- 0
while(TRUE) {
  i <- i + 1
  system("Rscript mmrm-patch-tmb.R")
  print(i)
  Sys.sleep(0.05)
}

And only reported the hash of the output if it's different:

library(mmrm)
library(digest)
options(digits = 22)

mmrmFit <- mmrm(
  formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
  data = fev_data
)

res <- digest(coef(mmrmFit))

if (res != "903104ec4747c253d773e48e6ab96b32") {
  print(res)
}

This yielded

[1] 810
[1] 811
[1] 812
[1] 813
[1] "9d3fdf45b3ef227cb27af3871a0ca0f7"
[1] 814
[1] 815
[1] 816

So there seems to be roughly 1 in 800 chance of getting a different result after applying the patch (much better than before, but still not fully deterministic).

danielinteractive commented 2 months ago

@luwidmer Interesting, how was this rate before, much higher? And if you run this in RDvalgrind does it show any memory issues?

luwidmer commented 2 months ago

@danielinteractive it was much higher before, see https://github.com/openpharma/mmrm/issues/462#issuecomment-2324394363 (unfold the code blocks). Unfortunately no more memory issues detected, though this could still be a race condition or gc-related issue I guess?

danielinteractive commented 2 months ago

I see, ok, so really quite a nice improvement. But of course we want 100% reproducibility 😄

How could it be gc-related? I mean you run this in a new R session each time?

And if you run instead of Rscript the RDvalgrind equivalent? and save the log? maybe it could show something for this rare case?

luwidmer commented 2 months ago

Memory pressure on the system might be different at different times, though I agree this is unlikely to be the cause. I wonder if RDcsan with the thread sanitizer built in might find something, but if that doesn't fly we could indeed try RDvalgrind in a loop until the problem manifests...?

danielinteractive commented 2 months ago

I would expect that memory pressure should not change the results, or is that a known problem in other cases with R? Like running a linear model fit will always give the same result or not?

Yeah maybe you can run this other sanitizer...? Or RDvalgrind with even more sensitive checks...?

danielinteractive commented 2 months ago

@luwidmer I also wonder whether this is just on one platform now or whether you also see this still on other platforms? If you see this on another platform, I can also try to reproduce this here on my laptop.

Just not sure what would be next ... somehow it would be cool to be able to "fix" the expected internal computation results and then immediately signal when one of them is not the same as before... but I guess that is not really possible.

luwidmer commented 2 months ago

@danielinteractive I see this both on RHEL (R 4.3.1, x64, GCC) and MacOS (R 4.4.1, ARM64, AppleClang), so it is definitely still cross-platform (and cross-compiler + cross-BLAS)

kaskr commented 2 months ago

Sorry, didn't see this new problem until now. It's most likely due to the TMB tape optimizer not being completely deterministic. To verify, you can disable the tape optimizer by:

TMB::config(optimize.instantly=0, DLL="mmrm")

There's an internal flag to make the tape optimizer deterministic at some extra computational cost. I'll add this option to TMB sometime soon and probably change the default setting.

luwidmer commented 2 months ago

Sorry, didn't see this new problem until now. It's most likely due to the TMB tape optimizer not being completely deterministic. To verify, you can disable the tape optimizer by:

TMB::config(optimize.instantly=0, DLL="mmrm")

There's an internal flag to make the tape optimizer deterministic at some extra computational cost. I'll add this option to TMB sometime soon and probably change the default setting.

Added this after library(mmrm), and could not get it to be non-deterministic in 2400+ runs! @kaskr you're a genius!

danielinteractive commented 2 months ago

Thanks a lot @kaskr and @luwidmer !

Should we wait for you to change the default in TMB? Or should we from our package side add this option call in the meantime? (Because our users definitely want the reproducible results by default)

luwidmer commented 2 months ago

@danielinteractive should there also be a warning if this option is turned on (and a unit test for that), so if it's on, users actually get a notification? (same for TMB versions that use uninitialized memory?)

luwidmer commented 2 months ago

Also, I'm pretty sure we've tackled #246 with this as well :)

danielinteractive commented 2 months ago

Yeah good idea, we could definitely add warnings for these 2 things (TMB version, TMB option)

kaskr commented 2 months ago

I've changed the default in TMB to ensure numerically reproducible results. An option is added to control() the behavior, but that is mainly for testing and might be removed at some point.

danielinteractive commented 2 months ago

Thanks a lot @kaskr! When do you plan the next CRAN release of TMB? (no pressure, just curious)

kaskr commented 2 months ago

@danielinteractive I'm thinking maybe in a couple of days if things seem to work.

luwidmer commented 2 months ago

@kaskr, happy to report that after some more extensive testing the release from https://github.com/kaskr/adcomp/commit/1130f2efb488a708834e4bf618a4e1a4a208ae9c still works fine and deterministically with mmrm for me :)

kaskr commented 2 months ago

@luwidmer Thanks, it is now published on CRAN.

luwidmer commented 2 months ago

Superb, many thanks @kaskr <3 !

gowerc commented 1 month ago

@danielinteractive / @luwidmer - In @kaskr's message he mentioned:

There's an internal flag to make the tape optimizer deterministic at some extra computational cost.

I'm curious as to if either of you looked into what the performance costs of this change was ? I appreciate the code should be deterministic by default but if the performance cost is significant is it worth exposing this as an option for users ? (needless to say if the performance cost is negligible then please ignore me 😄 )

luwidmer commented 1 month ago

@gowerc I didn't do an apples-to-apples comparison, but upgrading from mmrm 0.2.2 to the current version while simultaneously upgrading TMB to 1.9.15 increased speed by factor 2 for the above example (mean time over 200 runs went from 78.5 to 39 ms). Probably would be a case for a microbenchmark with matched versions if you really want to know @gowerc ?

danielinteractive commented 1 month ago

I did not do a speed comparison at all so far on this, prioritizing reliability. However, the TMB option is a user option, so users could change this from their side if they are ok to sacrifice reliability (and they will get a warning from mmrm now)

luwidmer commented 1 month ago

With 0.2.2 and current TMB, I get a 1 ms average difference (very slightly slower). Also agree with @danielinteractive , reliability before performance.

luwidmer commented 1 month ago

@danielinteractive, I guess a new mmrm release on CRAN would also cause recompilation of mmrm with the new TMB? Do you know if CRAN updates reverse dependency binaries when their version does not change?

danielinteractive commented 1 month ago

@luwidmer yeah exactly, therefore I am keen to have soon a new CRAN release. I don't think the binaries will be updated otherwise.

luwidmer commented 1 month ago

You can probably close #246 as well with that release 😄