Closed tobiasmuetze closed 1 month ago
I also ran the following example using R -d valgrind
, and interestingly, this makes it reproducible. However, the result depends on the size of the free list, which probably should not be the case (note that I computed digest(coef(mmrmFit)) here). Can also confirm the irreproducible results under Rscript on R 4.4.1 / Ubuntu 22.04 / mmrm 0.3.12, R 4.3.1 / Windows 11 / 0.3.12, R 4.4.1 / macOS / mmrm 0.3.12
Example under Rscript
:
user@user-VirtualBox:~/Desktop/mmrm$ Rscript mmrm-test.R
(Intercept) BASVAL VISIT5
3.38611063 -0.27951009 -2.00084657
VISIT6 VISIT7 THERAPYPLACEBO
-2.70645552 -5.18326700 -0.09180645
BASVAL:VISIT5 BASVAL:VISIT6 BASVAL:VISIT7
-0.03439031 -0.11506873 -0.04678931
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO
1.49501234 2.31644127 2.89357908
[1] "2d29335948cecae8a49034d195d670c2"
user@user-VirtualBox:~/Desktop/mmrm$ Rscript mmrm-test.R
(Intercept) BASVAL VISIT5
3.38611063 -0.27951009 -2.00084657
VISIT6 VISIT7 THERAPYPLACEBO
-2.70645552 -5.18326700 -0.09180645
BASVAL:VISIT5 BASVAL:VISIT6 BASVAL:VISIT7
-0.03439031 -0.11506873 -0.04678931
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO
1.49501234 2.31644127 2.89357908
[1] "5acb56f26319a79ce756800c4bff6071"
Example under R -d "valgrind --malloc-fill=0x0 --free-fill=0x0"
user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0" --vanilla < mmrm-test.R
==5173== Memcheck, a memory error detector
==5173== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5173== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5173== Command: /usr/lib/R/bin/exec/R --vanilla
==5173==
R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+ reml=TRUE)
> coef(mmrmFit)
(Intercept) BASVAL VISIT5
3.38611063 -0.27951009 -2.00084657
VISIT6 VISIT7 THERAPYPLACEBO
-2.70645552 -5.18326700 -0.09180645
BASVAL:VISIT5 BASVAL:VISIT6 BASVAL:VISIT7
-0.03439031 -0.11506873 -0.04678931
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO
1.49501234 2.31644127 2.89357908
> print(digest(coef(mmrmFit)))
[1] "e56c10b6917ed660479cd1e6edc37d0f"
>
==5173==
==5173== HEAP SUMMARY:
==5173== in use at exit: 218,650,830 bytes in 47,487 blocks
==5173== total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5173==
==5173== LEAK SUMMARY:
==5173== definitely lost: 0 bytes in 0 blocks
==5173== indirectly lost: 0 bytes in 0 blocks
==5173== possibly lost: 0 bytes in 0 blocks
==5173== still reachable: 218,650,830 bytes in 47,487 blocks
==5173== of which reachable via heuristic:
==5173== newarray : 4,264 bytes in 1 blocks
==5173== suppressed: 0 bytes in 0 blocks
==5173== Rerun with --leak-check=full to see details of leaked memory
==5173==
==5173== For lists of detected and suppressed errors, rerun with: -s
==5173== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0" --vanilla < mmrm-test.R
==5190== Memcheck, a memory error detector
==5190== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5190== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5190== Command: /usr/lib/R/bin/exec/R --vanilla
==5190==
R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+ reml=TRUE)
> coef(mmrmFit)
(Intercept) BASVAL VISIT5
3.38611063 -0.27951009 -2.00084657
VISIT6 VISIT7 THERAPYPLACEBO
-2.70645552 -5.18326700 -0.09180645
BASVAL:VISIT5 BASVAL:VISIT6 BASVAL:VISIT7
-0.03439031 -0.11506873 -0.04678931
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO
1.49501234 2.31644127 2.89357908
> print(digest(coef(mmrmFit)))
[1] "e56c10b6917ed660479cd1e6edc37d0f"
>
==5190==
==5190== HEAP SUMMARY:
==5190== in use at exit: 218,650,830 bytes in 47,487 blocks
==5190== total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5190==
==5190== LEAK SUMMARY:
==5190== definitely lost: 0 bytes in 0 blocks
==5190== indirectly lost: 0 bytes in 0 blocks
==5190== possibly lost: 0 bytes in 0 blocks
==5190== still reachable: 218,650,830 bytes in 47,487 blocks
==5190== of which reachable via heuristic:
==5190== newarray : 4,264 bytes in 1 blocks
==5190== suppressed: 0 bytes in 0 blocks
==5190== Rerun with --leak-check=full to see details of leaked memory
==5190==
==5190== For lists of detected and suppressed errors, rerun with: -s
==5190== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Example under R -d "valgrind --malloc-fill=0x0 --free-fill=0x0 --freelist-vol=1024"
user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0 --freelist-vol=1024" --vanilla < mmrm-test.R
==5290== Memcheck, a memory error detector
==5290== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5290== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5290== Command: /usr/lib/R/bin/exec/R --vanilla
==5290==
==5290== Warning: --freelist-big-blocks value 1000000 has no effect
==5290== as it is >= to --freelist-vol value 1024
R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+ reml=TRUE)
> coef(mmrmFit)
(Intercept) BASVAL VISIT5
3.38611063 -0.27951009 -2.00084657
VISIT6 VISIT7 THERAPYPLACEBO
-2.70645552 -5.18326700 -0.09180645
BASVAL:VISIT5 BASVAL:VISIT6 BASVAL:VISIT7
-0.03439031 -0.11506873 -0.04678931
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO
1.49501234 2.31644127 2.89357908
> print(digest(coef(mmrmFit)))
[1] "4688b0fad363244c2e9fd97d173fe863"
>
==5290==
==5290== HEAP SUMMARY:
==5290== in use at exit: 218,650,830 bytes in 47,487 blocks
==5290== total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5290==
==5290== LEAK SUMMARY:
==5290== definitely lost: 0 bytes in 0 blocks
==5290== indirectly lost: 0 bytes in 0 blocks
==5290== possibly lost: 0 bytes in 0 blocks
==5290== still reachable: 218,650,830 bytes in 47,487 blocks
==5290== of which reachable via heuristic:
==5290== newarray : 4,264 bytes in 1 blocks
==5290== suppressed: 0 bytes in 0 blocks
==5290== Rerun with --leak-check=full to see details of leaked memory
==5290==
==5290== For lists of detected and suppressed errors, rerun with: -s
==5290== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
user@user-VirtualBox:~/Desktop/mmrm$ R -d "valgrind --malloc-fill=0x0 --free-fill=0x0 --freelist-vol=1024" --vanilla < mmrm-test.R
==5305== Memcheck, a memory error detector
==5305== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5305== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5305== Command: /usr/lib/R/bin/exec/R --vanilla
==5305==
==5305== Warning: --freelist-big-blocks value 1000000 has no effect
==5305== as it is >= to --freelist-vol value 1024
R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(rbmi)
> library(mmrm)
> library(digest)
> mmrmFit <- mmrm(CHANGE~BASVAL*VISIT+THERAPY*VISIT+us(VISIT|PATIENT), data=antidepressant_data,
+ reml=TRUE)
> coef(mmrmFit)
(Intercept) BASVAL VISIT5
3.38611063 -0.27951009 -2.00084657
VISIT6 VISIT7 THERAPYPLACEBO
-2.70645552 -5.18326700 -0.09180645
BASVAL:VISIT5 BASVAL:VISIT6 BASVAL:VISIT7
-0.03439031 -0.11506873 -0.04678931
VISIT5:THERAPYPLACEBO VISIT6:THERAPYPLACEBO VISIT7:THERAPYPLACEBO
1.49501234 2.31644127 2.89357908
> print(digest(coef(mmrmFit)))
[1] "4688b0fad363244c2e9fd97d173fe863"
>
==5305==
==5305== HEAP SUMMARY:
==5305== in use at exit: 218,650,830 bytes in 47,487 blocks
==5305== total heap usage: 6,922,908 allocs, 6,875,421 frees, 755,780,985 bytes allocated
==5305==
==5305== LEAK SUMMARY:
==5305== definitely lost: 0 bytes in 0 blocks
==5305== indirectly lost: 0 bytes in 0 blocks
==5305== possibly lost: 0 bytes in 0 blocks
==5305== still reachable: 218,650,830 bytes in 47,487 blocks
==5305== of which reachable via heuristic:
==5305== newarray : 4,264 bytes in 1 blocks
==5305== suppressed: 0 bytes in 0 blocks
==5305== Rerun with --leak-check=full to see details of leaked memory
==5305==
==5305== For lists of detected and suppressed errors, rerun with: -s
==5305== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Thanks a lot @tobiasmuetze , @luwidmer and @baillem!
If you have any ideas how we could debug this further and find the root cause (and then ultimately fix it but that is a long way to go I think) that would be great!
Thanks a lot @tobiasmuetze , @luwidmer and @baillem!
If you have any ideas how we could debug this further and find the root cause (and then ultimately fix it but that is a long way to go I think) that would be great!
Hi @danielinteractive , happy to set up a call to walk you and / or the {mmrm} developers through some of our testing, etc. A lot of the initial debugging was based on our internal cluster so has been a challenge to give you the full details. But happy to screen share and walk you through it. This might help with tracking it down further and potential solutions.
Would that work? If so, who would want to join the meeting?
Thanks a lot @bailliem, sounds great, I think for the beginning I can join the meeting, maybe @clarkliming also has time - you can send me an invite to daniel@rconis.com
@danielinteractive @tobiasmuetze @bailliem I did some more digging using this docker image: https://github.com/wch/r-debug and I think I may have found the offending line (unless this is a false positive?) https://github.com/openpharma/mmrm/blob/2de9c871c190dfbec1358b8ef6709c234b2cdad1/src/mmrm.cpp#L104
With valgrind instrumentation in R, I get:
root@6aca8b552b91:/# RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite" --vanilla < test.R
==3954== Memcheck, a memory error detector
==3954== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==3954== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==3954== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R --vanilla
==3954==
R Under development (unstable) (2024-08-16 r87025) -- "Unsuffered Consequences"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(mmrm)
> library(digest)
> options(digits = 22)
>
> mmrmFit <- mmrm(
+ formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
+ data = fev_data
+ )
==3954== Conditional jump or move depends on uninitialised value(s)
==3954== at 0x2088BACF: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl_transposed<true, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:598)
==3954== by 0x20887C56: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:569)
==3954== by 0x20883B9E: Eigen::internal::Assignment<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug>, Eigen::internal::Dense2Dense, void>::run(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (Solve.h:147)
==3954== by 0x2088036A: void Eigen::internal::call_assignment_no_alias<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (AssignEvaluator.h:890)
==3954== by 0x2087B5CB: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>& Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_set_noalias<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:797)
==3954== by 0x20875D24: void Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_init1<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:883)
==3954== by 0x20870FDE: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>::Matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&) (Matrix.h:332)
==3954== by 0x2086D3E0: tmbutils::matrix<TMBad::global::ad_aug>::matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >) (vector.hpp:106)
==3954== by 0x20869038: objective_function<TMBad::global::ad_aug>::operator()() (mmrm.cpp:104)
==3954== by 0x207174B9: objective_function<TMBad::global::ad_aug>::evalUserTemplate() (tmb_core.hpp:913)
==3954== by 0x20978AD6: MakeADFunObject_(SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, int, SEXPREC*&) (tmb_core.hpp:1273)
==3954== by 0x20978F16: MakeADFunObject (tmb_core.hpp:1411)
==3954== Uninitialised value was created by a stack allocation
==3954== at 0x206FC590: ??? (in /usr/local/RDvalgrind/lib/R/site-library/mmrm/libs/mmrm.so)
==3954==
> coef(mmrmFit)
(Intercept) RACEBlack or African American
30.77747548289312362612691 1.53049977079201293683752
RACEWhite SEXFemale
5.64356534807878595927377 0.32606192204036654747057
ARMCDTRT AVISITVIS2
3.77423004340706746972955 4.83958845465665277174594
AVISITVIS3 AVISITVIS4
10.34211288324631361490447 15.05389826268926611874122
ARMCDTRT:AVISITVIS2 ARMCDTRT:AVISITVIS3
-0.04192625024339907691129 -0.69368537001187691171111
ARMCDTRT:AVISITVIS4
0.62422703246844468694121
> digest(coef(mmrmFit))
[1] "0aadcaf025259096c2a369411fd935bb"
>
==3954==
==3954== HEAP SUMMARY:
==3954== in use at exit: 215,943,732 bytes in 43,628 blocks
==3954== total heap usage: 144,872 allocs, 101,244 frees, 371,752,067 bytes allocated
==3954==
==3954== LEAK SUMMARY:
==3954== definitely lost: 0 bytes in 0 blocks
==3954== indirectly lost: 0 bytes in 0 blocks
==3954== possibly lost: 0 bytes in 0 blocks
==3954== still reachable: 215,943,732 bytes in 43,628 blocks
==3954== of which reachable via heuristic:
==3954== newarray : 4,264 bytes in 1 blocks
==3954== suppressed: 0 bytes in 0 blocks
==3954== Reachable blocks (those to which a pointer was found) are not shown.
==3954== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3954==
==3954== For lists of detected and suppressed errors, rerun with: -s
==3954== ERROR SUMMARY: 22 errors from 1 contexts (suppressed: 0 from 0)
That is great progress, thanks @luwidmer !
Now I just need to figure out if there is a "more" correct way to use Eigen here (e.g. looking at https://eigen.tuxfamily.org/dox-devel/classEigen_1_1LDLT.html#a8747c1a709736527fd76785459fc827e I am not so sure), or to understand whether this is a problem inside Eigen.
Looking at https://eigen.tuxfamily.org/dox/LDLT_8h_source.html and the call stack from valgrind, my first guess would be that the problem is in the line if(abs(vecD(i)) > tolerance)
(here on line 595, but seems in the code version used in the valgrind run everything is shifted by 3 lines) because either (I think likely) vecD
has not been initialized by the construction in const typename Diagonal<const MatrixType>::RealReturnType vecD(vectorD());
or (less likely I would think) because tolerance
has not been initialized by RealScalar tolerance = (std::numeric_limits<RealScalar>::min)();
(matching the valgrind reported problem "Conditional jump or move depends on uninitialised value(s)")
@ggael Apologies for the question out of nowhere, but any idea if something could be wrong in Eigen here? Or whether I am using Eigen wrongly such that it uses uninitialized values?
@danielinteractive I added some debug print statements in https://github.com/luwidmer/mmrm/tree/issue-mmrm-valgrind (in src/mmrm.cpp) before that line, outputting XtWX and XtWY. It seems the standard calls are fine, it's just the one with TMB in the middle (I presume for the gradients?) that has the valgrind uninitialized memory issue. So probably TMB in conjunction with Eigen, not Eigen itself? See below for an example output:
root@6aca8b552b91:/# RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite"
==5702== Memcheck, a memory error detector
==5702== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==5702== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==5702== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R
==5702==
R Under development (unstable) (2024-08-16 r87025) -- "Unsuffered Consequences"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(mmrm)
options(digits = 22)
mmrmFit <- mmrm(
formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
data = fev_data
)
Value of XtWX : 537 0 0 0 0 0 0 0 0 0 0
196 196 0 0 0 0 0 0 0 0 0
141 0 141 0 0 0 0 0 0 0 0
288 94 73 288 0 0 0 0 0 0 0
262 76 87 146 262 0 0 0 0 0 0
140 50 36 75 71 140 0 0 0 0 0
129 42 36 71 58 0 129 0 0 0 0
134 53 37 70 67 0 0 134 0 0 0
71 20 22 41 71 71 0 0 71 0 0
58 13 21 35 58 0 58 0 0 58 0
67 23 23 36 67 0 0 67 0 0 67
Value of XtWY : 22712.9
8019.4
6608.92
12214.5
11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Value of XtWX : 537 0 0 0 0 0 0 0 0 0 0
196 196 0 0 0 0 0 0 0 0 0
141 0 141 0 0 0 0 0 0 0 0
288 94 73 288 0 0 0 0 0 0 0
262 76 87 146 262 0 0 0 0 0 0
140 50 36 75 71 140 0 0 0 0 0
129 42 36 71 58 0 129 0 0 0 0
134 53 37 70 67 0 0 134 0 0 0
71 20 22 41 71 71 0 0 71 0 0
58 13 21 35 58 0 58 0 0 58 0
67 23 23 36 67 0 0 67 0 0 67
Value of XtWY : 22712.9
8019.4
6608.92
12214.5
11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Value of XtWX : 537 0 0 0 0 0 0 0 0 0 0
196 196 0 0 0 0 0 0 0 0 0
141 0 141 0 0 0 0 0 0 0 0
288 94 73 288 0 0 0 0 0 0 0
262 76 87 146 262 0 0 0 0 0 0
140 50 36 75 71 140 0 0 0 0 0
129 42 36 71 58 0 129 0 0 0 0
134 53 37 70 67 0 0 134 0 0 0
71 20 22 41 71 71 0 0 71 0 0
58 13 21 35 58 0 58 0 0 58 0
67 23 23 36 67 0 0 67 0 0 67
Value of XtWY : 22712.9
8019.4
6608.92
12214.5
11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Value of XtWX : {value=537, index=39817, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0}
{value=196, index=39818, tape=0x12f85ef0} {value=196, index=39821, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0}
{value=141, index=39289, tape=0x12f85ef0} {const=0} {value=141, index=39296, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0}
{value=288, index=39290, tape=0x12f85ef0} {value=94, index=37962, tape=0x12f85ef0} {value=73, index=39297, tape=0x12f85ef0} {value=288, index=39303, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0}
{value=262, index=39646, tape=0x12f85ef0} {value=76, index=39652, tape=0x12f85ef0} {value=87, index=39298, tape=0x12f85ef0} {value=146, index=39304, tape=0x12f85ef0} {value=262, index=39657, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0} {const=0} {const=0}
{value=140, index=39819, tape=0x12f85ef0} {value=50, index=39822, tape=0x12f85ef0} {value=36, index=39299, tape=0x12f85ef0} {value=75, index=39305, tape=0x12f85ef0} {value=71, index=39658, tape=0x12f85ef0} {value=140, index=39824, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0} {const=0}
{value=129, index=39293, tape=0x12f85ef0} {value=42, index=39059, tape=0x12f85ef0} {value=36, index=39300, tape=0x12f85ef0} {value=71, index=39306, tape=0x12f85ef0} {value=58, index=39311, tape=0x12f85ef0} {value=-0, index=39315, tape=0x12f85ef0} {value=129, index=39318, tape=0x12f85ef0} {const=0} {const=0} {const=0} {const=0}
{value=134, index=39820, tape=0x12f85ef0} {value=53, index=39823, tape=0x12f85ef0} {value=37, index=38604, tape=0x12f85ef0} {value=70, index=38612, tape=0x12f85ef0} {value=67, index=39659, tape=0x12f85ef0} {value=0, index=39825, tape=0x12f85ef0} {value=-0, index=39065, tape=0x12f85ef0} {value=134, index=39826, tape=0x12f85ef0} {const=0} {const=0} {const=0}
{value=71, index=39649, tape=0x12f85ef0} {value=20, index=39655, tape=0x12f85ef0} {value=22, index=39301, tape=0x12f85ef0} {value=41, index=39307, tape=0x12f85ef0} {value=71, index=39660, tape=0x12f85ef0} {value=71, index=39664, tape=0x12f85ef0} {value=-0, index=39319, tape=0x12f85ef0} {value=0, index=39667, tape=0x12f85ef0} {value=71, index=39669, tape=0x12f85ef0} {const=0} {const=0}
{value=58, index=39295, tape=0x12f85ef0} {value=13, index=37968, tape=0x12f85ef0} {value=21, index=39302, tape=0x12f85ef0} {value=35, index=39308, tape=0x12f85ef0} {value=58, index=39313, tape=0x12f85ef0} {value=-0, index=39317, tape=0x12f85ef0} {value=58, index=39320, tape=0x12f85ef0} {value=-0, index=38636, tape=0x12f85ef0} {value=-0, index=39322, tape=0x12f85ef0} {value=58, index=39323, tape=0x12f85ef0} {const=0}
{value=67, index=39650, tape=0x12f85ef0} {value=23, index=39656, tape=0x12f85ef0} {value=23, index=38607, tape=0x12f85ef0} {value=36, index=38615, tape=0x12f85ef0} {value=67, index=39661, tape=0x12f85ef0} {value=0, index=39665, tape=0x12f85ef0} {value=-0, index=38633, tape=0x12f85ef0} {value=67, index=39668, tape=0x12f85ef0} {value=0, index=39670, tape=0x12f85ef0} {value=-0, index=38642, tape=0x12f85ef0} {value=67, index=39671, tape=0x12f85ef0}
Value of XtWY : {value=22712.9, index=39841, tape=0x12f85ef0}
{value=8019.4, index=39842, tape=0x12f85ef0}
{value=6608.92, index=39345, tape=0x12f85ef0}
{value=12214.5, index=39346, tape=0x12f85ef0}
{value=11646, index=39697, tape=0x12f85ef0}
{value=5567.14, index=39843, tape=0x12f85ef0}
{value=5784.75, index=39349, tape=0x12f85ef0}
{value=6723.86, index=39844, tape=0x12f85ef0}
{value=2976.85, index=39700, tape=0x12f85ef0}
{value=2718.02, index=39351, tape=0x12f85ef0}
{value=3523.72, index=39701, tape=0x12f85ef0}
==5702== Conditional jump or move depends on uninitialised value(s)
==5702== at 0x2048D5EF: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl_transposed<true, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:598)
==5702== by 0x20489776: void Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>::_solve_impl<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> const&, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&) const (LDLT.h:569)
==5702== by 0x204856BE: Eigen::internal::Assignment<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug>, Eigen::internal::Dense2Dense, void>::run(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (Solve.h:147)
==5702== by 0x20481E8A: void Eigen::internal::call_assignment_no_alias<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> >(Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>&, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&, Eigen::internal::assign_op<TMBad::global::ad_aug, TMBad::global::ad_aug> const&) (AssignEvaluator.h:890)
==5702== by 0x2047D0DB: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>& Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_set_noalias<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:797)
==5702== by 0x20477826: void Eigen::PlainObjectBase<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >::_init1<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >, Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::DenseBase<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > > const&) (PlainObjectBase.h:883)
==5702== by 0x20472AC2: Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>::Matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > const&) (Matrix.h:332)
==5702== by 0x2046E104: tmbutils::matrix<TMBad::global::ad_aug>::matrix<Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> > >(Eigen::Solve<Eigen::LDLT<Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1>, 1>, Eigen::Matrix<TMBad::global::ad_aug, -1, -1, 0, -1, -1> >) (vector.hpp:106)
==5702== by 0x20469548: objective_function<TMBad::global::ad_aug>::operator()() (mmrm.cpp:106)
==5702== by 0x20317619: objective_function<TMBad::global::ad_aug>::evalUserTemplate() (tmb_core.hpp:913)
==5702== by 0x2057A5D8: MakeADFunObject_(SEXPREC*, SEXPREC*, SEXPREC*, SEXPREC*, int, SEXPREC*&) (tmb_core.hpp:1273)
==5702== by 0x2057AA18: MakeADFunObject (tmb_core.hpp:1411)
==5702== Uninitialised value was created by a stack allocation
==5702== at 0x202FC6E0: ??? (in /usr/local/RDvalgrind/lib/R/site-library/mmrm/libs/mmrm.so)
==5702==
Value of XtWX : 14.4658 0 0 0 0 0 0 0 0 0 0
5.05963 5.05963 0 0 0 0 0 0 0 0 0
3.93859 0 3.93859 0 0 0 0 0 0 0 0
7.82615 2.44471 2.07845 7.82615 0 0 0 0 0 0 0
6.80339 1.80354 2.3626 3.91608 6.80339 0 0 0 0 0 0
4.03578 1.45565 1.046 2.17689 2.04173 6.20055 0 0 0 0 0
7.88694 2.57777 2.21301 4.33629 3.52782 -0.442188 8.95922 0 0 0 0
1.00498 0.387613 0.286653 0.52976 0.494276 -0.183797 0.0262879 1.4578 0 0 0
2.04173 0.585012 0.639344 1.18372 2.04173 3.15502 -0.219541 -0.0931101 3.15502 0 0
3.52782 0.794072 1.28587 2.13015 3.52782 -0.219541 4.03424 0.015466 -0.219541 4.03424 0
0.494276 0.169217 0.172796 0.268239 0.494276 -0.0931101 0.015466 0.730506 -0.0931101 0.015466 0.730506
Value of XtWY : 617.424
210.14
184.496
335.669
304.022
163.388
362.037
56.9573
86.9934
169.172
29.5552
Value of XtWX : 537 0 0 0 0 0 0 0 0 0 0
196 196 0 0 0 0 0 0 0 0 0
141 0 141 0 0 0 0 0 0 0 0
288 94 73 288 0 0 0 0 0 0 0
262 76 87 146 262 0 0 0 0 0 0
140 50 36 75 71 140 0 0 0 0 0
129 42 36 71 58 0 129 0 0 0 0
134 53 37 70 67 0 0 134 0 0 0
71 20 22 41 71 71 0 0 71 0 0
58 13 21 35 58 0 58 0 0 58 0
67 23 23 36 67 0 0 67 0 0 67
Value of XtWY : 22712.9
8019.4
6608.92
12214.5
11646
5567.14
5784.75
6723.86
2976.85
2718.02
3523.72
Thanks a lot @luwidmer ! Interesting. I think we might need to step through the code execution interactively (R and maybe also gdb debuggers) to find the problematic point where something is not yet initialized. I am downloading Docker and the image now as well and we can talk tomorrow about the details. Thanks again!
I now confirmed (after about 5 hours of work) that indeed
RealScalar tolerance = (std::numeric_limits<RealScalar>::min)();
is the problem, somehow in an interplay between the RealScalar
type used by TMB
(which is TMBad::global::ad_aug
) and the standard template library this causes an uninitialized value. Hardcoding this to not be used as I did in https://github.com/danielinteractive/RcppEigen/tree/mem-debug then the memory issue disappears.
That is not the solution of course, therefore I will need more time to
TMB
that shows this problemTMB
team to fix it, likely need to complete currently missing specializations or so inside the TMBad::global::ad_aug
.(fyi @kaskr @gowerc)
Edited to due user error (my bad!).
@danielinteractive @kaskr I just installed the master branch (05f36a7) from kaskr/adcomp (cloned that branch of the repo, changed the version number and the Makefile
to use RDvalgrind
as the command for R
, then make install
), and recompiled mmrm (install.packages from your favorite mirror), and now the use of uninitialized values is indeed fixed:
root@6aca8b552b91:/# RDvalgrind -d "valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=definite" --vanilla < test.R
==15269== Memcheck, a memory error detector
==15269== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==15269== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==15269== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R --vanilla
==15269==
R Under development (unstable) (2024-08-16 r87025) -- "Unsuffered Consequences"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(mmrm)
> library(digest)
> options(digits = 22)
>
> mmrmFit <- mmrm(
+ formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
+ data = fev_data
+ )
> coef(mmrmFit)
(Intercept) RACEBlack or African American
30.77747548289312362612691 1.53049977079201293683752
RACEWhite SEXFemale
5.64356534807878595927377 0.32606192204036654747057
ARMCDTRT AVISITVIS2
3.77423004340706746972955 4.83958845465665277174594
AVISITVIS3 AVISITVIS4
10.34211288324631361490447 15.05389826268926611874122
ARMCDTRT:AVISITVIS2 ARMCDTRT:AVISITVIS3
-0.04192625024339907691129 -0.69368537001187691171111
ARMCDTRT:AVISITVIS4
0.62422703246844468694121
> digest(coef(mmrmFit))
[1] "0aadcaf025259096c2a369411fd935bb"
>
==15269==
==15269== HEAP SUMMARY:
==15269== in use at exit: 215,981,441 bytes in 43,641 blocks
==15269== total heap usage: 145,056 allocs, 101,415 frees, 372,654,234 bytes allocated
==15269==
==15269== LEAK SUMMARY:
==15269== definitely lost: 0 bytes in 0 blocks
==15269== indirectly lost: 0 bytes in 0 blocks
==15269== possibly lost: 0 bytes in 0 blocks
==15269== still reachable: 215,981,441 bytes in 43,641 blocks
==15269== of which reachable via heuristic:
==15269== newarray : 4,264 bytes in 1 blocks
==15269== suppressed: 0 bytes in 0 blocks
==15269== Reachable blocks (those to which a pointer was found) are not shown.
==15269== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==15269==
==15269== For lists of detected and suppressed errors, rerun with: -s
==15269== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
@luwidmer , yep, I just confirmed here on my end too that the current master branch of TMB
fixes the problem regarding the memory issue.
Can you maybe double check with one or two spot check runs that this also fixes the reproducibility issue?
@danielinteractive will fire it up on the compute cluster
Just running the example above in a loop (running Rscript) on my Mac before:
And after:
@danielinteractive I'm afraid I have some bad news. I re-ran this example in a loop:
i <- 0
while(TRUE) {
i <- i + 1
system("Rscript mmrm-patch-tmb.R")
print(i)
Sys.sleep(0.05)
}
And only reported the hash of the output if it's different:
library(mmrm)
library(digest)
options(digits = 22)
mmrmFit <- mmrm(
formula = FEV1 ~ RACE + SEX + ARMCD * AVISIT + us(AVISIT | USUBJID),
data = fev_data
)
res <- digest(coef(mmrmFit))
if (res != "903104ec4747c253d773e48e6ab96b32") {
print(res)
}
This yielded
[1] 810
[1] 811
[1] 812
[1] 813
[1] "9d3fdf45b3ef227cb27af3871a0ca0f7"
[1] 814
[1] 815
[1] 816
So there seems to be roughly 1 in 800 chance of getting a different result after applying the patch (much better than before, but still not fully deterministic).
@luwidmer Interesting, how was this rate before, much higher?
And if you run this in RDvalgrind
does it show any memory issues?
@danielinteractive it was much higher before, see https://github.com/openpharma/mmrm/issues/462#issuecomment-2324394363 (unfold the code blocks). Unfortunately no more memory issues detected, though this could still be a race condition or gc-related issue I guess?
I see, ok, so really quite a nice improvement. But of course we want 100% reproducibility 😄
How could it be gc-related? I mean you run this in a new R session each time?
And if you run instead of Rscript
the RDvalgrind
equivalent? and save the log? maybe it could show something for this rare case?
Memory pressure on the system might be different at different times, though I agree this is unlikely to be the cause. I wonder if RDcsan with the thread sanitizer built in might find something, but if that doesn't fly we could indeed try RDvalgrind in a loop until the problem manifests...?
I would expect that memory pressure should not change the results, or is that a known problem in other cases with R? Like running a linear model fit will always give the same result or not?
Yeah maybe you can run this other sanitizer...? Or RDvalgrind with even more sensitive checks...?
@luwidmer I also wonder whether this is just on one platform now or whether you also see this still on other platforms? If you see this on another platform, I can also try to reproduce this here on my laptop.
Just not sure what would be next ... somehow it would be cool to be able to "fix" the expected internal computation results and then immediately signal when one of them is not the same as before... but I guess that is not really possible.
@danielinteractive I see this both on RHEL (R 4.3.1, x64, GCC) and MacOS (R 4.4.1, ARM64, AppleClang), so it is definitely still cross-platform (and cross-compiler + cross-BLAS)
Sorry, didn't see this new problem until now. It's most likely due to the TMB tape optimizer not being completely deterministic. To verify, you can disable the tape optimizer by:
TMB::config(optimize.instantly=0, DLL="mmrm")
There's an internal flag to make the tape optimizer deterministic at some extra computational cost. I'll add this option to TMB sometime soon and probably change the default setting.
Sorry, didn't see this new problem until now. It's most likely due to the TMB tape optimizer not being completely deterministic. To verify, you can disable the tape optimizer by:
TMB::config(optimize.instantly=0, DLL="mmrm")
There's an internal flag to make the tape optimizer deterministic at some extra computational cost. I'll add this option to TMB sometime soon and probably change the default setting.
Added this after library(mmrm)
, and could not get it to be non-deterministic in 2400+ runs! @kaskr you're a genius!
Thanks a lot @kaskr and @luwidmer !
Should we wait for you to change the default in TMB? Or should we from our package side add this option call in the meantime? (Because our users definitely want the reproducible results by default)
@danielinteractive should there also be a warning if this option is turned on (and a unit test for that), so if it's on, users actually get a notification? (same for TMB versions that use uninitialized memory?)
Also, I'm pretty sure we've tackled #246 with this as well :)
Yeah good idea, we could definitely add warnings for these 2 things (TMB version, TMB option)
I've changed the default in TMB to ensure numerically reproducible results. An option is added to control()
the behavior, but that is mainly for testing and might be removed at some point.
Thanks a lot @kaskr! When do you plan the next CRAN release of TMB? (no pressure, just curious)
@danielinteractive I'm thinking maybe in a couple of days if things seem to work.
@kaskr, happy to report that after some more extensive testing the release from https://github.com/kaskr/adcomp/commit/1130f2efb488a708834e4bf618a4e1a4a208ae9c still works fine and deterministically with mmrm for me :)
@luwidmer Thanks, it is now published on CRAN.
Superb, many thanks @kaskr <3 !
@danielinteractive / @luwidmer - In @kaskr's message he mentioned:
There's an internal flag to make the tape optimizer deterministic at some extra computational cost.
I'm curious as to if either of you looked into what the performance costs of this change was ? I appreciate the code should be deterministic by default but if the performance cost is significant is it worth exposing this as an option for users ? (needless to say if the performance cost is negligible then please ignore me 😄 )
@gowerc I didn't do an apples-to-apples comparison, but upgrading from mmrm 0.2.2 to the current version while simultaneously upgrading TMB to 1.9.15 increased speed by factor 2 for the above example (mean time over 200 runs went from 78.5 to 39 ms). Probably would be a case for a microbenchmark with matched versions if you really want to know @gowerc ?
I did not do a speed comparison at all so far on this, prioritizing reliability. However, the TMB option is a user option, so users could change this from their side if they are ok to sacrifice reliability (and they will get a warning from mmrm now)
With 0.2.2 and current TMB, I get a 1 ms average difference (very slightly slower). Also agree with @danielinteractive , reliability before performance.
@danielinteractive, I guess a new mmrm release on CRAN would also cause recompilation of mmrm with the new TMB? Do you know if CRAN updates reverse dependency binaries when their version does not change?
@luwidmer yeah exactly, therefore I am keen to have soon a new CRAN release. I don't think the binaries will be updated otherwise.
You can probably close #246 as well with that release 😄
Summary
Different runs of the minimal example from the Package introduction vignette on the same machine can yield different results. The numerical differences between the results are small but could affect downstream calculations. Running the example twice in the same session will give the same results, but different sessions can yield different results. This issue occurs across different operating systems (Windows, Linux, macOS) and hardware. This might be related to issue #246. This is joint work with @luwidmer and @bailliem.
Your brief description of the problem
R session info
OS / Environment
The outputs above were obtained on:
We have observed the same issue in different configurations (Windows 11 x64 on Ryzen; macOS Sonoma 14.6.1 on aarch64-apple-darwin20). We've also observed the issue also using previous package version (mmrm v0.2.2).