Spline model prediction inconsistencies when changing BLAS/LAPACK implementation

pmhogan commented 1 year ago

Hi @paul-buerkner, and thank you for creating and maintaining such an amazing package 💛

I have encountered and isolated the cause of an issue which feels closely related to the unresolved issues #1386 and #1462, whereby significant inconsistencies can sometimes occur when brms spline models trained in one environment are used for prediction in a different environment. Certain seemingly well-behaved models will produce incorrect results without warning when the BLAS/LAPACK libraries used by R are changed between environments.

These inconsistencies are unpredictable and consistent results appear to be only guaranteed when either the R-internal, reference libraries, or the same version of external libraries, are used in both environments. Please find below a summary of my investigations and a reproducible example; very happy to share full further details (including numerical results, further plots and our Docker container) upon request.

Background

Whilst developing a complex GAM on a Cray cluster, small but persistent inconsistencies were found when calculating predictions on our local Ubuntu machine using the model trained on the cluster compared to predictions using the exact same model trained on the Ubuntu machine.
Comparing two much simpler models (the canonical epilepsy model, and a Poisson spline mortality model over core (large-N) ages ~35-70) between these environments, no discernible inconsistencies were found beyond what could reasonably be attributed to stochasticity.
However the same Poisson spline mortality model over all ages 0-110 (including where data are more sparse) showed clear inconsistencies, with the cluster model producing obviously incorrect results (mortality decreasing with old age) on our local Ubuntu machine:

Posterior predictive check	Force of mortality

Similar inconsistencies were then found between models trained on our local Debian machine and our Ubuntu machine, indicating that the problem is much more general and not just related to the cluster.
Note that no issues or inconsistencies were found when models are used for prediction in the same environment in which they were trained: all models trained on Debian produced correct results on Debian; all models trained on Ubuntu produced correct results on Ubuntu; and all models trained on the cluster produced correct results on the cluster; and all these results were stochastically consistent with each other.
In all cases the models appear to be well-behaved, their summaries appear to be very similar, and there is no indication that anything is amiss until the model objects are used to make predictions in an environment different to the one in which it was trained.

Isolating the cause

Guided by the Stan Reference Manual’s page on model reproducibility and a comment of yours, we systematically tested our minimal reproducible example in different environments to eliminate potential causes of these inconsistencies.
In the tests below, the following necessary components for reproducibility were always kept identical:
- Code and packages :white_check_mark:
- Stan version :white_check_mark:
- Stan interface (RStan, PyStan, CmdStan) and version :white_check_mark:
- Versions of included libraries (Boost and Eigen) :white_check_mark:
- Same configuration of call to Stan, including random seed, chain ID, initialization and data :white_check_mark:

Test: changing R version

In this test on the cluster, the following components were kept identical:

Version of interface language (R, Python, shell) :no_entry_sign:
- R 3.6.3 vs R 4.1.1
Operating system version :white_check_mark:
Computer hardware including CPU, motherboard and memory :white_check_mark:
C++ compiler, including version, compiler flags, and linked libraries :white_check_mark:
- Using BLAS/LAPACK implementation cray-libsci/20.09.1

No inconsistencies between environments were found; changing R version with the same external BLAS/LAPACK libraries gives identical, correct results. The cause is therefore is unlikely to be the R version, nor the external libraries per se.

Test: changing system architecture

In this test, the following components were kept identical:

Version of interface language (R, Python, shell) :white_check_mark:
Operating system version :no_entry_sign:
- Windows vs macOS vs Ubuntu
Computer hardware including CPU, motherboard and memory :no_entry_sign:
- Different machines
C++ compiler, including version, compiler flags, and linked libraries :no_entry_sign:
- But using the R-internal reference implementation of BLAS/LAPACK in each case

No inconsistencies between environments were found; changing system architecture whilst using the R-internal BLAS/LAPACK libraries gives stochastically consistent, correct results. The cause is therefore likely to be the C++ compiler, compiler flags or linked libraries themselves.

Test: changing C++ compiler

In this test on the cluster, the following components were kept identical:

Version of interface language (R, Python, shell) :white_check_mark:
Operating system version :white_check_mark:
Computer hardware including CPU, motherboard and memory :white_check_mark:
C++ compiler, including version :no_entry_sign:
- gcc7 vs gcc9 vs gcc11
Compiler flags, and linked libraries :white_check_mark:
- Using BLAS/LAPACK implementation cray-libsci/20.09.1

No inconsistencies between environments were found; changing compiler whilst using the same external BLAS/LAPACK libraries gives stochastically consistent results. The cause is therefore unlikely to be the complier version; complier flags also judged to be unlikely but not explicitly tested.

Test: changing linked BLAS/LAPACK library implementation

In this test using a Docker container on the same local Ubuntu machine, the following components were kept identical:

Version of interface language (R, Python, shell) :white_check_mark:
Operating system version :white_check_mark:
Computer hardware including CPU, motherboard and memory :white_check_mark:
C++ compiler, including version and compiler flags :white_check_mark:
Linked libraries :no_entry_sign:
- R-internal, reference implementation of BLAS/LAPACK vs openblasp-r0.3.8

Dramatic inconsistencies were found; changing from the R-internal, reference BLAS/LAPACK libraries to OpenBLAS in an otherwise identical environment produces clearly incorrect predictions. The cause therefore appears to be changing the BLAS/LAPACK library implementations between the training and prediction environments.

Inconsistencies were also found in a similar test comparing openblasp-r0.3.5 to openblasp-r0.3.8; therefore changing even the version of an external BLAS/LAPACK library appears to be potentially problematic. However some external BLAS/LAPACK libraries are (stochastically) consistent with each other: in an additional test comparing openblasp-r0.3.5 on Debian to cray-libsci/20.09.1 on the cluster no inconsistencies were found.

Test: does the cause lie with the underlying mgcv spline?

Running the frequentist version of our reproducible example with mgcv::gam() using exactly the same GAM formula and data on our local Ubuntu machine (using the R-internal, reference implementation of BLAS/LAPACK) and the cluster (using cray-libsci/20.09.1) found no inconsistencies between these environments. The cause is therefore unlikely to lie with the underlying mgcv spline itself.

Solution

Changing to the R-internal, reference implementation of BLAS/LAPACK was not possible directly on the cluster. Instead I deployed our Docker container, in which the following components were kept identical when compared to our local Ubuntu machine:

Version of interface language (R, Python, shell) :white_check_mark:
Operating system version :white_check_mark:
Computer hardware including CPU, motherboard and memory :no_entry_sign:
- Local machine vs cluster
C++ compiler, including version, compiler flags, and linked libraries :white_check_mark:

No inconsistencies between environments were found; changing only computer hardware whilst using the R-internal BLAS/LAPACK libraries gives identical, correct results. Model portability and reproducibility between environments therefore appears to be retained when using the R-internal BLAS/LAPACK libraries in both environments.

As to the underlying reason exactly why changing these BLAS/LAPACK libraries sometimes causes these inconsistencies in spline model predictions, that is way beyond my expertise. Whilst there is evidence that OpenBLAS and friends can speed up brms modelling runtime, we feel that for us these gains are not worth the resulting loss of model reproducibility and portability, and will be sticking to the R-internal, reference implementations as the solution to this issue for now.

Could it however be useful in future versions of brms for models to perhaps check that the BLAS/LAPACK implementation in the environment in which they are used for prediction is the same as that in which they was trained?

I hope these investigations are helpful in getting to the bottom of this issue, and thanks again Paul for creating such a fantastic Bayesian package 🤩

paul-buerkner commented 1 year ago

Oh wow, what an amazing and detailed analysis! Thank you so much!

I am glad you have managed to isolate this problem. I still don't understand why/where brms does something affected by BLAS/LAPACK while mgcv is not, but I will check all the brms post processing for any trace of special code that may cause this issue.

In the worst case, we can, as you say, at least add a check to warn the user about a potential problem. In this context, a small question for you. How can I ask R to give me information about the BLAS/LAPACK implementation used in an environment?

Also, another question, if that is not too much to ask: Are the posteriors of the spline coefficients actually systematically different if predictions are different? I.e. is the problem really in the posterior samples of spline coefficients? I would assume yes, because otherwise the problem would be even weirder than it already is. But at this point anything is possible with this issue.

FedericoComoglio commented 1 year ago

Hi @paul-buerkner, the simplest way to check the BLAS/LAPACK version is via sessionInfo().

pmhogan commented 1 year ago

Thank you, Paul, for your feedback and quick response!

Finding out which BLAS/LAPACK implementation is used in a given environment is frustratingly murky and not always precise. In at least more modern versions of R, calling sessionInfo() gives e.g. for my local Ubuntu machine:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

the latter three lines which in turn come from the results of calling getOption("matprod"), extSoftVersion()["BLAS"] and La_library() respectively.

The former appears to be an important and necessary correction to even the reference (Fortran) implementations of the libraries as "BLAS libraries do not propagate NaN or Inf values correctly and for inputs with NaN/Inf values the results may be undefined."
The latter two give "an indication of which implementation is in use"... but only an indication! From the BLAS docs (and likewise for LAPACK):

Typically, the R version of BLAS will appear as libR.so (libR.dylib), R or libRblas.so (libRblas.dylib), depending on how R was built. Note that libRblas.so (libRblas.dylib) may also be shown for an external BLAS implementation that had been copied, hard-linked or renamed by the system administrator. For an external BLAS, a shared object file will be given and its path/name may indicate the vendor/version. The detection does not work on Windows nor for the Accelerate framework on macOS.

In my example above, I know that this is the R-internal version because I sat down and went through our Docker definition file with our Deployment Engineer. But it may not have been, had unbeknown to me "an external... implementation... been copied, hard-linked or renamed" 😱
On Windows, we don't get any useful information from the above commands. But I know it was the R-internal version because I installed it and didn't override the default libraries with external implementations:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default
[MISSING BLAS/LAPACK INFO]

On my macOS machine with the R-internal version we get similar to (libBLAS.dylib) but not exactly what we would expect (libRblas.dylib with an extra R) from the above (note also the caveats about about the Accelerate framework):

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

External implementations then give widely different results, from which exact details and documentation can be extremely hard to pin down. For example, from our cluster:
```
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: SUSE Linux Enterprise Server 15 SP2
```

Matrix products: default BLAS/LAPACK: /opt/cray/pe/libsci/20.09.1/GNU/8.1/x86_64/lib/libsci_gnu_82_mp.so.5.0


Digging into the cluster module help files, this is  Cray LibSci 20.09.1, which I'm guessing are the [Cray Scientific and Math Libraries](https://support.hpe.com/hpesc/public/docDisplay?docId=a00114930en_us&docLocale=en_US&page=Cray_Scientific_and_Math_Libraries_CSML_.html). But I can't find the documentation for v20.09.1.

- On our Debian machine we used an OpenBLAS pthread implementation v0.3.5, which includes LAPACK and gives:

sessionInfo() R version 3.6.3 (2020-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 10 (buster)

Matrix products: default BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so



- All in all, there does not appear to be a fail-safe way of reliably and precisely identifying which BLAS/LAPACK implementation is used in a given environment, which is deeply frustrating. Perhaps until it is possible to identify and fix the issue at source, a warning to users that they should *ensure* they are using the default, R-internal implementations (or the exact same version used in training) to retain spline model reproducibility and portability might be the way to go?

urskalbitzer commented 1 year ago

Hi @pmhogan , hi all, Thanks so much for putting all this effort into solving this issue and all the effort writing down these summaries to share them here. Is there a "simple" way to use the default, R-internal implementations from within R Studio Server/Posit, or do I have to talk to the people who are managing the server?

pmhogan commented 1 year ago

@urskalbitzer -- not really as far as I can tell 😐 I was unable to find a way to install the R-internal implementation on the cluster and had to resort to a container. If you have admin rights on the server, this post describes how to install and switch between alternatives. The package flexiblas also looks promising, though I haven't tried it and might also need admin rights. Otherwise, you may have to talk to the admins, as I in the end had to resort to...

pmhogan commented 1 year ago

@paul-buerkner -- regarding the posterior samples of the spline coefficients, yes they do appear to be different between the reprex trained on Ubuntu (using the R-internal implementation) and on the cluster (using cray-libsci/20.09.1). Already from the model summaries the sds terms look different (all results and plots generated on Ubuntu):

> model_ubuntu
...
Smooth Terms: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(sAge_1)    13.54      3.53     9.03    22.48 1.00      495      664

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -5.31      0.01    -5.33    -5.30 1.00     1916     2140
sAge_1      -34.95      0.99   -36.90   -33.02 1.00     1810     1960

> model_cluster
...
Smooth Terms: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(sAge_1)    13.29      3.09     8.64    20.33 1.01      425      854

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -5.31      0.01    -5.33    -5.30 1.00     1418     1826
sAge_1      -34.95      1.03   -36.95   -32.88 1.00     1284     1596

Plotting these reveals more differences:

`b_Intercept`	`bs_sAge_1`	`sds_sAge_1`

The pairs plots for the spline parameters in particular have a noticeably different shape between models:

Ubuntu model	Cluster model

If you would like to further inspect the model obejcts yourself, they are available in the reprex repo under /models, named by the environment in which they were trained.

paul-buerkner commented 1 year ago

Thank you! For the hyperparameter you show above, the differences seem quite small actually, and perhaps not necessarily relevant. However, when looking at the actual spline coefficients, e.g. via

mcmc_plot(mortality_windows_R4_1_1, variable = "^s_", regex = TRUE)
mcmc_plot(mortality_cluster_R4_1_1, variable = "^s_", regex = TRUE)

we see that they are vastly different:

plot_windows_R4 1 1 plot_cluster_R4 1 1

Accordingly, because (a) the posterior of the spline coefficients are different, yet (b) predictions are correct if the fitting and prediction environment is the same, we can conclude that the evaluated spline basis functions (i.e., the spline "design matrices") are different. These matrices are called Zs_* when created via make_standata. It is plausible to assume that BLAS/LAPACK affects how they are computed, which happens almost entirely within mgcv.

Above, you said that mgcv does not suffer from this problem. However, I assume you have tested with mgcv::gam not with mgcv::gamm, right? The latter uses the "random effects" parameterization of splines, which is exactly what brms does too (it uses mgcv as if it was preparing the design matrices for use in mgcv::gamm).

Would you mind checking in your setup whether mgcv::gamm has the same problems as brms? If yes, then we know that the problem is in the reparameterization routines of mgcv. If not, then we can narrow down the problem to a few dozen lines of brms code. I doubt that there is any problem in the latter, because I basically do zero linear algebra there, but I have been surprised before.

pmhogan commented 1 year ago

Ah-ha! Curiouser and curiouser 🧐

Yes, when I tested mgcv I used mgcv::gam which did not show these inconsistencies -- I did not realise that brms actually uses mgcv:gamm for a spline model, even in the absence of explicit group effects. I will try and test this today, before I leave on a ten-day vacation tomorrow.

paul-buerkner commented 1 year ago

Thank you so much! brms doesn't use mgcv::gamm directly, but its data-preparation pipeline, concretely mgcv::smooth2random(..., type = 2). Somewhere in the latter place is where I assume the problem to occur.

pmhogan commented 1 year ago

For sure, happy to help how I can. To double check, in my previous mgcv test I used the formula:

mgcv::gam(Deaths ~ s(Age) + offset(log(Exposure)), ...)

For this next test, should I simply substitute mgcv::gam for mgcv::gamm without any further changes, or is there anything else I need to adjust?

paul-buerkner commented 1 year ago

Yes, I think just this should suffice.

pmhogan commented 1 year ago

Okay, I have run the mgcv test on Ubuntu and on the cluster substituting mgcv::gam for mgcv::gamm without any further changes, and the resulting outputs on Ubuntu from the command plot(<model>$gam) appear to be very similar:

Ubuntu model on Ubuntu	Cluster model on Ubuntu

However, digging into the docs for mgcv::smooth2random I spotted some differences in this function's output when it is run on Ubuntu (using the R-internal implementation) and on the cluster (using cray-libsci/20.09.1). Following the syntax in the man page and using the data in my reprex:

data <- tibble::tibble(Age = 0:110, Deaths = MortalityLaws::ahmd$Dx$`2010`, Exposure = MortalityLaws::ahmd$Ex$`2010`)
sm <- smoothCon(s(Age), data)[[1]]
re <- smooth2random(sm, "", type = 2)
str(re)

we get on Ubuntu:

List of 8
 $ rand   :List of 1
  ..$ Xr: num [1:111, 1:8] -0.0631 -0.0636 -0.0641 -0.0645 -0.0647 ...
  .. ..- attr(*, "s.label")= chr "s(Age)"
 $ Xf     : num [1:111, 1:2] -1.72 -1.69 -1.65 -1.62 -1.59 ...
 $ trans.U: num [1:10, 1:10] -0.00335 -0.01251 -0.00524 -0.01564 0.0068 ...
 $ trans.D: num [1:10] 0.0569 0.0741 0.0776 0.1108 0.1229 ...
 $ fixed  : logi FALSE
 $ rind   : int [1:8] 1 2 3 4 5 6 7 8
 $ rinc   : num [1:8] 8 8 8 8 8 8 8 8
 $ pen.ind: num [1:10] 1 1 1 1 1 1 1 1 0 0

but on the cluster it's a little different, with different values of Xr and some values of trans.U having their sign flipped:

List of 8
 $ rand   :List of 1
  ..$ Xr: num [1:111, 1:8] -0.119 -0.119 -0.118 -0.118 -0.117 ...
  .. ..- attr(*, "s.label")= chr "s(Age)"
 $ Xf     : num [1:111, 1:2] -1.72 -1.69 -1.65 -1.62 -1.59 ...
 $ trans.U: num [1:10, 1:10] 0.00335 -0.01251 0.00524 -0.01564 -0.0068 ...
 $ trans.D: num [1:10] 0.0569 0.0741 0.0776 0.1108 0.1229 ...
 $ fixed  : logi FALSE
 $ rind   : int [1:8] 1 2 3 4 5 6 7 8
 $ rinc   : num [1:8] 8 8 8 8 8 8 8 8
 $ pen.ind: num [1:10] 1 1 1 1 1 1 1 1 0 0

I don't know what to make of this, but I attach the relevant files here if you would like to take a closer look:

rstudio-export.zip

paul-buerkner commented 1 year ago

Yes, brms uses the Xr (random effects design matrices), so that seems to explain the difference although I am not fully sure why mgcv::gamm itself works fine.

Perhaps because it backtransforms spline parameters from the "random effects" to the "penalized spline" parameterization directly after model fitting (i.e. in the environment the model it fitted). There, the linear algebra is consistent and so is the back transformation. Brms doesn't do this but continues to use the "random effects" coefficients and design matrices for prediction, which then yields the issues we are seeing.

paul-buerkner commented 1 year ago

I will make the developer of mgcv, Simon Wood, aware of this issue here and hope that he can help us find a fix for this issue.

paul-buerkner commented 1 year ago

Simon has responded and was very helpful. The important part he said is the following:

Eigenvector signs are obviously arbitrary, so that could indeed result in nonsense (depending on the architecture, you are not even absolutely guaranteed the same sign on a repeat call with the same blas!).

We can fix it in brms by storing the smooth2random output computed at model fitting time and then reuse this. I will have to figure out how to best integrate this into the existing code structure though. Simon also mentioned that there may be a fix from within mgcv, but that might be a bit tedious and I don't want to impose a lot of work on him just because of how I implement stuff in brms.

paul-buerkner commented 1 year ago

Simon just send me an updated version of mgcv (attached below) that he assumes will fix this problem we see in brms. @pmhogan (and everyone else encountering this issue), would you mind trying it out?

mgcv_1.8-43.tar.gz

urskalbitzer commented 1 year ago

I've tried it out but without success.

On the linux server (in a R Studio environment) I have:

uninstalled brms
uninstalled mccv
re-installed brms from GitHub
installed mgcv from the file posted above (mccv_1.8-43.tar.gz)
Re-ran the reprex from issue #1462
Saved and downloaded brmsfit object

On my local computer, I have:

re-installed brms from GitHub
Tried to re-install mgcv from the filed posted above, which failed. Do I need it here as well or is it sufficient to install it in the environment where the model is created?
Checked and compared the output from conditional_effect(brmsfit-object) => Same problem as reported in #1462

Not sure if I have to re-install mgcv in both environments and I can try again. Or post any required information. but perhaps someone else could try as well before I post a lot of information.

paul-buerkner commented 1 year ago

you need the new mgcv in both environments as far as I understood Simon.

Urs Kalbitzer @.***> schrieb am Mo., 13. März 2023, 21:35:

I've tried it out but without success.

On the linux server (in a R Studio environment) I have:

uninstalled brms

uninstalled mccv

re-installed brms from GitHub

installed mgcv from the file posted above (mccv_1.8-43.tar.gz)

Re-ran the reprex from issue #1462 https://github.com/paul-buerkner/brms/issues/1462

Saved and downloaded brmsfit object

On my local computer, I have:

re-installed brms from GitHub

Tried to re-install mgcv from the filed posted above, which failed. Do I need it here as well or is it sufficient to install it in the environment where the model is created?

Checked and compared the output from conditional_effect(brmsfit-object) => Same problem as reported in #1462 https://github.com/paul-buerkner/brms/issues/1462

Not sure if I have to re-install mgcv in both environments and I can try again. Or post any required information. but perhaps someone else could try as well before I post a lot of information.

— Reply to this email directly, view it on GitHub https://github.com/paul-buerkner/brms/issues/1465#issuecomment-1466831858, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADCW2ABNETFFWXVZDRCYCELW35ZJVANCNFSM6AAAAAAVMLEWPE . You are receiving this because you were mentioned.Message ID: @.***>

urskalbitzer commented 1 year ago

Unfortunately, I am not able to install this version of mgcv from the file on MacOS (tried it on two computers). Perhaps, someone else with a local linux/windows machine can try it.

install.packages("~/Downloads/mgcv_1.8-43.tar", repos = NULL)
#> Error in install.packages("~/Downloads/mgcv_1.8-43.tar", repos = NULL): type == "both" cannot be used with 'repos = NULL'
install.packages("~/Downloads/mgcv_1.8-43.tar", repos = NULL, type = "binary")
#> Error: file '~/Downloads/mgcv_1.8-43.tar' is not a macOS binary package
install.packages("~/Downloads/mgcv_1.8-43.tar", type = "binary")
#> Warning: package '~/Downloads/mgcv_1.8-43.tar' is not available as a binary package for this version of R
#> 
#> A version of this package for your version of R might be available elsewhere,
#> see the ideas at
#> https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

^{Created on 2023-03-14 with reprex v2.0.2}

paul-buerkner commented 1 year ago

Use type = "source" in install.packages. That should do the trick.

urskalbitzer commented 1 year ago

Unfortunately, that doesn't work out either

If it's helpful, I can post the .rds file with the brmsfit object so that someone else can run it.

From Reprex:

install.packages("~/Downloads/mgcv_1.8-43.tar", repos = NULL, type = "source")
#> Warning in install.packages("~/Downloads/mgcv_1.8-43.tar", repos = NULL, :
#> installation of package '/Users/Urs/Downloads/mgcv_1.8-43.tar' had non-zero
#> exit status
install.packages("~/Downloads/mgcv_1.8-43.tar", type = "source")
#> Warning: package '~/Downloads/mgcv_1.8-43.tar' is not available for this version of R
#> 
#> A version of this package for your version of R might be available elsewhere,
#> see the ideas at
#> https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

^{Created on 2023-03-14 with reprex v2.0.2}

(there is some text that was shown in the console but not included in the reprex:

#>* installing *source* package ‘mgcv’ ...
#>** using staged installation
#>** libs
#>clang -mmacosx-version-min=10.13 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/usr/local/include   -fPIC  -Wall -g -O2  -c coxph.c -o coxph.o
#>clang -mmacosx-version-min=10.13 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/usr/local/include   -fPIC  -Wall -g -O2  -c davies.c -o davies.o
...(some more lines like above)

)

paul-buerkner commented 1 year ago

Thank you for trying it out. In this case, we probably have to wait for @pmhogan to check it out on their setup.

paul-buerkner commented 1 year ago

If you like, you can still post the brmsfit fit rds file here. I can install the new mgcv version myself so can try it out.

urskalbitzer commented 1 year ago

Sounds good, thanks! It's based on the following code and the plot with conditional_effects() looks normal on the linux server but not on my local machine:

library(brms)
set.seed(123)
df <- data.frame(x = rep(1:50, each = 10))
df$y_prob <- (sin(df$x) + 1)/seq(from = 2, to = 4, length.out = 500)
df$y_count = rbinom(n = length(df$y_prob), size = 4, prob = df$y_prob)

mod <- brm(y_count | trials(4) ~ t2(x, bs = c("cr"), k = 50),
           family = binomial,
           chain = 1,
           data = df)
saveRDS(mod, file = here::here("model_results/linux_macos_issue_model.rds"))

linux_macos_issue_model.rds.zip

paul-buerkner commented 1 year ago

Thanks! Doesn't look similar enough yet, unfortunately. But let's wait what @pmhogan find in their setup. If it remains unfixed, I will ask Simon again.

pmhogan commented 1 year ago

Thank you for the further investigations, @paul-buerkner, @urskalbitzer and Simon! I have returned from my holidays and will test this on my setup as soon as possible and get back to you 👍

pmhogan commented 1 year ago

@paul-buerkner -- I have run the Poisson spline mortality reprex and the mgcv::smooth2random test on Ubuntu and on the cluster using Simon's updated version of mgcv v1.8-43 installed on both machines, and unfortunately it does not appear to have resolved the issue 😕

Ubuntu model on Ubuntu	Cluster model on Ubuntu
using mgcv_1.8-43	using mgcv_1.8-43

`bayes_R2(mortality_ubuntu_mgcv_43)[[1]] = 0.9982312`	`bayes_R2(mortality_cluster_mgcv_43)[[1]] = 0.4813196`

However the output of smooth2random(..., type = 2) on Ubuntu has changed from previously (which used mgcv v1.8-41):

> str(re)
List of 8
 $ rand   :List of 1
  ..$ Xr: num [1:111, 1:8] 0.0631 0.0636 0.0641 0.0645 0.0647 ...
  .. ..- attr(*, "s.label")= chr "s(Age)"
 $ Xf     : num [1:111, 1:2] 1.72 1.69 1.65 1.62 1.59 ...
 $ trans.U: num [1:10, 1:10] 0.00335 0.01251 0.00524 0.01564 -0.0068 ...
 $ trans.D: num [1:10] 0.0569 0.0741 0.0776 0.1108 0.1229 ...
 $ fixed  : logi FALSE
 $ rind   : int [1:8] 1 2 3 4 5 6 7 8
 $ rinc   : num [1:8] 8 8 8 8 8 8 8 8
 $ pen.ind: num [1:10] 1 1 1 1 1 1 1 1 0 0

but on the cluster it appears to be unchanged from previously:

> str(re)
List of 8
 $ rand   :List of 1
  ..$ Xr: num [1:111, 1:8] -0.119 -0.119 -0.118 -0.118 -0.117 ...
  .. ..- attr(*, "s.label")= chr "s(Age)"
 $ Xf     : num [1:111, 1:2] -1.72 -1.69 -1.65 -1.62 -1.59 ...
 $ trans.U: num [1:10, 1:10] 0.00335 -0.01251 0.00524 -0.01564 -0.0068 ...
 $ trans.D: num [1:10] 0.0569 0.0741 0.0776 0.1108 0.1229 ...
 $ fixed  : logi FALSE
 $ rind   : int [1:8] 1 2 3 4 5 6 7 8
 $ rinc   : num [1:8] 8 8 8 8 8 8 8 8
 $ pen.ind: num [1:10] 1 1 1 1 1 1 1 1 0 0

Please find the full mgcv v1.8-43 smooth2random outputs below:

rstudio-export_mgcv_43.zip

rubenarslan commented 1 year ago

Just a note to say that we are indeed (as Simon Wood anticipated) massive inconsistencies even on the same machine, same R, same BLAS (/usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3) when a model is reloaded in a new session.

This is with mgcv_1.8-42 (we didn't update yet, since @pmhogan says it failed to solve the problem).

paul-buerkner commented 1 year ago

Okay, I really need to fix this. Will prioritize it tomorrow.

rubenarslan commented 1 year ago

I'm happy to stand by to run one of the reprexes on our (apparently particularly prone to this bug) cluster.

andymilne commented 1 year ago

In case you haven't seen this, maybe these BLAS/LAPACK changes in R4.3.0 are of relevance ...

https://cran.r-project.org/bin/windows/base/NEWS.R-4.3.0.html

paul-buerkner commented 1 year ago

I think I have now finally fixed this issue in the issue-1465 branch. The fix is clean I think and should work even with mgcv versions < 1.8-43.

Before I merge, can someone please try it out on their cluster setup and confirm that this issue is really fixed?

paul-buerkner commented 1 year ago

This fix will work out of the box only for newly fitted models. But it should also be working for old models, if one runs restructure() on them on the original model fitting machine before saving them to disk again.

rubenarslan commented 1 year ago

broken fixed

I used @urskalbitzer's example model. Fitting a new model with the issue branch version worked (i.e., plot looks good both on Linux cluster and my local Mac). Restructuring a "broken" model also works! This is great, thanks so much for the fix.

Given how badly the splines can mislead, I would suggest to emit a warning whenever a model that hasn't been restructured with the new code is used and the blas/platform has changed (assuming that this is saved somewhere).

paul-buerkner commented 1 year ago

Thank you @rubenarslan! What do I see in the two plots above? The look quite different but I assume one is the old (wrong) plot and one is the new (correct) plot?

rubenarslan commented 1 year ago

Sorry, only wrote it in the "alt" text. Yes, the first one is one where the bug occurred, the second one is the right result (with the new version or after running restructure).

paul-buerkner commented 1 year ago

Great, thank you!

paul-buerkner commented 1 year ago

Closing now. Thank you all for your help to fix this issue!

urskalbitzer commented 1 year ago

Also many, many thanks from my side to you, @paul-buerkner, and all people who helped to fix this issue. This will be very helpful to proceed with one of my projects.

pmhogan commented 1 year ago

Amazing! I have also just been able to test the fix on our cluster using the Poisson spline mortality reprex, and it likewise works like a dream 😀 The bayes_R2() values, plus the conditional_smooths() and pp_check() plots for the Ubuntu, cluster and restructured cluster models are now all 100% consistent with each other 🎉

If I may offer one suggestion, perhaps it could be stressed in the restructure() man page (as you did in your comment) that this restructuring only works as expected when used on a model in the environment in which it was originally trained? This is perhaps obvious in hindsight, but it nearly tripped me up on my first attempt...

Thank you so much for the fix, @paul-buerkner (and for creating the whole brms universe), and to all those who helped get to the bottom of this issue 🙏

paul-buerkner commented 1 year ago

Thank you all again for helping me to fix this!

I have also updated the doc of restructure to incorporate @pmhogan's suggestion.

paul-buerkner / brms