rstudio / keras3

R Interface to Keras
https://keras3.posit.co/

After update, MWE does not produce same/similar output anymore (fit() or predict() problem) #1411

Open mhofert opened 9 months ago

mhofert commented 9 months ago

Hi,

After a recent update of Python/TensorFlow/Keras, a minimal working example (MWE) I used to run to produce samples from a target distribution no longer produces such samples (they are close, but clearly from a different distribution; see the attached screenshots below). After more than 24 h of searching for the needle in the haystack, I'm still clueless. A colleague ran the MWE under his setup on Windows with older versions of Python/TensorFlow/Keras and obtained the correct samples as we always did, and so did another colleague on macOS. Our loss functions also produce very similar values, so we are still unsure whether the problem lies in keras' fit() or predict().

Here is the full story, which by now I consider a 'bug'; I post it in the hope that others may find it when they realize their networks no longer train/predict properly. The biggest issue is that this can remain entirely undetected, as the loss functions don't indicate any problem... hence this post. It also means that certain R packages (e.g. 'gnn') can currently work for some users (my colleague) but not for others (myself), without any warning.

The MWE trains a single-hidden-layer neural network (NN) to act as a random number generator (RNG). I pass iid N(0,1) samples through the NN and compare the output, via the MMD (maximum mean discrepancy) loss function that we implemented, to given dependent multivariate samples from some target distribution (here: scaled ranks of absolute values of correlated normals); jointly with the NN, this is called a GMMN, a generative moment matching network.
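
For reference (my own summary of the standard estimator, not quoted from anywhere; it matches what the loss function in the MWE below computes), the empirical MMD between samples X = (x_1, ..., x_n) and Y = (y_1, ..., y_m) is

\mathrm{MMD}(X, Y) = \sqrt{\frac{1}{n^2}\sum_{i,j} k(x_i, x_j) + \frac{1}{m^2}\sum_{i,j} k(y_i, y_j) - \frac{2}{nm}\sum_{i,j} k(x_i, y_j)},

where k(x, y) = \exp(-\lVert x - y\rVert^2 / (2\sigma^2)) is a Gaussian kernel, averaged over a small set of bandwidths \sigma.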

The MWE below worked well with R running inside a virtual Python environment (installed with Miniforge3 on my M1 14" MacBook Pro, first gen) and with TensorFlow installed via "conda install -c apple tensorflow-deps" and "python -m pip install tensorflow-metal". That was until about a year ago. When I wanted to run the MWE again this week, I received:

Error: Valid installation of TensorFlow not found.

Python environments searched for 'tensorflow' package:
/usr/local/miniforge3/bin/python3.10
...
ModuleNotFoundError: No module named 'tensorflow'

You can install TensorFlow using the install_tensorflow() function.

After reinstalling Python/TensorFlow/Keras in the exact way as I used to do, I still received this error. I then read on https://github.com/t-kalinowski/deep-learning-with-R-2nd-edition-code/issues/3 that the following is the (now) recommended way to install Python/TensorFlow/Keras on all platforms, so I did:

install.packages("remotes")
remotes::install_github("rstudio/keras")
reticulate::install_python()
keras::install_keras()

After that, the MWE ran again. However, it did not properly generate samples from the target distribution anymore. I cannot go back to older versions of the R package 'keras' as then the above error appears again.

Here is the MWE, followed by sessionInfo() etc. for my setup and for my colleague's (on Windows). Again, he obtains very similar loss values, but my generated samples look normal and are no longer asymmetric as they should be (his are fine).

library(tensorflow) # only needed for our custom MMD loss function
library(keras)

## Generate training data U (scaled ranks of absolute values of correlated normals)
d <- 2 # bivariate case
P <- matrix(0.9, nrow = d, ncol = d); diag(P) <- 1 # correlation matrix
A <- t(chol(P)) # Cholesky factor
ntrn <- 50000 # training data sample size
set.seed(271)
Z <- matrix(rnorm(ntrn * d), ncol = d) # generate N(0,1)
X <- abs(Z %*% t(A)) # absolute values of N(0,P) samples
U <- apply(X, 2, rank) / (ntrn + 1) # training data
if(FALSE)
    plot(U, pch = ".") # ... to see the rough sample shape we are aiming for

## Helper function for custom MMD loss function (from 'gnn')
radial_basis_function_kernel <- function(x, y, bandwidth = 10^c(-3/2, -1, -1/2, -1/4, -1/8, -1/16))
{
    x. <- tf$expand_dims(x, axis = 1L)
    y. <- tf$expand_dims(y, axis = 0L)
    dff2 <- tf$square(x. - y.)
    dst2 <- tf$reduce_sum(dff2, axis = 2L)
    dst2.vec <- tf$reshape(dst2, shape = c(1L, -1L))
    fctr <- tf$convert_to_tensor(as.matrix(1 / (2 * bandwidth^2)), dtype = dst2.vec$dtype)
    kernels <- tf$exp(-tf$matmul(fctr, b = dst2.vec))
    tf$reshape(tf$reduce_mean(kernels, axis = 0L),
               shape = tf$shape(dst2))
}

## Maximum mean discrepancy (MMD) loss function (from 'gnn')
MMD <- function(x, y, ...)
{
    is.R.x <- !tf$is_tensor(x)
    is.R.y <- !tf$is_tensor(y)
    if(is.R.x) x <- tf$convert_to_tensor(x, dtype = "float64")
    if(is.R.y) y <- tf$convert_to_tensor(y, dtype = "float64")
    res <- tf$sqrt(tf$reduce_mean(radial_basis_function_kernel(x, y = x, ...)) +
                   tf$reduce_mean(radial_basis_function_kernel(y, y = y, ...)) -
                   2 * tf$reduce_mean(radial_basis_function_kernel(x, y = y, ...)))
    if(is.R.x || is.R.y) as.numeric(res) else res
}
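
## (Hypothetical sanity check of MMD(), not part of the original MWE: two
##  samples from the same distribution should give a smaller value than two
##  samples from clearly different distributions.)
if(FALSE) {
    set.seed(42)
    S1 <- matrix(runif(1000 * d), ncol = d)
    S2 <- matrix(runif(1000 * d), ncol = d)
    S3 <- matrix(rbeta(1000 * d, shape1 = 0.3, shape2 = 0.3), ncol = d)
    MMD(S1, y = S2) # small (same distribution)
    MMD(S1, y = S3) # larger (different distributions)
}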

## Setup model
in.lay <- layer_input(shape = 2)
hid.lay <- layer_dense(in.lay,  units = 300, activation = "relu")
out.lay <- layer_dense(hid.lay, units = 2,   activation = "sigmoid")
model <- keras_model(in.lay, out.lay)
compile(model, optimizer = "adam", loss = function(x, y) MMD(x, y = y))
## Note:
## 1) Even with loss = "mse" I get different sample shapes than before
##    (before they were scattered around (1/2, 1/2), now they seem to be normal around (1/2, 1/2))
## 2) With optimizer = optimizer_adam() instead of optimizer = "adam", I get the following
##    (but training seems to remain unaffected):
##    WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.Adam` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.Adam`.
##    WARNING:absl:There is a known slowdown when using v2.11+ Keras optimizers on M1/M2 Macs. Falling back to the legacy Keras optimizer, i.e., `tf.keras.optimizers.legacy.Adam`.
## 3) I also tried optimizer = keras$optimizers$legacy$Adam() but it makes no difference

## Train
fit(model,
    x = matrix(rnorm(ntrn * d), ncol = 2), # prior sample (here: training)
    y = U, # training data to match (here: target data)
    batch_size = 500, epochs = 10) # small values here, but enough so that we should see barely any difference between the generated samples and those in the training data

## Generate from trained model by passing through new prior samples
N <- matrix(rnorm(2000 * d), ncol = 2)
V <- predict(model, x = N)

## Compare with training data
layout(t(1:2))
opar <- par(pty = "s", pch = 20, cex = 0.7)
plot(U[1:2000,], xlab = expression(U[1]), ylab = expression(U[2]))
plot(V,          xlab = expression(V[1]), ylab = expression(V[2])) # => not close anymore!
par(opar)
layout(1)
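
For completeness, a simple numerical check (just a sketch, not part of the MWE I ran) can flag the problem even though the loss values look fine, by comparing the dependence and the margins of U and V directly:

if(FALSE) {
    cor(U[1:2000, ], method = "kendall") # rank correlation in the training data
    cor(V,           method = "kendall") # rank correlation in the generated sample
    ks.test(U[1:2000, 1], V[, 1]) # two-sample Kolmogorov-Smirnov test, 1st margin
    ks.test(U[1:2000, 2], V[, 2]) # two-sample Kolmogorov-Smirnov test, 2nd margin
}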

My colleague saved the weights and the whole model he trained based on the above code, and if I pass 'N' through those, the samples are also off (more mass towards the corners). The same happens the other way around (if I send him my trained model/weights). What could possibly have changed to cause such a serious difference?
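
For reference, a minimal sketch of how such an exchange can be done (the file name here is a placeholder, not the one we actually used):

save_model_weights_hdf5(model, "gmmn_weights.h5") # save on one machine
## On the other machine: rebuild the identical architecture (in.lay/hid.lay/out.lay
## as above), then load the weights; loading the whole model instead would require
## passing the custom MMD loss via custom_objects or using compile = FALSE.
load_model_weights_hdf5(model, "gmmn_weights.h5")
V <- predict(model, x = matrix(rnorm(2000 * d), ncol = 2))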

I saw on https://github.com/t-kalinowski/deep-learning-with-R-2nd-edition-code/issues/6#issuecomment-1517721141 that one might need to tell the optimizer before fit() which variables it will be modifying... Is this related? But then why are the losses close yet the samples so different (mine are always symmetric and look more normally distributed, but should be asymmetric)?

Below is more information about the two sessions (mine and my colleague's). The only difference we found is that if we both run class(model), his output starts with "keras.engine.training.Model" while mine starts with "keras.engine.functional.Functional" (followed by "keras.engine.training.Model"). But even calling keras:::predict.keras.engine.training.Model() directly did not make a difference. Nothing in the above code has changed since it last worked for me, so the cause must be a change in TensorFlow/Keras (perhaps on macOS only?). Any hunch? I'm happy to provide (even) more details.
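
A compact way to compare two setups like ours side by side (standard version queries only; just a sketch):

packageVersion("keras")       # R package version
packageVersion("tensorflow")  # R package version
tensorflow::tf_version()      # TensorFlow version actually loaded
reticulate::py_config()       # Python/virtualenv in use
class(model)                  # class chain of the built model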

Thanks & cheers, Marius

Info about my session

Python, TensorFlow, Keras were installed via:

install.packages("remotes")
remotes::install_github("rstudio/keras")
reticulate::install_python()
keras::install_keras()

reticulate::py_config() shows:

python:         /Users/mhofert/.virtualenvs/r-tensorflow/bin/python
libpython:      /Users/mhofert/.pyenv/versions/3.9.18/lib/libpython3.9.dylib
pythonhome:     /Users/mhofert/.virtualenvs/r-tensorflow:/Users/mhofert/.virtualenvs/r-tensorflow
version:        3.9.18 (main, Feb 29 2024, 14:28:41)  [Clang 15.0.0 (clang-1500.1.0.2.5)]
numpy:          /Users/mhofert/.virtualenvs/r-tensorflow/lib/python3.9/site-packages/numpy
numpy_version:  1.24.3
tensorflow:     /Users/mhofert/.virtualenvs/r-tensorflow/lib/python3.9/site-packages/tensorflow
NOTE: Python version was forced by import("tensorflow")

sessionInfo() shows (note: I also installed the R package tensorflow in version 2.13.0 but it didn't solve the problem):

## Output:
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.3.1

## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

## time zone: Asia/Hong_Kong
## tzcode source: internal

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base

## other attached packages:
## [1] keras_2.13.0      tensorflow_2.15.0

## loaded via a namespace (and not attached):
##  [1] R6_2.5.1          base64enc_0.1-3   Matrix_1.6-1.1    lattice_0.21-9
##  [5] reticulate_1.35.0 magrittr_2.0.3    generics_0.1.3    png_0.1-8
##  [9] lifecycle_1.0.4   cli_3.6.2         grid_4.3.2        zeallot_0.1.0
## [13] tfruns_1.5.2      compiler_4.3.2    rprojroot_2.0.4   here_1.0.1
## [17] whisker_0.4.1     Rcpp_1.0.12       rlang_1.1.3       jsonlite_1.8.8

Info about my colleague's session

His reticulate::py_config() shows:

python:         C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate/python.exe
libpython:      C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate/python36.dll
pythonhome:     C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate
version:        3.6.12 |Anaconda, Inc.| (default, Sep  9 2020, 00:29:25) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/avina/AppData/Local/r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
numpy_version:  1.19.5

His sessionInfo() shows:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] keras_2.9.0      tensorflow_2.9.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3    here_1.0.1      lattice_0.20-41 png_0.1-7       rprojroot_2.0.3 zeallot_0.1.0
 [7] rappdirs_0.3.3  grid_4.0.2      R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3  cli_3.2.0
[13] rlang_1.0.2     tfruns_1.5.0    whisker_0.4     Matrix_1.2-18   reticulate_1.24 generics_0.1.2
[19] tools_4.0.2     compiler_4.0.2  base64enc_0.1-3
[Screenshot attachment: my_output]
mhofert commented 9 months ago

I cleaned everything (Python, TensorFlow, Keras) and installed Keras again the way I used to (essentially manually). This time it ran without errors but still produced wrong samples. I then realized that

install.packages("keras") 
reticulate::install_python() 
keras::install_keras() 

does essentially the same thing -- and actually ignores whatever I install manually (conda, location of virtual environments, ...). I then looked into keras::install_keras() and realized that it uses version = "default" by default, which corresponds to TensorFlow 2.13 (but I know that my colleague used TensorFlow 2.15 and got the code to produce the correct samples). I then did:

install.packages("keras") 
reticulate::install_python() 
keras::install_keras(version = "release") 

and it solved the problem! This is reproducible: if I call keras::install_keras() again, it fails again. As mentioned before, note that nothing indicates the failure (very similar loss values, no sign of wrong training).
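
After the reinstall, a quick check (standard calls, nothing specific to this issue) that the expected versions are actually picked up:

tensorflow::tf_version()  # TensorFlow version now seen by R
reticulate::py_config()   # which virtualenv/Python is being used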

Here is a plot of the correct samples:

[Screenshot 2024-03-02 at 14:38:20]
t-kalinowski commented 9 months ago

Hi, thanks for reporting.

Running your code, I can't reproduce the issue. I suspect it ultimately boils down to an issue with older builds of tensorflow-metal or tensorflow-macos, the M1-specific builds provided by Apple. Early versions of these had some bugs related to random tensor generation, and it's possible the current versions have them too.

Fortunately, beginning with TF 2.16 (available as an RC now, should be released soon), we'll no longer need to install tensorflow-macos, as the necessary parts to make TensorFlow work on M1 Macs are now part of the official build.

If for some reason you need to run an older version of TensorFlow on an M1 Mac, you can skip tensorflow-macos and force the tensorflow-cpu package:

tensorflow::install_tensorflow(metal = FALSE, version = "2.13-cpu")