mhofert opened 9 months ago
I cleaned everything (Python, TensorFlow, Keras) and installed Keras the way I used to (essentially manually). It now ran without errors but still produced wrong samples. I then realized that
install.packages("keras")
reticulate::install_python()
keras::install_keras()
does essentially the same thing -- and actually ignores whatever I install manually (conda, the location of virtual environments, ...). I then looked into keras::install_keras() and realized that it uses version = "default" by default, which resolves to 2.13 (but I know that my colleague used TensorFlow 2.15 and got the code to produce the correct samples). I then did:
install.packages("keras")
reticulate::install_python()
keras::install_keras(version = "release")
and it solved the problem! This is reproducible: if I call keras::install_keras() again, it fails again. As I mentioned before, note that there is nothing that indicates the failure (very similar loss values, no indication of wrong training).
Here is a plot of the correct samples:
Hi, thanks for reporting.
Running your code, I can't reproduce the issue. I suspect that this ultimately boils down to an issue with older builds of tensorflow-metal or tensorflow-macos, the M1-specific builds provided by Apple. The early versions of them had some bugs related to random tensor generation, and it's possible the current versions have them too.
Fortunately, beginning with TF 2.16 (available as an RC now, should be released soon), we'll no longer need to install tensorflow-macos, as the parts needed to make tensorflow work on M1 Macs are now part of the official build.
If for some reason you need to run an older version of tensorflow on an M1 Mac, you can skip tensorflow-macos and force the tensorflow-cpu package:
tensorflow::install_tensorflow(metal = FALSE, version = "2.13-cpu")
Hi,
After a recent update of Python/TensorFlow/Keras, a minimal working example (MWE) I used to run to produce samples from a target distribution no longer produces such samples (close, but clearly from a different distribution; see the attached screenshots below). After more than 24 hours of searching for the needle in the haystack, I'm still clueless. A colleague ran the MWE under his setup on Windows with older versions of Python/TensorFlow/Keras and obtained the correct samples as we always did. And so did another colleague on macOS. Our loss functions also produce very similar values, so we are still unsure whether the problem lies in keras' fit() or predict().
Here is the full story, which by now I consider a 'bug', in the hope that others may find this post when they realize their networks no longer train/predict properly. The biggest issue is that this can remain entirely undetected, as the loss functions don't indicate any problem... hence this post. It also means that certain R packages (e.g. 'gnn') can currently work for some users (my colleague) but not others (myself) without any warning.
The MWE trains a single-hidden-layer neural network (NN) to act as a random number generator (RNG). I pass iid N(0,1) samples through the NN and then compare the output to given dependent multivariate samples from some target distribution (here: scaled ranks of absolute values of correlated normals) with the maximum mean discrepancy (MMD) loss function that we implemented (jointly with the NN, this is called a GMMN, a generative moment matching network).
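For readers unfamiliar with the MMD criterion used here: the idea is that the squared MMD between two samples is small when they come from the same distribution and large otherwise. Below is a minimal sketch in Python/NumPy, purely for illustration; the Gaussian kernel, the fixed bandwidth, and the biased (V-statistic) estimator are my assumptions and not necessarily what the actual R/TensorFlow implementation in the MWE does.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 h^2))
    d2 = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    # Biased (V-statistic) estimate of squared MMD:
    # E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 2))
same = mmd2(x, rng.normal(size=(500, 2)))           # same distribution
diff = mmd2(x, rng.normal(loc=3.0, size=(500, 2)))  # shifted distribution
```

Here `same` is close to zero while `diff` is clearly larger; a GMMN minimizes such a statistic between transformed N(0,1) inputs and the target sample.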
The MWE below worked well with R running inside a virtual Python environment (set up with Miniforge3 on my first-generation M1 14" MacBook Pro) and TensorFlow installed via "conda install -c apple tensorflow-deps" and "python -m pip install tensorflow-metal". This worked until about a year ago. When I wanted to run the MWE again this week, I received:
After reinstalling Python/TensorFlow/Keras in the exact way as I used to do, I still received this error. I then read on https://github.com/t-kalinowski/deep-learning-with-R-2nd-edition-code/issues/3 that the following is the (now) recommended way to install Python/TensorFlow/Keras on all platforms, so I did:
After that, the MWE ran again. However, it no longer generated proper samples from the target distribution. I cannot go back to older versions of the R package 'keras', as the above error then appears again.
Here is the MWE with sessionInfo() etc., also with the outputs of my colleague (on Windows). Again, he obtains very similar loss values, but my generated samples look normal and no longer asymmetric as they should be (his are fine).
My colleague saved the weights and the whole model he trained based on the above code, and if I pass 'N' through those, the samples are also off (more mass towards the corners). The same happens the other way around (if I send him my trained model/weights). What could possibly have changed to cause such a serious difference?
I saw on https://github.com/t-kalinowski/deep-learning-with-R-2nd-edition-code/issues/6#issuecomment-1517721141 that one might need to tell the optimizer before fit() which variables it will be modifying... Is this related? But then why are the losses close yet the samples so different (they are always symmetric and more normally distributed, but should be asymmetric)?
Below is more information about the two sessions (mine and my colleague's). The only difference we found is that if we both run class(model), his output starts with "keras.engine.training.Model" while mine starts with "keras.engine.functional.Functional" (followed by "keras.engine.training.Model"). But even calling keras:::predict.keras.engine.training.Model() directly did not make a difference. Nothing in the above code was changed since this last worked for me, so it must be due to a change in TensorFlow/Keras (perhaps on macOS only?). Any hunch? I'm happy to provide (even) more details.
Thanks & cheers, Marius
Info about my session
Python, TensorFlow, Keras were installed via:
reticulate::py_config() shows:
sessionInfo() shows (note: I also installed the R package tensorflow in version 2.13.0 but it didn't solve the problem):
Info about my colleague's session
His reticulate::py_config() shows:
His sessionInfo() shows: