Closed: jonbry closed this issue 6 months ago
Here's the plot of the mini Xception model with keras3 on the Mac:
I forgot to copy the session and package info, but I can run it again if it's helpful. It's crazy that running this model with an old NVIDIA graphics card (even one that is ~7 years old) is 4x faster than an M1 Mac. It'll be interesting to see how this changes with MLX.
Looking into this now. If I'm understanding correctly, the model fails to train only on Linux when using the GPU, and it trains fine when using the CPU only on the Mac?
It only fails when using keras3 on the Linux computer, which should be using the GPU. It works if I use the latest keras package on the Linux computer, and when using keras3 on the Mac.
My best guess right now is that this is an upstream bug related to CUDA/TensorFlow/Keras. Disabling the GPU on Linux (e.g., Sys.setenv(CUDA_VISIBLE_DEVICES = "")) would probably produce the same results seen on the Mac.
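A Python-side sketch of the same idea (the R Sys.setenv() call above is the equivalent); TensorFlow itself is deliberately not imported here, so the comments describe the expected effect rather than demonstrate it:

```python
import os

# CUDA_VISIBLE_DEVICES is read once, when the CUDA runtime initializes, so it
# must be set before `import tensorflow` (or before reticulate loads
# TensorFlow on the R side). An empty value hides every GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# After `import tensorflow as tf`, the check used later in this thread,
#   tf.config.list_physical_devices("GPU")
# would then return an empty list, and everything runs on the CPU.
```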
Ok, I'll give this a shot. So could the bug affect the keras3 package but not keras?
I can reproduce on Linux with a GPU:
I can't tell you how happy this makes me. I was reading through the Keras 3 migration docs last night to see if something may have changed, before I realized I could just try it on the Mac. I was very confused when it worked on the Mac but not on Linux. I'm running it (slowly) on the CPU right now and will let you know how it turns out.
Same behavior with use_backend("jax")
I'm calling it at 31 epochs. Same results with keras3/CPU as keras3/GPU. If I run tf$config$list_physical_devices("GPU") and it returns list(), does that mean it's using the CPU for both keras and tensorflow?
Yes (assuming keras is using the tensorflow backend)
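For context on the "assuming keras is using the tensorflow backend" caveat: Keras 3 resolves its backend once at import time. A stdlib-only sketch of that resolution logic (an assumption-laden simplification: it ignores the ~/.keras/keras.json config file, which Keras also consults):

```python
import os

def resolve_keras_backend(env=None):
    """Sketch of Keras 3's backend choice: KERAS_BACKEND if set, else TensorFlow."""
    env = os.environ if env is None else env
    return env.get("KERAS_BACKEND", "tensorflow")

print(resolve_keras_backend({}))                        # tensorflow
print(resolve_keras_backend({"KERAS_BACKEND": "jax"}))  # jax
```

This is why checking tf$config$list_physical_devices() also tells you which devices Keras will use when nothing overrides the default backend.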
Hmm, using TF 2.16 with {keras} 2 (via Sys.setenv(TF_USE_LEGACY_KERAS = "1")), the issue is still present. I'm starting to suspect the issue is with TF 2.16. Will try TF 2.15 next.
Note to self: needed some code changes to get this working:

```r
Sys.setenv(TF_USE_LEGACY_KERAS = "1")

# Splice the legacy Keras 2 class names into the class chain reticulate
# reports for Keras 3 objects, so existing S3 methods still dispatch.
reticulate::register_class_filter(function(x) {
  if (!is.na(m <- match("keras.src.models.model.Model", x)))
    x <- unique(append(x, "keras.engine.training.Model", after = m))
  if (!is.na(m <- match("keras.src.models.sequential.Sequential", x)))
    x <- unique(append(x, "keras.engine.sequential.Sequential", after = m))
  x
})

# Shims for APIs that changed between keras and keras3
load_model <- load_model_tf
dim <- function(x) unlist(x$shape)
```
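For anyone curious what the register_class_filter() call is doing, here is a plain-Python sketch of the aliasing idea: given the class-name chain for a Keras 3 object, splice the legacy Keras 2 name in right after the new one so old methods still match. The helper name is hypothetical; reticulate's real filter operates on R class vectors.

```python
# Legacy Keras 2 names keyed by their Keras 3 ("keras.src") counterparts.
ALIASES = {
    "keras.src.models.model.Model": "keras.engine.training.Model",
    "keras.src.models.sequential.Sequential": "keras.engine.sequential.Sequential",
}

def add_legacy_aliases(classes):
    """Insert each legacy alias immediately after its Keras 3 class name."""
    out = []
    for name in classes:
        out.append(name)
        alias = ALIASES.get(name)
        if alias and alias not in classes:
            out.append(alias)
    return out

print(add_legacy_aliases(["keras.src.models.model.Model", "python.builtin.object"]))
# ['keras.src.models.model.Model', 'keras.engine.training.Model', 'python.builtin.object']
```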
Yep, using tensorflow. On the Mac, I am using tensorflow-metal, which is enabled for GPU. Does that mean that running the keras3 model on the Mac was on the GPU rather than the CPU?
Everything seems to be working fine with R {keras}, TF 2.15, and Keras 2.15 (the default Keras in TF 2.15).
It's noteworthy that training is almost 2x slower in TF 2.16 vs TF 2.15 for this example.
Running the same code in Python also fails to train the model, which confirms that the bug is not in the R interface but somewhere else in the stack:
Python code (adapted from here)
Since this is happening with both TensorFlow and JAX on Linux, regardless of whether it's running on the CPU or GPU, would it make sense for me to open an issue in the Keras repository? I went through the Keras issues created in the last few months and didn't see any related to this one. It seems most people hit this symptom when they use the wrong activation in the last layer (not sigmoid), which isn't the case here.
It also appears to affect this example on Linux with keras3. For some reason it actually runs without throwing the ellipse error, but the accuracy is still 50%.
Yes please! I was planning to open an issue today... ~~I think it might be related to the serializing callback and tied to https://github.com/rstudio/reticulate/issues/1601.~~ Still trying to get the full picture.
If you file an issue upstream, please link back here and I'll add context as needed.
I just opened an issue with keras: https://github.com/keras-team/keras/issues/19623
Looks good:
Since Keras 3.3.3 fixes the issue, would you like me to close this issue, or keep it open until v3.3.3 gets included in keras3?
Thanks for all of your help getting this resolved!
@jonbry This report helped flush out two very excellent bugs.
Fix 1: https://github.com/keras-team/keras/commit/5883a25f1b7c6eacc3f21f1821751a4109700796
Fix 2: https://github.com/rstudio/reticulate/pull/1602
A heartfelt Thank You!
Please keep the bug reports coming!
cc: @fchollet
I have noticed a strange issue with the mini Xception model from Chapter 9. When running it on the Linux machine, it runs smoothly with keras but not with keras3. I haven't run into this issue with other examples since the latest 2.15 release of keras, and just wanted to see what I may be doing to cause the issue.

The Linux machine has both keras 2.15 and keras3 0.2.0 installed. When running the code with keras3, only the keras3 package is attached (the terminal shows the r-keras environment). Here are the metric plots using both packages (I restarted RStudio between each example):

With the keras package:

With the keras3 package:

The accuracy using keras3 is basically 50% across all epochs, which was strange. I ran the same code on a Mac that only had keras3 installed and got similar results to the model using the keras package. You can find the sessionInfo() and py_list_packages() output below for each package on the Linux machine:

Mini Xception-like model with keras

Mini Xception with keras3

Let me know if there is any additional information I can provide to help troubleshoot the issue.

Thank you!