Closed: andapra closed this issue 1 year ago
I just resolved it after intensive trial and error. The main cause of the kernel dying during training is that I used image augmentation with rotation and flip (tf.keras.layers.RandomRotation, tf.keras.layers.RandomFlip).
I found a TensorFlow GitHub discussion indicating that TensorFlow 2.9 and 2.10 have bugs related to these augmentation layers: https://github.com/tensorflow/tensorflow/issues/56242. Some suggested downgrading to 2.8.3, but I found that fix doubtful because when I tried it I got an error while importing the TensorFlow library.
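As an illustration (a hypothetical sketch, not the exact model from this issue), this is the kind of preprocessing block that triggers the crash; one possible way to keep similar augmentation without those layers is to use tf.image ops inside Dataset.map:

```python
import tensorflow as tf

# The kind of Keras preprocessing block that caused the kernel to die
# under TensorFlow 2.9/2.10 with the DirectML plugin (hypothetical sketch).
augment_layers = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

# Possible workaround: do similar augmentation with tf.image ops inside
# Dataset.map instead of the preprocessing layers (rot90 only rotates by
# multiples of 90 degrees, so it is not an exact equivalent).
def augment_fn(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    return image, label

# train_ds is assumed to be a tf.data.Dataset of (image, label) pairs:
# train_ds = train_ds.map(augment_fn, num_parallel_calls=tf.data.AUTOTUNE)
```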
===================
My system:
AMD Ryzen 7 5700G (8 cores / 16 threads)
RX 6700 XT 12 GB GDDR6
64 GB DDR4 RAM
Latest GPU driver
Miniconda with Python 3.9
Windows 11, WSL Ubuntu-20.04
Libraries: tensorflow 2.10, tensorflow-directml-plugin
Did you solve this problem? How did you solve it?
Hi, I was trying to train an image classification model using tensorflow-directml. The TensorFlow version I use is tensorflow-cpu 2.9 with the directml-plugin. The problem is that during training the kernel always dies, and whenever I rerun up to the training step it dies again.
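For reference, a minimal check (assuming tensorflow-cpu 2.9 with tensorflow-directml-plugin installed) that the DirectML device is visible to TensorFlow and that ops can be placed on it:

```python
import tensorflow as tf

# With tensorflow-cpu + tensorflow-directml-plugin, the DirectML card
# is exposed as a pluggable device of type GPU.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))

# Pinning ops to that device explicitly (device index assumed to be 0):
with tf.device("/GPU:0"):
    x = tf.random.normal([4, 32, 32, 3])
    print(tf.reduce_mean(x).numpy())
```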
Here are the logs from jupyter-notebook:
2022-10-30 07:38:40.547593: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-10-30 07:38:40.547783: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 40804 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id:)
2022-10-30 07:38:44.853552: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-10-30 07:38:46.768811: F tensorflow/c/logging.cc:43] Check failed: it != allocations_byid.end()
[I 07:39:03.462 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
The failure log says "Check failed: it != allocations_byid.end()". I have googled it for days but found no resolution.
The one workaround I tried, reducing the training batch size, had no effect.
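For completeness, the batch-size change I tried was along these lines (a sketch with placeholder shapes and values, not the actual dataset):

```python
import tensorflow as tf

# Placeholder stand-in for the real image dataset.
images = tf.zeros([64, 224, 224, 3])
labels = tf.zeros([64], dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels))

# Smaller training batches (e.g. 32 -> 8) in the hope of avoiding the
# allocation failure; in this case it made no difference.
train_ds = train_ds.batch(8).prefetch(tf.data.AUTOTUNE)

# Equivalent when passing arrays straight to fit():
# model.fit(x_train, y_train, batch_size=8, epochs=10)
```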
=======================
My system:
AMD Ryzen 7 5700G (8 cores / 16 threads)
RX 6700 XT 12 GB GDDR6
64 GB DDR4 RAM
Latest GPU driver
Miniconda with Python 3.9
Windows 11, latest update