Training model causes kernel dead and restarting

andapra commented 1 year ago

Hi, I was trying to train an image classification model using tensorflow-directml. The TensorFlow version I use is tensorflow-cpu 2.9 with the directml-plugin. The problem is that the kernel always dies during training, and whenever I rerun the notebook up to the training step, it dies again.

Here are the logs from jupyter-notebook:

2022-10-30 07:38:40.547593: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-10-30 07:38:40.547783: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 40804 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: )
2022-10-30 07:38:44.853552: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-10-30 07:38:46.768811: F tensorflow/c/logging.cc:43] Check failed: it != allocations_byid.end()
[I 07:39:03.462 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports

The fatal log line says "Check failed: it != allocations_byid.end()". I have googled it for days but found no resolution.
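For anyone scanning a longer notebook log for the line that actually killed the kernel, here is a minimal sketch that picks out the fatal entries. It assumes the usual TensorFlow C++ log layout seen above (timestamp, then a severity letter I/W/E/F, then file:line). The names LOG_LINE and fatal_lines are hypothetical helpers, not part of TensorFlow or Jupyter.

```python
import re

# Assumed layout: "<date> <time>: <LEVEL> <file>:<line>] <message>"
# LEVEL is I (info), W (warning), E (error) or F (fatal).
LOG_LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+:\d+)\] (?P<message>.*)"
)


def fatal_lines(log_text):
    """Return (source, message) pairs for every fatal (F) log line."""
    results = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "F":
            results.append((match.group("source"), match.group("message")))
    return results


# Two lines copied from the log in this issue, plus the Jupyter restart notice.
sample = """\
2022-10-30 07:38:44.853552: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2022-10-30 07:38:46.768811: F tensorflow/c/logging.cc:43] Check failed: it != allocations_byid.end()
[I 07:39:03.462 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
"""

for source, message in fatal_lines(sample):
    print(source, "->", message)
```

The Jupyter line is skipped because it does not follow the TensorFlow layout, so only the failed check is reported.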

The one fix I have tried, reducing the batch size for training, had no effect.

=======================
My system:
AMD Ryzen 7 5700G, 8 cores, 16 threads
RX 6700 XT, GDDR6 12 GB
DDR4 RAM 64 GB
GPU driver: latest update
Miniconda with Python=3.9
Windows 11, latest update

andapra commented 1 year ago

I just resolved it after intensive trial and error. The main cause of the kernel dying during training is that I used image augmentation with rotation and flip (tf.keras.layers.RandomRotation, tf.keras.layers.RandomFlip).
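If you still want augmentation after taking those two layers out of the model, one possible workaround is to augment the batches on the CPU before they reach the model. The sketch below is only an illustration under stated assumptions: it uses NumPy rather than TensorFlow, it assumes square images of shape (batch, height, width, channels), and it swaps the arbitrary-angle rotation of RandomRotation for 90-degree steps. The names augment_batch and augmented_batches are hypothetical, and this has not been tested against the DirectML plugin.

```python
import numpy as np


def augment_batch(images, rng):
    """Randomly flip and rotate each image in a batch on the CPU.

    images: array of shape (batch, height, width, channels) with
    height == width, so 90-degree rotations keep the shape.
    """
    out = np.empty_like(images)
    for i, img in enumerate(images):
        if rng.random() < 0.5:
            img = np.fliplr(img)          # horizontal flip (width axis)
        k = int(rng.integers(0, 4))       # 0, 90, 180 or 270 degrees
        out[i] = np.rot90(img, k=k, axes=(0, 1))
    return out


def augmented_batches(images, labels, batch_size, seed=0):
    """Yield shuffled (images, labels) batches with augmentation applied."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        idx = order[start:start + batch_size]
        yield augment_batch(images[idx], rng), labels[idx]


# Tiny synthetic example: 4 images of 4x4 pixels with 3 channels.
images = np.arange(4 * 4 * 4 * 3, dtype=np.float32).reshape(4, 4, 4, 3)
labels = np.arange(4)
for batch_images, batch_labels in augmented_batches(images, labels, batch_size=3):
    print(batch_images.shape, batch_labels)
```

The generator covers the data once, so a training loop would create a new one for each epoch. Whether this avoids the crash depends on the crash really being tied to the in-model augmentation layers, which is what I observed but did not confirm further.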

I found a TensorFlow GitHub discussion reporting that TensorFlow 2.9 and 2.10 have bugs in these augmentation layers: https://github.com/tensorflow/tensorflow/issues/56242. Some people suggested downgrading to 2.8.3, but I doubt that fix, because when I tried it I got an error while importing the TensorFlow library.

===================
My system:
AMD Ryzen 7 5700G, 8 cores, 16 threads
RX 6700 XT, GDDR6 12 GB
DDR4 RAM 64 GB
GPU driver: latest update
Miniconda with Python=3.9
Windows 11, WSL Ubuntu-20.04

Libraries:
tensorflow 2.10
tensorflow-directml-plugin

owo931214 commented 1 year ago

Did you solve this problem? How did you solve it?

microsoft / tensorflow-directml-plugin

Training model causes kernel dead and restarting #325