microsoft / tensorflow-directml

Fork of TensorFlow accelerated by DirectML
Apache License 2.0

LSTM 8x slow on gpu #367

Open onurberkay opened 2 years ago

onurberkay commented 2 years ago

image

Train on 36090 samples
2022-04-25 21:56:12.505195: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library C:\Users\onurb\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python/directml.24bfac66e4ee42ec393a5fb471412d0177bc7bcf.dll
2022-04-25 21:56:12.506028: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library dxgi.dll
2022-04-25 21:56:12.509302: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library d3d12.dll
2022-04-25 21:56:12.961954: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:250] DirectML device enumeration: found 1 compatible adapters.
2022-04-25 21:56:12.962441: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2022-04-25 21:56:12.966749: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:186] DirectML: creating device on adapter 0 (AMD Radeon(TM) Graphics)
2022-04-25 21:56:13.055907: I tensorflow/stream_executor/platform/default/dso_loader.cc:97] Successfully opened dynamic library Kernel32.dll
36090/36090 - 232s - loss: 0.0014 - acc: 0.0396
Train on 36090 samples

When using only the CPU, it takes 30-40s, so there is a huge difference. It also looks like the GPU is not taking any load. I am using a 4750U APU.

PatriceVignola commented 2 years ago

Is there a repro script that you would be able to provide us? Otherwise, it would help if you could send us the device placement logs. Run tf.debugging.set_log_device_placement(True) before redirecting the output to a file.
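A minimal sketch of how the device placement logs could be switched on at the top of a training script. The helper name `enable_placement_logging` is hypothetical, and the import guard is only there so the snippet also runs on a machine without TensorFlow installed; the one TensorFlow call used is the `tf.debugging.set_log_device_placement(True)` mentioned above.

```python
# Sketch: enable device placement logging before any model is built.
try:
    import tensorflow as tf
except ImportError:  # TensorFlow is not installed in this environment
    tf = None


def enable_placement_logging():
    """Turn on device placement logging; return True if it was enabled.

    Hypothetical helper: it only wraps tf.debugging.set_log_device_placement.
    """
    if tf is None:
        return False
    tf.debugging.set_log_device_placement(True)
    return True


enabled = enable_placement_logging()
print("placement logging enabled:", enabled)
# Build and train the model after this point so its ops are logged.
```

Assuming the script is saved as `train.py` (a hypothetical name), the output could then be captured with something like `python train.py > out.txt 2>&1`. Redirecting both streams is a precaution, since the placement messages may be written to stderr rather than stdout.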

onurberkay commented 2 years ago

out.txt

@PatriceVignola It's just a simple script. It needs some libraries to run: pip install yfinance / pip install scikit-learn / pip install matplotlib

https://www.online-python.com/2Pa6iM1QZ3

PatriceVignola commented 2 years ago

There are 2 issues that I noticed at a cursory glance:

  1. The model uses a Qr operator internally, which isn't supported on DML (it isn't supported on CUDA either, but they "fake" register it to run on the CPU in order to enable device colocation on CUDA). We can do the same thing that CUDA does here and register it the same way for DML, and we might see some marginal perf improvements.
  2. The fact that you only have 1% load on the GPU is worrying. On my desktop, I see at least 40% throughout the whole training process when running the script that you linked. We haven't really tested tensorflow-directml on AMD APUs yet, but our experience with many integrated graphics in the past is that it's just faster to run everything on the CPU. For integrated graphics to be worth using, they have to be powerful enough to offset the cost of transferring data between the CPU and the GPU. I'll see if I can get my hands on a 4750 and investigate more.
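
One rough way to check whether a given integrated GPU pays off is to time the same workload on each device and compare. The sketch below is only an illustration: `time_call` and `slowdown` are hypothetical helpers, and the figures passed to `slowdown` are the epoch times reported earlier in this thread (232s on the GPU, 30-40s on the CPU).

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(gpu_seconds, cpu_seconds):
    """Return how many times slower the GPU run is than the CPU run."""
    return gpu_seconds / cpu_seconds


# Epoch times reported in this thread: 232s on the GPU, 30-40s on the CPU.
low = slowdown(232, 40)
high = slowdown(232, 30)
print(f"GPU is roughly {low:.1f}x to {high:.1f}x slower than the CPU")
```

In practice, `time_call` would wrap one training epoch per device. A ratio above 1 means the transfer overhead is not being offset by the GPU's compute throughput.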

onurberkay commented 2 years ago

I tried a heavy model with Dense layers on the GPU, and it is faster than on the CPU. The GPU usage stats are low again, but I think that must be a problem with the stats. Will the first change be added, and if so, when? I can run tests any time. Thanks for the answers.

model.add(Dense(2000,kernel_regularizer=regularizers.l2(0.00000000001)))
model.add(Dense(2000,kernel_regularizer=regularizers.l2(0.00000000001)))
model.add(Dense(2000,kernel_regularizer=regularizers.l2(0.00000000001)))
model.add(Dense(2000,kernel_regularizer=regularizers.l2(0.00000000001)))
model.add(Dense(2000,kernel_regularizer=regularizers.l2(0.00000000001)))
model.add(Dense(2000,kernel_regularizer=regularizers.l2(0.00000000001)))

RichardErkhov commented 2 years ago

I might be too late, but I think the 89°C temperature is the problem. Try to cool it down; it might just be a throttling issue.