keras-team / keras

Deep Learning for humans
http://keras.io/

Reduced half-precision speed when using ROCm devices #18885

Open KegangWangCCNU opened 10 months ago

KegangWangCCNU commented 10 months ago

Hello, I am currently using the AMD Instinct MI50 GPU to train models. It has 26 TFLOPS of fp16 and 13 TFLOPS of fp32 compute power, but it lacks tensor cores.

My experiments on PyTorch indicate that Torch can benefit from the MI50's fp16 capabilities, achieving a slight acceleration for both training and inference. However, when testing with Keras on a TensorFlow 2.13 backend, it seems that inference speed for ResNet101 is much slower compared to Torch. Moreover, there is no acceleration gain from mixed precision; in fact, enabling mixed precision during training actually slows down the process (whereas enabling AMP in Torch results in faster speeds).

I would like to understand whether this discrepancy could be due to Keras lacking support for AMD GPUs, or more generally for GPUs that have half-precision capability but no tensor cores.

[screenshots: benchmark results]

I hope to achieve acceleration with the MI50 when fp16 mixed precision is enabled, rather than a slowdown.
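(For reference, a minimal sketch of the kind of benchmark described here, assuming the Keras 2.13 API bundled with tensorflow-rocm 2.13 and a random-weight ResNet101; the exact scripts behind the screenshots are not included in the issue.)

```python
import time
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Comment this line out to measure the float32 baseline.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.applications.ResNet101(weights=None)
x = tf.random.normal((32, 224, 224, 3))

model(x, training=False)  # warm-up: builds the model and initializes kernels

start = time.perf_counter()
for _ in range(10):
    _ = model(x, training=False).numpy()  # .numpy() forces the GPU work to finish
print(f"mean forward-pass latency: {(time.perf_counter() - start) / 10:.3f} s")
```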

Looking forward to your reply!

sachinprasadhs commented 10 months ago

Hi,

Thanks for reporting the issue.

Could you please try the same with Keras 3 and let us know your analysis.

```python
!pip install keras==3.0.0
```

```python
import os
os.environ['KERAS_BACKEND'] = 'tensorflow'

import keras
from keras import mixed_precision

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

model = keras.applications.resnet.ResNet101(weights=None)
ipt = keras.random.normal((32, 224, 224, 3), dtype="float32")
model(ipt)
```

```python
%%timeit
model(ipt)
```
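(A possible follow-up to the snippet above, not part of the original comment: eager `model(ipt)` calls dispatch op-by-op, so wrapping the forward pass in `tf.function` can help separate kernel speed from Python/dispatch overhead when comparing float32 with mixed_float16. A sketch, assuming the same `model` and `ipt` as above:)

```python
import tensorflow as tf

compiled = tf.function(model)  # trace the Keras forward pass into a graph
compiled(ipt)                  # first call traces/compiles; exclude it from timing

%timeit compiled(ipt)          # IPython line magic
```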
KegangWangCCNU commented 10 months ago

Thank you for your reply. I have retested using Keras 3.0 + TensorFlow 2.13 and did not see any improvement; the running speed is still much slower than Torch.

![image](https://github.com/keras-team/keras/assets/132054009/687f104b-19fd-46bd-89a4-e521bc3acff2)

When using torch as the backend, the problem becomes even more severe:

![image](https://github.com/keras-team/keras/assets/132054009/bdaa7fff-8a0a-4078-b8cb-3b986cc3ddca)

When I use Torch alone, without Keras, the results look quite good: the GPU's compute power is fully utilized, and mixed precision acceleration works. A similar situation also occurs with TensorFlow, so I am almost certain that the problem lies with Keras.

![image](https://github.com/keras-team/keras/assets/132054009/dd380e18-8022-4af4-b962-9173cdfdb8ce)
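(For context, a minimal sketch of the kind of standalone Torch AMP benchmark referred to here, assuming a torchvision ResNet101 and float16 autocast; the original scripts are only shown as screenshots. On ROCm builds of PyTorch the CUDA device API is used as-is, so `.cuda()` and `torch.autocast("cuda", ...)` apply to the MI50 as well.)

```python
import time
import torch
import torchvision

model = torchvision.models.resnet101().cuda().eval()
x = torch.randn(32, 3, 224, 224).cuda()

def bench(use_amp, iters=10):
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16, enabled=use_amp):
        model(x)                        # warm-up
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print("fp32 mean latency:", bench(False))
print("amp  mean latency:", bench(True))
```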

I hope to continue using Keras for my work because it is user-friendly and very compatible with Jupyter Notebook, but currently I am unable to resolve its performance issues. I hope you can help me!

sachinprasadhs commented 10 months ago

To use Keras 3, please use TensorFlow 2.15 or greater. The compatible combinations are:

- tensorflow==2.15.0 & keras==3.0.0
- tensorflow==2.16.0 & keras==3.0.0
KegangWangCCNU commented 10 months ago

Thank you for the reply. I will try version 2.15 of tensorflow-rocm when it becomes available. However, I don't think it's an issue with the tensorflow version, as the same problem occurs with keras 3.0 using a torch backend. Torch performs well on its own, but becomes extremely slow when used with keras.

Furthermore, when working with tensorflow-rocm 2.13 and keras 2.13 together, I am unable to get half-precision acceleration from keras; however, I can achieve acceleration when using tensorflow-rocm alone. When enabling mixed precision in keras, it warns that the CUDA compute capability is below 7.0, which might be the cause: AMD GPUs do not report a CUDA compute capability, so half precision may effectively be disabled.
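(A quick way to inspect what compute capability each framework reports for the MI50, which is what the warning above keys on; this is only a sketch, and the values reported by ROCm builds may differ.)

```python
import torch
print(torch.cuda.get_device_capability(0))     # reported as a (major, minor) tuple

import tensorflow as tf
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    details = tf.config.experimental.get_device_details(gpus[0])
    print(details.get("compute_capability"))    # may be absent on ROCm builds
```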

haifeng-jin commented 10 months ago

@KegangWangCCNU, thanks for the issue! With other frameworks on ROCm, do we see a speedup when using mixed precision? What is the expected speedup?

Unfortunately, we do not have a testing environment for ROCm devices, so I would really appreciate it if you could run the test.

Thanks!

KegangWangCCNU commented 10 months ago

@haifeng-jin Yes, the GPU I am using is the Instinct MI50, which has double-rate half-precision compute and lacks tensor cores. In my tests, it achieves mixed-precision acceleration in PyTorch (see the screenshot above), and I have tested TensorFlow and JAX as well; both can be accelerated with half precision.

When I use these backends in Keras 3.0, computation becomes particularly slow, and mixed precision slows it down further. I suspect it may be related to CUDA compute capability; Keras checks whether the capability is at least 7.0 and might conclude that this GPU has no FP16 capability. For comparison, Torch reports a CUDA capability of 9.0 for all AMD GPUs to avoid disabling any type of acceleration.
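(To rule Keras out entirely, a framework-only fp16 vs fp32 check along the lines described here could look like the sketch below, written for JAX; a plain matmul benchmark should show roughly the 2x fp16 throughput the MI50 advertises.)

```python
import time
import jax
import jax.numpy as jnp

def bench(dtype, n=4096, iters=20):
    a = jax.random.normal(jax.random.PRNGKey(0), (n, n), dtype=dtype)
    f = jax.jit(lambda m: m @ m)
    f(a).block_until_ready()                      # compile + warm-up
    start = time.perf_counter()
    for _ in range(iters):
        f(a).block_until_ready()
    return (time.perf_counter() - start) / iters

print("float32:", bench(jnp.float32))
print("float16:", bench(jnp.float16))
```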

haifeng-jin commented 6 months ago

Since I do not have an environment to easily reproduce the problem, I am unassigning myself and waiting for another round of triage. I suggest marking it as "contribution-welcome".

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.