tensorflow / addons

Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
Apache License 2.0

EDT performance in real app: GPU version is ~24x slower than CPU one #2658

Closed intervolga-school closed 1 year ago

intervolga-school commented 2 years ago

System information

Describe the bug For the last 6 months I have used the tensorflow-addons EDT in different models (interactive segmentation, semantic segmentation, ...) and found it to be very slow. But simply moving EDT to the dataset level, or even pinning it to the CPU, gave a major performance improvement.

The link below contains one of my real use cases: boundary loss. It shows that using EDT on GPU is about 24 times slower than on CPU.

Another case I have cannot be published, but moving EDT from the model into dataset.map reduces the step time from 2 seconds to 0.5 on an RTX 2080 Ti with batch size 1 and input size 768x768.
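For illustration, here is a minimal sketch of what I mean by moving EDT to the dataset level; the pipeline, function name and shapes are just placeholders, not my actual production code:

import tensorflow as tf
from tensorflow_addons.image import euclidean_dist_transform

def add_distance_map(image, mask):
    # dataset.map runs on the CPU, so the distance transform is pre-computed
    # in the input pipeline and the training step only consumes the result
    mask_u8 = tf.cast(mask, 'uint8')
    dist = euclidean_dist_transform(mask_u8, dtype='float32')
    return image, (mask, dist)

# hypothetical (image, mask) dataset, mapped before batching:
# dataset = dataset.map(add_distance_map, num_parallel_calls=tf.data.AUTOTUNE).batch(1)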

Code to reproduce the issue https://colab.research.google.com/drive/1yS0EHLwV09scUY19f87Ow7DZb89jE89c?usp=sharing

Other info / logs
loss without EDT               33s total,    32s from log
loss with EDT pinned on CPU    83s total,    59s from log
loss with EDT pinned on GPU  1462s total,  1443s from log

bhack commented 2 years ago

Last time it was optimized by @fsx950223 with https://github.com/tensorflow/addons/pull/2402

fsx950223 commented 2 years ago

The performance issue comes from other ops; you can test it via the following example:

# -*- coding: utf-8 -*-
"""edt model speed.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1yS0EHLwV09scUY19f87Ow7DZb89jE89c
"""

import numpy as np
import time
import tensorflow as tf
from keras import layers, losses, models
from keras.utils.losses_utils import ReductionV2 as Reduction
from tensorflow_addons.image import euclidean_dist_transform

# Check: some Colab instances crash with EDT on GPU
# If so, you need to wait about 1 day to obtain another VM instance (I suppose) or try another platform

with tf.device('/GPU:0'):
  check_data = np.random.uniform(size=(1, 768, 768, 1)) * 255.
  euclidean_dist_transform(check_data.astype('uint8'), dtype='float32')

class BoundarySparseCategoricalLoss(losses.LossFunctionWrapper):
    """ Proposed in: 'Boundary loss for highly unbalanced segmentation'

    Implements Equation (5) from https://arxiv.org/pdf/1812.07032v4.pdf
    """

    def __init__(self, from_logits=False, pin_cpu=True, reduction=Reduction.AUTO, name='boundary_sparse_categorical_loss'):
        super().__init__(boundary_sparse_categorical_loss, reduction=reduction, name=name, from_logits=from_logits, pin_cpu=pin_cpu)

def boundary_sparse_categorical_loss(y_true, y_pred, from_logits, pin_cpu):
    device = '/CPU:0' if pin_cpu else '/GPU:0'
    with tf.device(device):
      y_pred = tf.convert_to_tensor(y_pred)
      y_true = tf.cast(y_true, dtype='uint8')

      channels = y_pred.shape[-1]
      if channels is None:
          raise ValueError('Channel dimension of the predictions should be defined. Found `None`.')

      assert_true_rank = tf.assert_rank(y_true, 4)
      assert_pred_rank = tf.assert_rank(y_pred, 4)

      with tf.control_dependencies([assert_true_rank, assert_pred_rank]):
          if from_logits:
              if 1 == channels:
                  y_pred = tf.nn.sigmoid(y_pred)
              else:
                  y_pred = tf.nn.softmax(y_pred)

          axis_hwc = list(range(1, y_pred.shape.ndims))
          has_true = tf.reduce_any(y_true == 1, axis=axis_hwc, keepdims=True)
          has_false = tf.reduce_any(y_true == 0, axis=axis_hwc, keepdims=True)

          if 1 == channels:
              y_true = tf.cast(tf.one_hot(y_true[..., 0], 2, dtype=tf.int32), tf.uint8)
              y_pred = tf.concat([1. - y_pred, y_pred], axis=-1)
          y_false = 1 - y_true

          start = time.perf_counter()
          d_true = euclidean_dist_transform(y_true, dtype=y_pred.dtype)
          d_false = euclidean_dist_transform(y_false, dtype=y_pred.dtype)
          end = time.perf_counter()
          print("")
          print(end - start)

          distance = d_false * tf.cast(y_false, dtype=y_pred.dtype) - (d_true - 1.) * tf.cast(y_true, dtype=y_pred.dtype)
          distance = tf.where(has_true & has_false, distance, 0.)
          distance = tf.stop_gradient(distance)

          loss = y_pred * distance

          return tf.reduce_mean(loss, axis=-1)

data_x = np.random.uniform(size=(128, 768, 768, 1))

data_y = (np.random.uniform(size=(128, 768, 768, 1)) > 0.5).astype('int32')
data_y.sort(axis=1)
data_y.sort(axis=2)

data = tf.data.Dataset.from_tensor_slices((data_x, data_y))
data = data.batch(1)
data = data.repeat(5)
data = data.prefetch(tf.data.AUTOTUNE)

def test_with_loss(loss):
  model = models.Sequential([
    layers.Conv2D(64, 7, padding='same'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Conv2D(1, 3, padding='same', activation='sigmoid'),
  ])
  model.compile(
      optimizer='adam',
      loss=loss,
      run_eagerly=False
  )

  start = time.time()
  model.fit(data)
  print(time.time() - start)

# loss without EDT               33s total,    32s from log
# loss with EDT pinned on CPU    83s total,    59s from log
# loss with EDT pinned on GPU  1462s total,  1443s from log
tf.config.run_functions_eagerly(True)
tf.config.set_soft_device_placement(False)
#test_with_loss('binary_crossentropy')                        # without edt
test_with_loss(BoundarySparseCategoricalLoss(pin_cpu=True))  # with edt on cpu
test_with_loss(BoundarySparseCategoricalLoss(pin_cpu=False)) # with edt on gpu

fsx950223 commented 2 years ago

I made a mistake; I should have traced it with nvprof, and the kernel really is slow.
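For anyone who wants a quick look at the trace without nvprof, here is a rough sketch using the built-in TF profiler (the logdir and input size are just placeholders); it should show the time spent in the EDT kernel:

import numpy as np
import tensorflow as tf
from tensorflow_addons.image import euclidean_dist_transform

mask = tf.constant((np.random.uniform(size=(1, 768, 768, 1)) > 0.5).astype('uint8'))

tf.profiler.experimental.start('/tmp/edt_profile')  # placeholder logdir
with tf.device('/GPU:0'):
    _ = euclidean_dist_transform(mask, dtype='float32')
tf.profiler.experimental.stop()
# inspect the trace with: tensorboard --logdir /tmp/edt_profile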

bhack commented 2 years ago

We had another proposed implementation in:

https://github.com/tensorflow/tensorflow/issues/24410#issuecomment-811358508

fjodborg commented 2 years ago

Did you solve it?

I'm having the problem described here, but only on one of my computers. Basically, a laptop GPU (Quadro T1000) beats our server GPU (Quadro RTX 5000). On my Windows laptop it runs "smoothly" at 4 it/s using Python 3.10. On my Ubuntu 18 desktop it runs at 0.3 it/s using Python 3.9.

Both PCs run the same binary TensorFlow versions and the exact same training/evaluation code: tensorflow_gpu==2.7, tensorflow_addons==0.15.0 (I also tried tfa-nightly, v0.17.0 and v0.16.0).

I tried to debug using eager mode and found that EDT took 1.2 seconds per image.

Currently I'm trying to build everything from source, but it still seems weird to me. Also, when I removed the EDT parts, it suddenly ran blazingly fast on the desktop. On both setups the GPU is utilized at almost 100% during training and evaluation.
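For reference, a minimal way to time EDT in isolation in eager mode (example shapes only, assuming a single 768x768 binary mask), which should make the two machines directly comparable:

import time
import numpy as np
import tensorflow as tf
from tensorflow_addons.image import euclidean_dist_transform

mask = tf.constant((np.random.uniform(size=(1, 768, 768, 1)) > 0.5).astype('uint8'))

for device in ('/CPU:0', '/GPU:0'):
    with tf.device(device):
        _ = euclidean_dist_transform(mask, dtype='float32')  # warm-up
        start = time.perf_counter()
        result = euclidean_dist_transform(mask, dtype='float32')
        _ = result.numpy()  # force the op to finish before reading the clock
        print(device, time.perf_counter() - start)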

seanpmorgan commented 1 year ago

TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down

Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA: Keras, Keras-CV, Keras-NLP