intervolga-school closed this issue 1 year ago
Last time it was optimized by @fsx950223 in https://github.com/tensorflow/addons/pull/2402.
The performance issue comes from other ops; you can test it with the following example:
# -*- coding: utf-8 -*-
"""edt model speed.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1yS0EHLwV09scUY19f87Ow7DZb89jE89c
"""
import numpy as np
import time
import tensorflow as tf
from keras import layers, losses, models
from keras.utils.losses_utils import ReductionV2 as Reduction
from tensorflow_addons.image import euclidean_dist_transform
# Check: some Colab instances crash with EDT on GPU.
# If so, you may need to wait about 1 day to obtain another VM instance (I suppose) or try another platform.
with tf.device('/GPU:0'):
    check_data = np.random.uniform(size=(1, 768, 768, 1)) * 255.
    euclidean_dist_transform(check_data.astype('uint8'), dtype='float32')
class BoundarySparseCategoricalLoss(losses.LossFunctionWrapper):
    """Proposed in: 'Boundary loss for highly unbalanced segmentation'
    Implements Equation (5) from https://arxiv.org/pdf/1812.07032v4.pdf
    """
    def __init__(self, from_logits=False, pin_cpu=True, reduction=Reduction.AUTO, name='boundary_sparse_categorical_loss'):
        super().__init__(boundary_sparse_categorical_loss, reduction=reduction, name=name, from_logits=from_logits, pin_cpu=pin_cpu)
def boundary_sparse_categorical_loss(y_true, y_pred, from_logits, pin_cpu):
    device = '/CPU:0' if pin_cpu else '/GPU:0'
    with tf.device(device):
        y_pred = tf.convert_to_tensor(y_pred)
        y_true = tf.cast(y_true, dtype='uint8')

        channels = y_pred.shape[-1]
        if channels is None:
            raise ValueError('Channel dimension of the predictions should be defined. Found `None`.')

        assert_true_rank = tf.assert_rank(y_true, 4)
        assert_pred_rank = tf.assert_rank(y_pred, 4)

        with tf.control_dependencies([assert_true_rank, assert_pred_rank]):
            if from_logits:
                if 1 == channels:
                    y_pred = tf.nn.sigmoid(y_pred)
                else:
                    y_pred = tf.nn.softmax(y_pred)  # softmax for multi-class logits

            axis_hwc = list(range(1, y_pred.shape.ndims))
            has_true = tf.reduce_any(y_true == 1, axis=axis_hwc, keepdims=True)
            has_false = tf.reduce_any(y_true == 0, axis=axis_hwc, keepdims=True)

            if 1 == channels:
                # Expand the binary case to two explicit classes.
                y_true = tf.cast(tf.one_hot(y_true[..., 0], 2, dtype=tf.int32), tf.uint8)
                y_pred = tf.concat([1. - y_pred, y_pred], axis=-1)

            y_false = 1 - y_true

            start = time.perf_counter()
            d_true = euclidean_dist_transform(y_true, dtype=y_pred.dtype)
            d_false = euclidean_dist_transform(y_false, dtype=y_pred.dtype)
            end = time.perf_counter()
            print()
            print(end - start)

            distance = d_false * tf.cast(y_false, dtype=y_pred.dtype) - (d_true - 1.) * tf.cast(y_true, dtype=y_pred.dtype)
            distance = tf.where(has_true & has_false, distance, 0.)
            distance = tf.stop_gradient(distance)

            loss = y_pred * distance

    return tf.reduce_mean(loss, axis=-1)
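# For context, Equation (5) of the paper is (as I read it) the boundary loss
#   L_B(theta) = \int_\Omega phi_G(q) * s_theta(q) dq
# where phi_G is the signed distance to the ground-truth boundary and s_theta the network output;
# the `distance` tensor above is a discrete approximation of phi_G built from the two EDTs.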
data_x = np.random.uniform(size=(128, 768, 768, 1))
data_y = (np.random.uniform(size=(128, 768, 768, 1)) > 0.5).astype('int32')
data_y.sort(axis=1)
data_y.sort(axis=2)
data = tf.data.Dataset.from_tensor_slices((data_x, data_y))
data = data.batch(1)
data = data.repeat(5)
data = data.prefetch(tf.data.AUTOTUNE)
def test_with_loss(loss):
    model = models.Sequential([
        layers.Conv2D(64, 7, padding='same'),
        layers.BatchNormalization(),
        layers.Activation('relu'),
        layers.Conv2D(1, 3, padding='same', activation='sigmoid'),
    ])
    model.compile(
        optimizer='adam',
        loss=loss,
        run_eagerly=False
    )

    start = time.time()
    model.fit(data)
    print(time.time() - start)
# loss without EDT 33s total, 32s from log
# loss with EDT pinned on CPU 83s total, 59s from log
# loss with EDT pinned on GPU 1462s total, 1443s from log
tf.config.run_functions_eagerly(True)  # run eagerly so the per-step EDT timings print
tf.config.set_soft_device_placement(False)  # fail loudly instead of silently moving ops off the pinned device
#test_with_loss('binary_crossentropy') # without edt
test_with_loss(BoundarySparseCategoricalLoss(pin_cpu=True)) # with edt on cpu
test_with_loss(BoundarySparseCategoricalLoss(pin_cpu=False)) # with edt on gpu
I made a mistake; I should have traced it with nvprof. The kernel is really slow.
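For anyone who wants to reproduce the trace, a minimal sketch using the standard TensorFlow profiler API (the 'edt_profile' log directory name is just an example):

import tensorflow as tf
from tensorflow_addons.image import euclidean_dist_transform

mask = tf.cast(tf.random.uniform((1, 768, 768, 1)) > 0.5, tf.uint8)
tf.profiler.experimental.start('edt_profile')  # example logdir; open it in TensorBoard's trace viewer
with tf.device('/GPU:0'):
    euclidean_dist_transform(mask, dtype=tf.float32)
tf.profiler.experimental.stop()

Alternatively, nvprof --print-gpu-trace python repro.py (repro.py standing in for the script above) prints per-kernel times directly.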
We had another proposed implementation in:
https://github.com/tensorflow/tensorflow/issues/24410#issuecomment-811358508
Did you solve it?
I'm having the problem described here, but only on one of my computers. Basically, a laptop GPU (Quadro T1000) beats our server GPU (Quadro RTX 5000). On my Windows laptop it runs "smoothly" at 4 it/s using Python 3.10. On my Ubuntu 18 desktop it runs at 0.3 it/s using Python 3.9.
Both PCs run the same binary TensorFlow versions and the exact same training/evaluation code: tensorflow_gpu==2.7, tensorflow_addons==0.15.0 (I also tried tfa-nightly, v0.17.0 and v0.16.0).
I tried to debug using eager mode and found that EDT took 1.2 seconds for each image.
Currently I'm trying to build everything from source, but it still seems weird to me. I also tried removing the EDT parts, and suddenly it ran blazingly fast on the desktop. On both setups the GPU is utilized at almost 100% during training and evaluation.
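A minimal eager-mode timing snippet along these lines (a sketch; the shape matches the 768x768 inputs from the repro above, and the first call is a warm-up so one-time setup isn't measured) is enough to see the per-image cost:

import time
import numpy as np
import tensorflow as tf
from tensorflow_addons.image import euclidean_dist_transform

mask = (np.random.uniform(size=(1, 768, 768, 1)) > 0.5).astype('uint8')
euclidean_dist_transform(mask, dtype=tf.float32)  # warm-up call
start = time.perf_counter()
euclidean_dist_transform(mask, dtype=tf.float32)
print('seconds per image:', time.perf_counter() - start)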
TensorFlow Addons is transitioning to a minimal maintenance and release mode. New features will not be added to this repository. For more information, please see our public messaging on this decision: TensorFlow Addons Wind Down
Please consider sending feature requests / contributions to other repositories in the TF community with similar charters to TFA: Keras, Keras-CV, Keras-NLP
System information
Describe the bug
For the last 6 months I have used the tensorflow-addons EDT in different models (interactive segmentation, semantic segmentation, ...) and found it to be very slow. But simply moving EDT to the dataset level, or even pinning it to the CPU, gives a major performance improvement.
The link below contains one of my real use cases: boundary loss. It shows that running EDT on GPU is about 24 times slower than on CPU.
Another case of mine can't be published, but moving EDT from the model into dataset.map reduces the step time from 2 seconds to 0.5 seconds on an RTX 2080 Ti with batch size 1 and input size 768x768. A sketch of that workaround follows below.
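Roughly, the dataset-level version looks like this (a sketch; add_distance_maps and the tuple layout are illustrative, and it assumes per-sample HWC uint8 masks before batching):

import tensorflow as tf
from tensorflow_addons.image import euclidean_dist_transform

def add_distance_maps(image, mask):
    # Precompute both distance maps once per sample; tf.data ops run on the CPU.
    mask_u8 = tf.cast(mask, tf.uint8)
    d_true = euclidean_dist_transform(mask_u8, dtype=tf.float32)
    d_false = euclidean_dist_transform(1 - mask_u8, dtype=tf.float32)
    return image, (mask, d_true, d_false)

data = data.map(add_distance_maps, num_parallel_calls=tf.data.AUTOTUNE)

The loss then consumes the precomputed d_true/d_false instead of calling EDT on every training step.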
Code to reproduce the issue
https://colab.research.google.com/drive/1yS0EHLwV09scUY19f87Ow7DZb89jE89c?usp=sharing
Other info / logs
loss without EDT: 33s total, 32s from log
loss with EDT pinned on CPU: 83s total, 59s from log
loss with EDT pinned on GPU: 1462s total, 1443s from log