microsoft / tensorflow-directml

Fork of TensorFlow accelerated by DirectML
Apache License 2.0
456 stars 32 forks

tensorflow/directml is slow compared to coremltools on MacOS #68

Open daverumph opened 3 years ago

daverumph commented 3 years ago

I'm trying to deploy a multi-model tool for mouse behavior classification on Linux, Windows & Mac. For Linux, I use TensorFlow 1.15 directly, with CUDA drivers to access the GPU(s). For Mac, I translate the models into .mlmodel files using coremltools. For Windows, I'm trying to use tensorflow-directml in order to easily utilize whatever GPU (Nvidia or AMD) is available. I'm finding that, on the same laptop with an AMD GPU (a MacBook Pro), the tf-directml version runs about 3x slower than the mlmodel version on macOS. Here are some stats:

| Model     | mlmodel   | tf-directml | Notes                         |
|-----------|-----------|-------------|-------------------------------|
| detection | 0.033 sec | 0.088 sec   | based on Inception-ResNet-v2  |
| pose      | 0.067 sec | 0.248 sec   | 8-stack hourglass             |

I realize I'm running a very early version. Do you expect the performance to improve substantially? Do you have a guess as to when we might see performance improvements?
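For anyone reproducing numbers like the table above, per-inference latency is easiest to compare across backends with a small timing harness. This is a generic sketch using only the standard library; `infer` is a hypothetical zero-argument callable standing in for whatever runs one forward pass (a `session.run` or `model.predict` wrapper), and the warm-up/run counts are arbitrary choices:

```python
import time
import statistics

def benchmark(infer, warmup=3, runs=20):
    """Return the median wall-clock time (seconds) of one call to `infer`.

    `infer` is any zero-argument callable that performs a single
    forward pass. Warm-up iterations are discarded so one-time costs
    (graph compilation, allocator growth) don't skew the median.
    """
    for _ in range(warmup):
        infer()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Usage with a dummy CPU workload standing in for a model:
median_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {median_s * 1e3:.2f} ms")
```

The median is reported rather than the mean because a single slow outlier (driver paging, thermal throttling) can otherwise dominate short runs.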

adtsai commented 3 years ago

Hi Dave, as you mentioned, we're still in a very early preview stage, and thus far we've been focused more on bringing up functionality and stability, which means we haven't had much opportunity to look at performance yet. As you've noticed, there's ample room for improvement! It's something we're aware of, and we do expect to make substantial strides in GPU performance in the future, although we don't yet have a concrete timeline for when that will become available. One thing that would help our profiling and performance testing is a look at the types of models you're using. You mentioned inception-resnet-v2 - is there a particular implementation you're using that's available elsewhere, e.g. on GitHub, that we could take a look at?

daverumph commented 3 years ago

Hi Adrian,

Thanks for your reply.

Our version of inception resnet v2 is our own, but it should have the same layers as published versions, including the one among the Keras models in the TensorFlow repository. We do add some input processing at the beginning and a detection head at the end. I'm attaching our version, which still uses the deprecated "slim" contrib package, as model_detection.py.
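The general shape of such a model can be sketched with the published Keras backbone. This is not the attached model_detection.py (which uses the deprecated slim API); the head layout, `num_classes`, and `num_anchors` below are all illustrative assumptions:

```python
import tensorflow as tf

def build_detector(input_shape=(299, 299, 3), num_classes=2, num_anchors=9):
    """Sketch: stock InceptionResNetV2 backbone + a hypothetical detection head."""
    # Published backbone from tf.keras.applications, without the classifier top.
    backbone = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights=None, input_shape=input_shape)
    x = backbone.output
    # One conv head per task: class scores and box offsets per anchor cell.
    cls = tf.keras.layers.Conv2D(
        num_anchors * num_classes, 3, padding="same", name="cls_head")(x)
    box = tf.keras.layers.Conv2D(
        num_anchors * 4, 3, padding="same", name="box_head")(x)
    return tf.keras.Model(backbone.input, [cls, box])

model = build_detector()
```

Profiling the stock backbone this way should exercise most of the same ops as a custom variant, since the slim and Keras implementations share the same layer structure.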

The other model that we need GPU acceleration for is a stacked hourglass heatmap model, which we implemented based on a published paper. I'm attaching our implementation of that model as model_pose.py. We currently use a stack of 8, but we have determined that accuracy, at least on mice, doesn't suffer much when reducing the stack size to 4.
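For profiling purposes, the stacked-hourglass architecture can be approximated with a minimal Keras sketch. This is not the attached model_pose.py; the filter counts, recursion depth, and keypoint count are illustrative assumptions, following the general design in the hourglass literature (downsample/upsample with skip connections, one heatmap head per stack, heatmaps fed back into the trunk):

```python
import tensorflow as tf
from tensorflow.keras import layers

def hourglass(x, depth=3, filters=64):
    """One hourglass module: recursive down/up path merged with a skip branch."""
    if depth == 0:
        return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    down = layers.MaxPooling2D(2)(x)
    down = layers.Conv2D(filters, 3, padding="same", activation="relu")(down)
    down = hourglass(down, depth - 1, filters)      # recurse to the bottleneck
    up = layers.UpSampling2D(2)(down)
    skip = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Add()([up, skip])

def stacked_hourglass(input_shape=(256, 256, 3), stacks=4, keypoints=12):
    """Sketch of a `stacks`-deep hourglass heatmap model (8 in the issue)."""
    inp = layers.Input(input_shape)
    x = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(inp)
    heatmaps = []
    for _ in range(stacks):
        x = hourglass(x)
        hm = layers.Conv2D(keypoints, 1)(x)              # per-stack heatmap head
        heatmaps.append(hm)
        # Intermediate supervision: project heatmaps back into the trunk.
        x = layers.Add()([x, layers.Conv2D(64, 1)(hm)])
    return tf.keras.Model(inp, heatmaps)

model = stacked_hourglass()
```

Because per-stack cost is roughly constant, halving the stack from 8 to 4 should roughly halve inference time, which matches the accuracy/speed trade-off described above.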

I hope this helps your acceleration efforts. Please let me know if there is anything else I can provide or do to help.

Regards

models.zip