hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0

GPU runs slower than CPU #54

Closed Valentin4869 closed 7 years ago

Valentin4869 commented 7 years ago

When I run the logistic regression example, each epoch takes about 5 seconds on the GPU (Radeon RX 460), but only about 0.3 seconds on the CPU (i7 4770). My operating system is Ubuntu 16.04 LTS. Note that I'm running the code on the CPU using Python 2.7, while I use Python 3 when running on the GPU, since it doesn't work any other way. What could be making the GPU run so much slower?

hughperkins commented 7 years ago

Firstly, you'll note that the NVIDIA® CUDA™ times are about three times faster than Coriander's, not the 15 times faster you're seeing on the CPU: https://github.com/hughperkins/tf-coriander/blob/master/doc/execution_speed.md . So the question isn't really "Why is Coriander really slow?" (to be fair, you're not asking that, though it's somewhat implied). It's really "Why does this logistic regression model run really slowly on GPUs in general?", which, to be fair, is technically what you're asking.

And the answer is that logistic regression is essentially a linear model. Not much computation happens, so you basically spend all your time shuffling data back and forth between main memory and the GPU, in tiny batches. The CPU, by contrast, already has the data next to it, in main memory.

If we look at the details, all the computation is happening in the one single matrix multiply, https://github.com/hughperkins/TensorFlow-Examples/blob/enforce-gpu/examples/2_BasicModels/logistic_regression.py#L34 . What are the dimensions of that multiply? They are:

(100 x 784) * (784 x 10)

How many multiplications does that work out to? It is 100 * 784 * 10 multiplications, which is ~784k multiplications: not very many. You can increase this a bunch by increasing the batch size by a factor of 10, to 1000. You'll see the epoch time drop: on a K520, it goes from ~10 seconds per epoch down to 1.2 seconds per epoch. On the other hand, the accuracy kind of becomes junky.
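To make the arithmetic concrete, here is a minimal sketch of how the multiplication count per step changes with batch size. The helper names are made up for illustration, and the training-set size of 55000 is an assumption about the MNIST split the example uses, not something stated in this thread.

```python
def matmul_multiplies(m, k, n):
    """Scalar multiplications in an (m x k) * (k x n) matrix product."""
    return m * k * n


def steps_per_epoch(num_examples, batch_size):
    """Number of full batches per epoch (any remainder is dropped)."""
    return num_examples // batch_size


IMAGE_PIXELS = 784          # 28 x 28 input images, flattened
NUM_CLASSES = 10            # one output per digit
NUM_TRAIN_EXAMPLES = 55000  # assumption: size of the MNIST training split

for batch_size in (10, 100, 1000):
    per_step = matmul_multiplies(batch_size, IMAGE_PIXELS, NUM_CLASSES)
    steps = steps_per_epoch(NUM_TRAIN_EXAMPLES, batch_size)
    print(f"batch {batch_size:>4}: {per_step:>9,} multiplies per step, "
          f"{steps:>4} steps per epoch")
```

The total work per epoch stays roughly the same. What changes is how many separate trips to the GPU it is split across, which is probably where the time goes.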

What GPUs are really good at is nets where the ratio of computation to data is very high. I've seen this termed 'computational intensity', though I'm not sure if that's a technical term as such.
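One rough way to put a number on that ratio is to divide the multiplications by the number of values the operation reads and writes. The sketch below does that for a matrix multiply. The function name is hypothetical, and counting every input and output value as "data moved" is a simplifying assumption (in practice the weights may well stay on the GPU between steps).

```python
def matmul_intensity(m, k, n):
    """Multiplications per value touched for an (m x k) * (k x n) product."""
    multiplies = m * k * n
    # Values read (both inputs) plus values written (the output).
    values_touched = m * k + k * n + m * n
    return multiplies / values_touched


# The logistic regression multiply from this thread.
small = matmul_intensity(100, 784, 10)

# A large square multiply, for comparison (sizes chosen arbitrarily).
large = matmul_intensity(1024, 1024, 1024)

print(f"(100 x 784) * (784 x 10):      {small:.1f} multiplies per value")
print(f"(1024 x 1024) * (1024 x 1024): {large:.1f} multiplies per value")
```

The thin 784 x 10 weight matrix is what keeps the ratio low here: each input value only takes part in 10 multiplications.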

To what extent does this approximately answer your question?

Valentin4869 commented 7 years ago

Thanks. That answers my question.

hughperkins commented 7 years ago

Cool :-)

jasonm commented 7 years ago

Thanks for all your work putting tf-coriander together! Came here with a similar question, mostly to understand the CPU/memory bus/GPU-boundedness of my models & tf-coriander.

I'm looking at https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py (a text classifier with Embedding + a stack of Conv1D/MaxPooling1D + Dense) and seeing roughly:

CPU: 2.5 GHz Intel Core i7, Theano backend: 35 sec epochs

GPU: AMD Radeon R9 M370X 2048 MB, tf-coriander: 550 sec epochs

I reduced batch_size from 128 to 32 to fit into GPU RAM. The tf-coriander model also pins the CPU. Using the prebuilt MacOS Sierra whl.

Is this a helpful/expected anecdote? I realize I'm casually kicking the tires and appreciate you're probably very busy 😄 I appreciate any pointers you might provide so I can characterize/better understand this experience.

hughperkins commented 7 years ago

@jasonm

Ah, no, that's actually almost certainly because I haven't implemented conv yet. So anything convy is running on the CPU. That's unfortunate, since convolutions are pretty much the only/main thing that GPUs do well :-P . Matrix multiplications work well too, to be fair. But convolutions have an extremely high computational intensity.
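As a back-of-envelope illustration of that last point, the sketch below compares multiplications per value touched for a 1D convolution against the logistic regression multiply from earlier in this thread. The convolution shape (sequence length 1000, kernel width 5, 100 input channels, 128 filters) is a hypothetical one picked for illustration, not read from the model above, and the function names are made up.

```python
def conv1d_intensity(length, kernel, channels_in, channels_out):
    """Multiplications per value touched for one unpadded 1D convolution
    over a single sample."""
    length_out = length - kernel + 1
    multiplies = length_out * kernel * channels_in * channels_out
    # Input values, filter weights, and output values.
    values_touched = (length * channels_in
                      + kernel * channels_in * channels_out
                      + length_out * channels_out)
    return multiplies / values_touched


def matmul_intensity(m, k, n):
    """Multiplications per value touched for an (m x k) * (k x n) product."""
    return (m * k * n) / (m * k + k * n + m * n)


# Hypothetical conv layer shape, chosen for illustration only.
conv = conv1d_intensity(length=1000, kernel=5, channels_in=100, channels_out=128)
# The (100 x 784) * (784 x 10) multiply from the logistic regression example.
logreg = matmul_intensity(100, 784, 10)

print(f"conv1d: {conv:.1f} multiplies per value")
print(f"logreg: {logreg:.1f} multiplies per value")
```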

What do we need to do to enable conv? The NVIDIA® cuDNN API is actually partially implemented inside coriander, https://github.com/hughperkins/coriander-dnn. And I made a start at adding some plumbing to connect tensorflow with this implementation, i.e. https://github.com/hughperkins/tf-coriander/compare/adding-conv2 . It's actually kind of pure engineering drudgery, and I got a bit bored compared to the cool, interesting challenges of trying to get Thrust working, https://github.com/hughperkins/coriander/blob/multiple-walks/doc/walking.md . But conv really should be implemented in tf-coriander. If someone has a moment to help out with this plumbing operation, that'd be kind of neat :-)