keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

API DESIGN REVIEW multi-gpu-ratios #9155

Closed: jinkos closed this issue 6 years ago

jinkos commented 6 years ago

I am submitting the following design suggestion document...

API Design Review Document

see My GitHub Tutorial

Summary

A modified version of keras.utils.multi_gpu_model() that takes an extra parameter: a list of ratios denoting how the GPU load should be split, e.g.

multi_gpu_model(model, gpus=[0, 1], ratios=[4, 3]) will split the samples in each batch roughly in the ratio 4:3 between GPU:0 and GPU:1.
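A sketch of how the proposed call would look in a training script. The ratios argument is the suggested addition and does not exist in the released multi_gpu_model; the model and data below are toy placeholders:

import numpy as np
from keras import layers, models
from keras.utils import multi_gpu_model

model = models.Sequential([layers.Dense(10, input_shape=(100,))])

# Proposed: with ratios=[4, 3] and batch_size=64, GPU:0 would receive
# roughly 37 samples of each batch and GPU:1 roughly 27.
parallel_model = multi_gpu_model(model, gpus=[0, 1], ratios=[4, 3])
parallel_model.compile(optimizer='sgd', loss='mse')
parallel_model.fit(np.random.rand(1024, 100), np.random.rand(1024, 10),
                   batch_size=64)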

fchollet commented 6 years ago

Reposting what I posted on the mailing list thread, so other people can reply to it here:

I think this would be a bit of a niche functionality. It is generally a reasonable assumption that all GPU devices on a machine have the same capabilities.

jinkos commented 6 years ago

The problem for me was that I already had a GPU and wanted to buy a new one, so that I would have two. But manufacturers keep bringing out new, faster GPUs, so the one I bought was significantly faster than the one I had.

For people who can only afford to buy one new GPU occasionally, this is quite a big deal. It's a shame to be stuck at the speed of your slowest GPU.

I think for people building their own Linux boxes on a shoestring, this must be quite a common problem.

James

ahundt commented 6 years ago

I think this would be a bit of a niche functionality. It is generally a reasonable assumption that all GPU devices on a machine have the same capabilities.

I believe this change is valuable for a very important reason: GPUs are very expensive, and the proposed change better supports those who cannot afford to buy many of the same GPU.

I'm a grad student, and I bought one pre-owned GPU to get started with deep learning. Several months later, once I decided it was worth more investment, I bought a different pre-owned GPU with more memory.

Update 2018-01-25: I also know of several other people, collaborators on open source projects both inside and outside the US, who have multiple different GPUs in their machines.

ahundt commented 6 years ago

@jinkos could you also consider adding a StagingArea to your changes? I believe your proposed change + a StagingArea could make it possible to get a very substantial performance boost if you have two of the same or two different GPUs.

I started such a change at https://github.com/keras-team/keras/compare/master...ahundt:StagingArea but the dimensions are off and I haven't had the time to fix it.

jinkos commented 6 years ago

could you also consider adding a StagingArea to your changes? I believe your proposed change + a StagingArea could make it possible to get a very substantial performance boost if you have two different GPUs

StagingArea? TF adds to its bloated toolkit so often that I missed the whole StagingArea thing. I will experiment with your code for my current Kaggle competition and report back once I have properly got my head around it.

Happy to work on anything that makes things faster. If Keras can do for pipelining what it has done for modelling, it would be unstoppable. But that's an architecture-level thing, and I don't think it can be tackled with a tweak here and a tweak there.

Interfaces for pipelining seem to be changing faster than interfaces for modelling at the moment. Is there a Keras strategy or view on this? It seems crucial to me.

The 'n different GPUs' problem and the ratios solution amount to a trivial few lines of code; it's already written and quite well tested.

ahundt commented 6 years ago

@TimZaman knows about this intimately. He gave some useful details on another pull request I made a while ago, which you can see at https://github.com/keras-team/keras/pull/6928#issuecomment-313841732. Since the PR is so long, the comment doesn't always show up; you may have to click "View more" twice and then search for the username TimZaman. There are TensorBoard screenshots there.

TimZaman commented 6 years ago

Fixing skewed GPU ratios.

First response: Don't fix this. Make sure your GPUs are aligned.

Nuanced response: Use https://github.com/uber/horovod/tree/master/horovod to distribute Keras over multiple GPUs; it's also faster than what's in Keras itself, and easy to set up. Then, per process (so per GPU), you give it a different batch size to fix your GPU muscle misalignment.
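Something along these lines, a sketch based on the horovod.keras API of that era. The batch sizes are illustrative, the model and data are placeholders, and you would launch it with e.g. mpirun -np 2 python train.py:

import numpy as np
import tensorflow as tf
import keras
import keras.backend as K
import horovod.keras as hvd

hvd.init()

# Pin each process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))

# Skew the per-process batch size to match each GPU's speed (4:3 here).
batch_size = [48, 36][hvd.local_rank()]

model = keras.models.Sequential([keras.layers.Dense(10, input_shape=(8,))])
# Wrap the optimizer so gradients are averaged across processes.
opt = hvd.DistributedOptimizer(keras.optimizers.SGD(0.01))
model.compile(optimizer=opt, loss='mse')

# Keep all workers' weights in sync from the start.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(np.random.rand(1024, 8), np.random.rand(1024, 10),
          batch_size=batch_size, callbacks=callbacks)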

Optimal graph

Keras puts the user first, with a proper tradeoff against speed. If you deeply care about perf, use tf.keras instead (e.g. faster bias adds and batch-norm ops). Also, your data pipeline should be in pure TF for optimal perf. Provided you have an optimal graph for your model and an optimal graph for your data input, create a tf.StagingArea to connect the two. Put that area on the GPU explicitly; that way the model (running on the GPU) doesn't have to wait for CPU-GPU transfers. What you should do here: before step 1, put one batch in the buffer; then with every step, take one batch from the buffer (your model is connected to this) and put one in too. Putting something in the buffer can be done by adding **kwargs to your fit() that get passed on into the tensorflow_backend Function, so that the "put" op is run with each step.
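A minimal sketch of that double-buffering pattern, using the raw TF 1.x session API rather than fit(); the dataset and model here are toy stand-ins for an optimized pipeline:

import numpy as np
import tensorflow as tf

# Toy input pipeline on the CPU, standing in for a real pure-TF data pipe.
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(1024, 8).astype('float32'),
     np.random.rand(1024, 1).astype('float32'))).batch(32).repeat()
next_x, next_y = dataset.make_one_shot_iterator().get_next()

with tf.device('/gpu:0'):
    # Staging buffer placed explicitly on the GPU, so the model reads
    # from GPU memory instead of waiting on a CPU->GPU transfer.
    area = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.float32])
    put_op = area.put([next_x, next_y])
    x, y = area.get()
    # Staging drops static shape info; restore it for the Dense layer.
    x.set_shape([None, 8])
    y.set_shape([None, 1])
    loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(put_op)                  # before step 1: prefill one batch
    for _ in range(100):
        sess.run([train_op, put_op])  # each step: train and refill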

@ahundt coming to GTC?

fchollet commented 6 years ago

this is also faster than what's in Keras itself

What can we do to improve multi_gpu_model in Keras, especially performance for small models? This is an outstanding item in our "requests for contributions" list.

TimZaman commented 6 years ago

@fchollet IIRC multi_gpu_model merges when it gets to the loss function, instead of computing the loss in a model-parallel way. Furthermore, the distinct processes used by Horovod mean you don't have to optimize [or multiprocess] your data pipeline as much as with vanilla Keras, even if you have a homebrew NumPy pipeline.

Another problem is that for multi-GPU, the StagingArea won't work as well, since a StagingArea should ideally sit on the GPU: you need one StagingArea per GPU. And since multi_gpu_model does the split for you, you cannot split anything over the GPUs before you enter multi_gpu_model. The best one could do is add a custom layer containing the tf.StagingArea, so that this custom layer lands on each GPU. Which might not be a bad idea at all, I realize as I write this.
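A rough sketch of that custom-layer idea. The StagingLayer name is hypothetical, and the collected put ops would still need to be prefilled once and then fetched with every training step (e.g. via the backend Function trick above) for the buffering to pay off:

import tensorflow as tf
from keras.engine.topology import Layer

class StagingLayer(Layer):
    """Buffers its input in a tf.StagingArea on whatever device the layer
    lands on; multi_gpu_model replicates the layer under each gpu:i scope,
    so every replica gets its own on-GPU buffer."""

    def __init__(self, **kwargs):
        super(StagingLayer, self).__init__(**kwargs)
        self.put_ops = []

    def call(self, inputs):
        area = tf.contrib.staging.StagingArea(dtypes=[inputs.dtype])
        # One put op per replica; each must be prefilled once and then
        # run alongside every train step for get() to stay one ahead.
        self.put_ops.append(area.put([inputs]))
        staged = area.get()
        # Depending on the TF version, get() may return the tensor
        # directly or a one-element list.
        staged = staged[0] if isinstance(staged, (list, tuple)) else staged
        staged.set_shape(inputs.shape)  # staging drops static shape info
        return staged

    def compute_output_shape(self, input_shape):
        return input_shape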

@fchollet are you doing a book signing at GTC?

ahundt commented 6 years ago

Don't fix this. Make sure your GPUs are aligned.

By aligned do you mean the identical model?

If you deeply care about perf, use tf.keras instead (i.e. faster bias adds, batch norm ops).

I'll give it another try when TF 1.5 is released; last time I tried tf.keras it choked on import tf.keras.backend as K, and I was too short on time to debug.

@ahundt coming to GTC?

It sounds great but I don't think I can get funding for it.

TimZaman commented 6 years ago

Don't fix this. Make sure your GPUs are aligned.

By aligned do you mean the identical model?

I mean: don't mix different gpu types in one system.

fchollet commented 6 years ago

@ahundt you probably want

from tensorflow import keras
K = keras.backend

fchollet commented 6 years ago

are you doing a book signing at GTC?

This was in the plans but I haven't had any update on it for a while. Maybe?

fchollet commented 6 years ago

Closing since we won't implement this API change.

jinkos commented 6 years ago

Thanks for indulging me; it's been a great discussion.


ahundt commented 6 years ago

I mean: don't mix different gpu types in one system.

Too late, but so far together they have certainly been faster than one 👍. Prices are too high for me to do anything differently at the moment; thanks, Bitcoin. :-)

ozabluda commented 6 years ago

@TimZaman If you deeply care about perf, use tf.keras instead (i.e. faster bias adds, batch norm ops).

Why are those faster in tf.keras?