keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

[WIP] Correct data-parallel SGD implementation in Keras #7515

Closed bzamecnik closed 3 years ago

bzamecnik commented 7 years ago

The goal is to implement data-parallel multi-GPU training with gradient averaging properly in Keras (at least explicitly for the TensorFlow backend).

In this issue I'd like to discuss a particular approach that tries to fix the problems of current solutions. Since Keras does not seem to be designed for data-parallel SGD, I'm trying to find ways to modify or adapt the Keras code while keeping the API philosophy. Since this problem is quite important for many people, including our team at @rossumai, I'd like to ask for advice. Any feedback is really welcome.

Quick outline of the data-parallel SGD algorithm

We use N identical model replicas (towers) to train on slices of a mini-batch. Model parameters are placed on a parameter server (PS) device - the CPU or one of the GPUs - while computations are made on N worker devices. A mini-batch of inputs is split into N sub-batches and distributed to the workers; each worker computes the forward and backward pass, the resulting gradients are sent to the PS device, averaged, and used to update the weights, which are then copied back to the workers.
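To make the outline concrete, here is a minimal TensorFlow-only sketch of this tower/gradient-averaging scheme (the build_loss helper, the sub_batches argument and the device strings are hypothetical placeholders, not part of any existing API):

import tensorflow as tf

def average_tower_gradients(build_loss, sub_batches, ps_device='/cpu:0'):
    """Compute per-tower gradients and average them on the PS device."""
    tower_grads = []
    for i, sub_batch in enumerate(sub_batches):
        with tf.device('/gpu:%d' % i):
            # each tower computes loss and gradients on its own sub-batch
            loss = build_loss(sub_batch)
            grads = tf.gradients(loss, tf.trainable_variables())
            tower_grads.append(grads)
    with tf.device(ps_device):
        # synchronization point: average the gradients across towers
        return [tf.add_n(list(grads)) / len(tower_grads)
                for grads in zip(*tower_grads)]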

Previous experiments

As a baseline I checked the TensorFlow CIFAR 10 multi-GPU tutorial. It worked as expected for 1-4 GPUs (TensorFlow 1.2.1, CUDA 8.0, GTX 1070).

I tried the approach of kuza55/keras-extras, discussed earlier in other issues (#2436) and in the blog post Transparent Multi-GPU Training on TensorFlow with Keras, adapting the MNIST MLP and CIFAR10 Keras examples (Keras 2.0.6, TensorFlow 1.2.1, CUDA 8.0). In practice, using more than one GPU led to a decrease in performance. Between 2 and 4 GPUs there was a 2x speedup, however.

https://gist.github.com/bzamecnik/390d3802b31ce766fbf6fd6c9fd682d3

Problems with kuza55/keras-extras

After examining the graph in TensorBoard I discovered a problem in this approach: gradients are not computed in parallel on each device, but as a whole on the parameter server device. Indeed, each worker computes only the predictions, which are gathered on the PS and concatenated; the loss and gradients are computed there. Another potential problem is that the whole mini-batch is fed to each device, which then takes only its slice. We waste our precious IO bandwidth.

Proposed fixes

Proposed implementation

Parallel losses and gradients - DataParallelOptimizer

Since the Keras API is at present not directly suitable for data-parallel SGD computation, as a first step towards a working prototype we can make different implementations of Optimizer and Model, let's say DataParallelOptimizer and DataParallelModel.

We need to compute the loss and gradients in parallel. The tensor for the loss is created by Keras within Model.compile() and stored as Model.total_loss. Gradients are computed in Optimizer.get_gradients(), which is called from the lazily created functions Model.{train,test,predict}_function() (called from fit(), etc.). get_gradients() accepts a single loss tensor, and the various optimizers then compute updates based on the gradients. The problem is thus the single loss tensor (which can only be placed on one device) passed into Optimizer.get_updates().

So far the only way I see is to change Model.total_loss from a single tensor into a list of tensors, each of which can be placed on a different device. A DataParallelOptimizer wrapper class can derive from Optimizer and override get_gradients() to accept the loss as a list of tensors (one per replica) and average the gradients computed from them. This would be the synchronization point for the GPU workers. The get_updates() function (implemented by any of the wrapped optimizers) just calls get_gradients(). Note that thanks to the option collocate_gradients_with_ops=True in the TF implementation of K.gradients(), the gradient ops would automatically be placed on the same device as the loss ops, even though it's called outside compile() and the device scope. (TODO: issue link)
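A rough sketch of what such a wrapper might look like against the Keras 2.0.x Optimizer interface (an illustration of the idea only, not the actual implementation from the repo linked below; the get_updates() signature of Keras 2.0.x is assumed):

from keras import backend as K
from keras.optimizers import Optimizer

class DataParallelOptimizer(Optimizer):
    """Wraps a plain Keras optimizer and averages gradients over replica losses."""

    def __init__(self, optimizer, **kwargs):
        super(DataParallelOptimizer, self).__init__(**kwargs)
        self.optimizer = optimizer

    def get_gradients(self, loss, params):
        # `loss` is a list of per-replica loss tensors; thanks to
        # colocate_gradients_with_ops=True each K.gradients() call keeps
        # its gradient ops on the replica's device
        grads_per_replica = [K.gradients(l, params) for l in loss]
        # average the gradients across replicas (the synchronization point)
        return [sum(grads) / len(grads) for grads in zip(*grads_per_replica)]

    def get_updates(self, params, constraints, loss):
        # let the wrapped optimizer build its update ops, but routed through
        # our get_gradients()
        self.optimizer.get_gradients = self.get_gradients
        return self.optimizer.get_updates(params, constraints, loss)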

Model replication and feeding data - DataParallelModel

We need a Model which contains the replicas (towers) and provides the list of losses to the DataParallelOptimizer. We would adapt the code in the make_parallel() function from kuza55/keras-extras. The DataParallelModel would take an instance of the basic model via its constructor. Similarly to make_parallel(), it would make N replicas of this model placed on different devices. We could try to set TF variable reuse after the first replica. We also make a merged model, which concatenates the outputs, and use it for the actual training. Rather than slicing duplicated inputs, we can pass the sub-batch inputs separately and route them to each replica directly.

It would then override compile(), which has to call compile() on each replica (and on the merged model) - in order to place the losses and gradients - and gather the total_loss operation from all replicas. In compile() we also wrap the provided optimizer with DataParallelOptimizer and inject both the total_loss list and the DataParallelOptimizer instance into the merged model. The rest of the methods of DataParallelModel will be proxied to the merged model.
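In pseudo-Keras terms, the compile() override could look roughly like this (a sketch only, reusing the DataParallelOptimizer sketched above; the constructor arguments and the replica-building code are assumed to exist elsewhere):

class DataParallelModel(object):
    def __init__(self, replicas, merged_model):
        self.replicas = replicas          # one tower per GPU device
        self.merged_model = merged_model  # concatenated outputs, used for training

    def compile(self, optimizer, loss, **kwargs):
        # compile each replica so its loss/gradient ops land on its own device
        for replica in self.replicas:
            replica.compile(optimizer, loss, **kwargs)
        self.merged_model.compile(DataParallelOptimizer(optimizer), loss, **kwargs)
        # inject the list of per-replica losses for the optimizer to average
        self.merged_model.total_loss = [r.total_loss for r in self.replicas]

    def __getattr__(self, name):
        # proxy everything else (fit(), evaluate(), ...) to the merged model
        return getattr(self.merged_model, name)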

In case we want to avoid slicing the inputs, we could change the inputs in {train,test,predict}_function() and perform the slicing in the *_on_batch() functions.

Code

https://gist.github.com/bzamecnik/92607207af912ae53dd2aa557631b977

https://github.com/rossumai/keras-multi-gpu

I have prepared an implementation of DataParallelOptimizer and I'm working on DataParallelModel. The mechanism of the latter is not as clear at the moment. In the first stage I'd like to make a working prototype, then run experiments to show that the model produces correct results and that we obtain a benefit from scaling to multiple GPUs. Next I wish to make the API cleaner. So far I think the code might stay separate from Keras, since it will depend on TensorFlow explicitly and I'm not sure about Theano support.

If you have read this rather long post, I'd like to kindly ask for advice on whether you think this approach is feasible or whether you see any problems with it. Any feedback is really welcome. Thanks!

/cc @rossumai

pengpaiSH commented 7 years ago

@bzamecnik Thank you for your contribution!!! Finally, someone is tackling this problem explicitly. As a big, big, big fan of Keras, multi-GPU training has been a concern of mine for a long time. I have been hoping that some API such as train_distributed() will be supported in the near future. With that, we would only need to specify the GPU IDs if necessary, without changing most of the code, i.e. gradient averaging would be computed in the backend.

bzamecnik commented 7 years ago

I've updated the gists to fix some obvious bugs in the sketch code and added an application to the MNIST CNN example. There is still a problem: just putting a list of tensors into Model.total_loss is not safe, since the code in Model._make_*_function() expects a Tensor and wraps it in a list:

    def _make_test_function(self):
        # ...
            self.test_function = K.function(inputs,
                                            [self.total_loss] + self.metrics_tensors,

The second argument (outputs) of K.function ends up in tf.control_dependencies(), which expects a list of Tensors but gets a nested list.

A solution might be to also modify _make_*_function() to flatten the list of outputs passed to K.function.
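For illustration, the flattening could be as simple as this sketch (the _flatten helper is hypothetical, not existing Keras code):

def _flatten(outputs):
    """Flatten one level of nesting, e.g. [[loss_0, loss_1], metric] -> [loss_0, loss_1, metric]."""
    flat = []
    for o in outputs:
        if isinstance(o, (list, tuple)):
            flat.extend(o)
        else:
            flat.append(o)
    return flat

# inside _make_test_function():
# outputs = _flatten([self.total_loss] + self.metrics_tensors)
# self.test_function = K.function(inputs, outputs, updates=self.state_updates)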

Veitch-Li commented 7 years ago

I tried to run the experiment and ran into a problem. Can anyone help me?

Caused by op u'replica_1_1/model_1_sample_weights', defined at:
  File "mnist_cnn_data_parallel_example.py", line 94, in <module>
    train_multi_gpu()
  File "mnist_cnn_data_parallel_example.py", line 81, in train_multi_gpu
    metrics=['accuracy'])
  File "/nccl/data_parallel_model.py", line 132, in compile
    replica.compile(optimizer, loss, metrics, loss_weights)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 869, in compile
    name=name + '_sample_weights'))
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 391, in placeholder
    x = tf.placeholder(dtype, shape=shape, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1522, in placeholder
    return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2021, in _placeholder
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Shape [-1] has negative dimensions [[Node: replica_1_1/model_1_sample_weights = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

bzamecnik commented 7 years ago

@Veitch-Li Yes, so far it fails with this error. I'll try to figure out the cause and fix it now, since I didn't have time in the last few days. Also, I moved the code from a gist to a separate repo and created an issue for this bug: https://github.com/rossumai/keras-multi-gpu/issues/1.

bzamecnik commented 7 years ago

@Veitch-Li I've found the cause of this error (https://github.com/rossumai/keras-multi-gpu/issues/1#issuecomment-321832736). The problem is that some placeholders in the replica models are not being fed.

bzamecnik commented 7 years ago

Note: just a few days ago an example of a data-parallel multi-GPU Keras fork based on MXNet, which provides good scaling, was published: https://devblogs.nvidia.com/parallelforall/scaling-keras-training-multiple-gpus/

waleedka commented 7 years ago

@bzamecnik You mentioned that an issue in the previous approach was that "gradients are not computed in parallel on each device". Is that speculation or confirmed? TF tries to put gradient ops on the same device as their corresponding ops, so while the loss is computed on the CPU as you stated, I expect the gradients to be computed on the GPUs in parallel. Am I missing something?

bzamecnik commented 7 years ago

@waleedka IIRC TensorBoard showed all gradient ops on the same device. Although gradient collocation is enabled in K.gradients(), there was just a single loss operation, so the gradients were collocated with it on one device. To overcome that we need separate losses (and thus gradients) located on each replica device.
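For reference, the collocation behavior discussed here comes from the TF backend implementing K.gradients() via tf.gradients(..., colocate_gradients_with_ops=True); a minimal standalone illustration (toy variables, TF 1.x API):

import tensorflow as tf

x = tf.Variable([1.0, 2.0])
loss = tf.reduce_sum(tf.square(x))   # stands in for a single Keras loss op
# gradient ops are collocated with the forward ops they differentiate, so a
# single loss op on one device pulls all its gradient ops onto that device
grads = tf.gradients(loss, [x], colocate_gradients_with_ops=True)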

waleedka commented 7 years ago

Interesting! I got a different result. My implementation is slightly different from kuza55's but uses the same principle, concatenating the loss on the CPU. TensorBoard shows that the gradients are distributed across different GPUs. Screenshot below (colors indicate devices).

[screenshot: TensorBoard graph, 2017-08-17]

bzamecnik commented 7 years ago

@waleedka Wow, your result looks good. In my experiments with the kuza55 code and similar, I got gradients only on the parameter server device. Gradients were only distributed in the TF CIFAR10 example. So I dug deeper into modifying Keras to compute the replica losses separately, but so far I'm still facing problems. Is your implementation published somewhere, so that we could compare and see what the difference is? Do you observe linear-like scaling without a huge drop after the 2nd GPU?

ppwwyyxx commented 7 years ago

> We use N identical model replicas (towers) to train on slices of a mini-batch. Model parameters are placed on a parameter server device (PS), CPU or one of the GPUs,

The PS doesn't have to be the CPU or one of the GPUs. From my experience, placing the variables on ALL GPUs is usually faster.

bzamecnik commented 7 years ago

I tried to run the code with make_parallel() again on 4 GPUs, and now, to my surprise, the gradients seem to be evenly distributed in TensorBoard (even after digging deeper into the graph). Still, the computation is really slow. So it means that we should not focus on distributing the gradients in some other way, but rather figure out what really makes the computation slow.

[image: TensorBoard graph screenshot]

# measurement after 4th epoch, when it stabilizes

# PS: gpu:0
# gpu_count=4, batch_size=256, (layers=5 x 1024), 5M params -> 41 s / epoch

# Gradients seem to be evenly distributed!

from __future__ import print_function

import keras
from keras.datasets import mnist
from keras.models import Model
from keras.layers import Dense, Dropout, Input, Lambda
from keras.layers.merge import concatenate
from keras.optimizers import RMSprop
from keras import backend as K
import os
import tensorflow as tf

# sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess = tf.Session()
K.set_session(sess)

gpu_count = len([dev for dev in os.environ.get('CUDA_VISIBLE_DEVICES', '').split(',') if len(dev.strip()) > 0])

batch_size = 256
layer_count = 5
layer_width = 1024
num_classes = 10
epochs = 20

def make_parallel(model, gpu_count):
    def get_slice(data, idx, parts):
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = []
    for i in range(len(model.outputs)):
        outputs_all.append([])

    #Place a copy of the model on each GPU, each getting a slice of the batch
    for i in range(gpu_count):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:

                inputs = []
                #Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx':i,'parts':gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                #Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # merge outputs on the PS device (gpu:0 here)
    with tf.device('/gpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(concatenate(outputs, axis=0))

        return Model(inputs=model.inputs, outputs=merged)

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

def basic_model():
    input = Input(shape=(784,))
    x = input
    for i in range(layer_count):
        x = Dense(layer_width, activation='relu')(x)
        x = Dropout(0.2)(x)
    output = Dense(10, activation='softmax')(x)

    model = Model(inputs=input, outputs=output)

    print('Single tower model:')
    model.summary()
    return model

tensorboard_dir = './tensorboard-logs/mnist_mlp_multi/%d-gpu_%s' \
    % (gpu_count, os.environ.get('CUDA_VISIBLE_DEVICES', ''))

with tf.device('/gpu:0'):
    if gpu_count > 1:
        tower = basic_model()
        model = make_parallel(tower, gpu_count)
        print('Multi-GPU model:')
        model.summary()
    else:
        model = basic_model()

    model.compile(loss='categorical_crossentropy',
                  optimizer=RMSprop(),
                  metrics=['accuracy'])

    summary_writer = tf.summary.FileWriter(tensorboard_dir, sess.graph)
    summary_writer.flush()

    tensorboard_cb = keras.callbacks.TensorBoard(log_dir=tensorboard_dir)
    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        verbose=1,
                        validation_data=(x_test, y_test),
                        callbacks=[tensorboard_cb])
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python mnist_mlp_func_multigpu.py
waleedka commented 7 years ago

@bzamecnik When you say it's slow, what are you comparing it to? It's normal for per-epoch time to be longer for multi-GPU than 1 GPU due to the additional overhead of splitting the data and synchronizing the gradient updates. You end up with slower batches, but more images per batch.

bzamecnik commented 7 years ago

@waleedka If an epoch (defined as one pass over the complete dataset) is slower on multiple GPUs than on one GPU, then there's no point in using more GPUs. The ideal expected result is that the 1-GPU epoch time would be divided by the number of GPUs. Of course some small overhead is expected, but not so big that the total multi-GPU epoch time would exceed the 1-GPU epoch time. In case the epoch is defined as a fixed number of batches (e.g. drawn from an infinite generator), then you're correct that such an epoch would take longer, and the metric to watch would be the number of images per second (aggregated e.g. over one or more batches).

I tried to profile my testing model with nvprof and found that a 5-layer MLP with 1024 units per layer on MNIST is simply dominated by communication (~95% of the time) instead of computation. So it's quite a bad testing model.

When I compared make_parallel() on CIFAR10 with an augmentation generator against the TF CIFAR10 multi-GPU example, computation dominates. But the make_parallel() model is still significantly slower per epoch in the multi-GPU case, while the TF model is fine.

This leads me to a hypothesis that the data loading phase might be the bottleneck.

waleedka commented 7 years ago

@bzamecnik You're right, I meant an epoch of a fixed number of steps. Okay, we're on the same page. We expect each single mini-batch step to be slightly slower but to include more samples.

Your nvprof results are interesting. I'll try to run the same test on my model if I get the chance.

Here is one thing you might want to try. I noticed that you're using gpu:0 as the parameter server. I could be wrong on this, but I think TF passes everything through the CPU (i.e. no direct GPU-GPU communication yet). So data from gpu:3, for example, goes to the CPU first and then to gpu:0, and vice versa to replicate the updates from the PS to the other GPUs. Keeping the PS on the CPU might cut down on communication. This is just speculation, but I think it's worth a try.
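In terms of the make_parallel() code pasted above, this suggestion roughly amounts to moving the merge ops onto the CPU (only the changed block is shown; outputs_all and model come from the surrounding function):

    # inside make_parallel(): merge the outputs on the CPU instead of gpu:0,
    # keeping the parameter server off the GPUs
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            merged.append(concatenate(outputs, axis=0))
        parallel_model = Model(inputs=model.inputs, outputs=merged)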

bzamecnik commented 7 years ago

@waleedka Thanks. This argument for cpu:0 as the PS sounds interesting; I didn't think of it that way. Instead I thought that when the params are on a GPU, the broadcast might be faster thanks to direct inter-GPU communication. At least we can see that kind of operation in nvprof (CUDA memcpy PtoP). Still, this model is not suitable for a multi-GPU setup, since it's dominated by inter-GPU communication.

# PS: gpu:0
$ CUDA_VISIBLE_DEVICES=1,2 /usr/local/cuda/bin/nvprof python mnist_mlp_func_multigpu.py 

==10408== Profiling application: python mnist_mlp_func_multigpu.py
==10408== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 72.05%  39.0289s     20709  1.8846ms  1.0880us  7.2779ms  [CUDA memcpy PtoP]
  9.76%  5.28512s      7101  744.28us     960ns  5.2840ms  [CUDA memcpy HtoD]
  3.05%  1.65180s      5850  282.36us  15.744us  4.5643ms  maxwell_sgemm_128x128_raggedMn_tn_splitK
  2.05%  1.11265s      6850  162.43us  118.15us  742.71us  maxwell_sgemm_128x128_raggedMn_nn_splitK
  1.53%  827.20ms      5900  140.20us  30.369us  2.8265ms  sgemm_128x128x8_TN_vec

# PS: cpu:0
$ CUDA_VISIBLE_DEVICES=1,2 /usr/local/cuda/bin/nvprof python mnist_mlp_func_multigpu.py

==8966== Profiling application: python mnist_mlp_func_multigpu.py
==8966== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 61.85%  52.2200s     27346  1.9096ms  1.1840us  5.3999ms  [CUDA memcpy HtoD]
 33.98%  28.6892s     16166  1.7747ms     992ns  5.1311ms  [CUDA memcpy DtoH]
  1.05%  883.48ms      6850  128.98us  95.844us  166.70us  maxwell_sgemm_128x128_raggedMn_nn_splitK
  0.81%  687.45ms      5850  117.51us  13.408us  815.94us  maxwell_sgemm_128x128_raggedMn_tn_splitK
  0.72%  610.64ms      5900  103.50us  25.536us  490.04us  sgemm_128x128x8_TN_vec
bzamecnik commented 7 years ago

As for the peer-to-peer transfer, this is what NVIDIA says about peer-to-peer memcopy:

[slide image: NVIDIA "Multi-GPU Programming", slide 9]

winwinJJiang commented 7 years ago

Hi, thanks very much for your hard work.

I used make_parallel(), but I cannot load the saved model/weights. I see that's because the parallel model is saved differently from the basic model. Do you have any idea how to use make_parallel() while saving the basic model?

Thanks

winwinJJiang commented 7 years ago

Hi, when I run your optimization code I get the error 'get_updates() takes exactly 4 arguments (3 given)'. Do you know what happened?

bzamecnik commented 7 years ago

UPDATE: Work in progress on a very comprehensive article about data-parallel training in Keras: https://github.com/rossumai/keras-multi-gpu/tree/master/blog/docs.

TL;DR: we first need to feed data asynchronously via queues in order to make it scale well.

Also, news from 2017-10-13 - fchollet added a cleaned-up version of the kuza55 code to main Keras (keras.utils.multi_gpu_model()).

gaotianxiang commented 6 years ago

@bzamecnik Do you have the latest solution? I have tried the code and it still does not work. The efficiency of an epoch (going over the whole dataset) dropped when I use multiple GPUs.

AmirAlavi commented 6 years ago

Thanks for the work on this subject. @bzamecnik, does this mean that the current multi_gpu_model() in utils (added in 2.0.9) suffers from the issues above?

If so, is the best use of multiple GPUs to first figure out the maximum batch size you can fit on one of your GPUs, and then, once you have that number, use multi_gpu_model() with a new batch_size = N_gpus * max_batch_size?

bzamecnik commented 6 years ago

The current situation (as of 2017-10) has been described in our blog article Towards Efficient Multi-GPU Training in Keras with TensorFlow.

@AmirAlavi Thanks. The thing is that keras.utils.multi_gpu_model() works: in some cases it is able to achieve a good speedup (high PCIe bandwidth, small inputs), in other cases not (when the time to transfer the input data is significant compared to the computation time). In particular, there's no problem with gradients not being distributed (as mentioned in this original issue). The remaining problem is the lack of asynchronous feeding of input data into GPU memory. I've made some prototypes of integrating StagingArea with Keras that are able to do async feeding to one GPU, but multi-GPU support is not done yet. On the other hand, you can try the tensorpack library, which handles data feeding well at the expense of a more complicated API.

As for the batch size, first determine a good batch size for one GPU - ideally the maximum power of two that fits into memory. In general the learning rate should scale with the batch size; if it's rather high (~1024), it might be necessary to use techniques like warmup. A link to the respective paper is in the blog article. Then you can put this batch size on each GPU, i.e. total_batch_size = gpu_count * batch_size_per_gpu. Note that the usable batch size is not affected so much by the input data fitting into GPU memory, but rather by the feature maps (e.g. for convolutional layers). I hope that answers your question.
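A worked example of this rule (the numbers and the linear learning-rate scaling are illustrative only, following the warmup/scaling paper referenced in the blog article):

gpu_count = 4
batch_size_per_gpu = 256                              # tuned on a single GPU first
total_batch_size = gpu_count * batch_size_per_gpu     # = 1024

# linear learning-rate scaling plus warmup for such a large total batch
# (base_lr is a made-up value for illustration)
base_lr = 0.001
scaled_lr = base_lr * gpu_count                       # = 0.004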

AmirAlavi commented 6 years ago

@bzamecnik thank you for the thorough response! I am working on a project which is implemented in Keras but had been using some Theano-specific tensor operations. I just recently changed the code to be backend-agnostic, allowing me to use TensorFlow, and I was excited to try multi-GPU (up until this point I've been running on a compute node that has 4 GPUs, but only able to use 1 GPU at a time).

However, I haven't seen any speedup, and in fact have seen very large decreases in performance. From your summary, I guess the only explanation is that the time to transfer the data is too high?

For context, I'm not doing CNNs. I'm using gene expression data (vectors of length ~20,000) and the architectures I'm training are Siamese Neural Networks, which means that they look at pairs of data. In effect, the input is two 20,000 dimensional vectors. The number of pairs one can generate is combinatorial, so it is large, and I was hoping to reduce the time per epoch by splitting this work among GPUs.

Does this sound like a case where the tensorpack library you mentioned above would help? I'm a bit concerned about the engineering involved in getting MPI working. The devices we have are 4 x GeForce GTX 1080.

bzamecnik commented 6 years ago

@AmirAlavi Yeah, it's quite possible that your PCIe bandwidth is low. This decrease in performance with multiple GPUs on slow PCIe links is what we observed on one machine. tf_cnn_benchmark with StagingArea (async GPU feeding) scaled better (although not perfectly) in such a case. You can measure the bandwidth (roughly) with some benchmark code from the CUDA Samples (example). Ideally each GPU would have its own 16x PCIe slot. On my machine it showed ~13 GB/s. Slow cards were at 1x (800 MB/s).

The second thing is the size of the inputs. One of your samples (a pair) in float32 would take ~160 kB, which is 3-7x smaller than an ImageNet sample. But I don't know what your batch size is.
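For reference, the ~160 kB figure is just a back-of-the-envelope calculation:

# size of one sample (a pair of gene expression vectors) in float32
pair_size_bytes = 2 * 20000 * 4   # two float32 vectors of length ~20,000
print(pair_size_bytes)            # 160000 bytes, i.e. ~160 kB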

In any case, I'd recommend running the training script under nvprof (for around a minute) and analyzing the output with the NVIDIA Visual Profiler (NVVP). There you would see whether the GPU is significantly waiting for data, what the compute utilization is, the time spent on data transfer vs. computation, etc.

After writing the mentioned blog post, I released a small tool for dealing with the outputs from nvprof (an SQLite database): nvprof-tools. In particular, it's able to strip out a bulk of irrelevant events, show the total time, and slice out a time interval, thus reducing the file size dramatically. This helps NVVP not to be overwhelmed and to run much more smoothly.

Another thing you can try is to run the computation on a cloud instance which can have higher CPU-GPU bandwidth.

ahundt commented 6 years ago

@bzamecnik Thanks for this work! Are the keras_tf_multigpu callbacks still current?

In my case I already have input tensors from tf.RecordInput, can those callbacks be easily adapted to my use case?

bzamecnik commented 6 years ago

@ahundt Thanks. Sorry, I haven't had much time to continue working on this problem. When I was writing this proof-of-concept loader callback, it was something missing in Keras. Unfortunately it's still made for a single GPU and it doesn't work for the validation split (since Keras doesn't call callbacks there). So far I don't know of any replacement for it. I haven't tried tf.RecordInput yet.

If you already have the input as tensors, IMHO the callback could be relatively easily modified to use them. Instead of feeding individual batches from numpy to features_batch_next, we would just provide slices from the input tensor. But it would need a bit of time to think about how exactly it should be connected together.