google / seq2seq

A general-purpose encoder-decoder framework for Tensorflow
https://google.github.io/seq2seq/
Apache License 2.0

How to use Multiple GPUs? #44

Open papajohn opened 7 years ago

papajohn commented 7 years ago

I think seq2seq training is not using multiple GPUs. The tokens/sec metric is the same whether I train on a VM with 1 GPU or one with 4 GPUs.

Can someone provide a demo of how to use 4 GPUs on a single machine? All I found in the docs was https://google.github.io/seq2seq/training/#distributed-training . That links to an example of using multiple devices with tf.device and an example of using a cluster with tf.learn, but I couldn't figure out how to proceed with either approach. Thanks!

Running python -m bin.train as specified in https://google.github.io/seq2seq/nmt/ ...

Four devices are found (from logs):

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y N N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1:   N Y N N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2:   N N Y N 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3:   N N N Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: a370:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 9f8e:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: b265:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 8743:00:00.0)

Memory is allocated to all 4, but only one GPU has non-zero utilization.

$ nvidia-smi 
Tue Mar 14 19:42:15 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 8743:00:00.0     Off |                    0 |
| N/A   50C    P0    74W / 149W |  10363MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 9F8E:00:00.0     Off |                    0 |
| N/A   78C    P0    67W / 149W |  10363MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | A370:00:00.0     Off |                    0 |
| N/A   74C    P0    94W / 149W |  10402MiB / 11439MiB |     46%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | B265:00:00.0     Off |                    0 |
| N/A   62C    P0    64W / 149W |  10363MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
dennybritz commented 7 years ago

Nothing about GPU device placement is hardcoded, so TensorFlow should handle the placement on its own. I usually train with only 1 GPU (but multiple workers), so I haven't tried the multi-GPU case.

Can you try running a larger model, e.g. nmt_large.yml instead of nmt_small.yml as your config? It could be that TF decides the small model is not worth splitting across GPUs; with a larger one it will hopefully put the computation on separate devices. If that doesn't work, we may need to add tf.device statements to put different RNN layers on different GPUs.
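
Roughly, a sketch of that kind of placement could look like the following. This is purely hypothetical (the layer count, cell size, and function name are placeholders, not the framework's actual encoder code):

import tensorflow as tf

def build_stacked_encoder(inputs, sequence_length, num_units=512, num_gpus=4):
    # Hypothetical sketch: pin each RNN layer of a stacked encoder to its own
    # GPU so the layers can run on different devices.
    outputs = inputs
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i):
            cell = tf.contrib.rnn.LSTMCell(num_units)
            outputs, _ = tf.nn.dynamic_rnn(
                cell, outputs, sequence_length=sequence_length,
                dtype=tf.float32, scope="encoder_layer_%d" % i)
    return outputs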

papajohn commented 7 years ago

I was using nmt_large.yml above. Thanks for the quick response!

python -m bin.train \
  --config_paths="./example_configs/nmt_large.yml,./example_configs/train_seq2seq.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
       source_files:
        - $DEV_SOURCES
       target_files:
        - $DEV_TARGETS" \
  --batch_size 32 \
  --buckets 8,12,16,20,24,28,32,36,40 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR
papajohn commented 7 years ago

By the way, I was just expecting data parallelism: different batches being processed on different GPUs. That sounds very similar to your multiple-worker setup, just on one machine. (But I still don't know how to invoke that, if it's even possible.)

dennybritz commented 7 years ago

I see. I think data parallelism on a single machine is not too common for seq2seq models, but people have found that putting different RNN layers on separate devices speeds things up, and we should do that when more than 1 GPU is available.

I will need to look into data parallelism on multiple GPUs. In the best case, all we need to do is instantiate the model once per GPU and average the losses, in which case it may only require a few lines of code. But maybe it's more complex than that.
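
As a rough sketch of that idea (purely hypothetical, not the framework's actual training code; build_model_loss stands in for whatever function builds the per-tower loss):

import tensorflow as tf

def data_parallel_loss(build_model_loss, features, labels, num_gpus):
    # Hypothetical sketch: replicate the model once per GPU on a slice of the
    # batch, share the variables via reuse, and average the per-tower losses.
    feature_shards = tf.split(features, num_gpus, axis=0)
    label_shards = tf.split(labels, num_gpus, axis=0)
    tower_losses = []
    for i in range(num_gpus):
        with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
            tower_losses.append(build_model_loss(feature_shards[i], label_shards[i]))
    return tf.reduce_mean(tf.stack(tower_losses))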

Thanks for reporting, I'll take a look at this soon (may take 2-3 days).

papajohn commented 7 years ago

Great! Thanks for taking a look.

I think the use case is reasonably common among academics: launch a fresh 8-GPU instance on some public cloud, install/configure software, download data, & run an experiment.

OpenNMT follows this model, I believe.

dennybritz commented 7 years ago

Sounds reasonable. Will add this in the next few days.

vongruenigen commented 7 years ago

@dennybritz, may I ask what the state of this issue is? I'm currently trying to train a conversational dialogue system using this tool and would like to train the model on multiple GPUs, since our (desired) model is rather huge, with 4096 hidden units in both the encoder and the decoder, and I currently run into OOM problems as soon as the model exceeds 2048 hidden units.

I'm willing to invest some time to help you implement this feature (if needed). I already took a quick look at the code and couldn't find an obvious place to put the with tf.device(...) wrapper. As far as I understand it, the computational graph must be split into multiple parts if I want to leverage the computational power of multiple GPUs (not only their RAM). Given the nature of seq2seq models, this could be done, for example, by putting the encoder on the first GPU and the decoder on another, right? But I also see some problems: for example, does the attention mechanism still work "out of the box" if the encoder is placed on a different GPU than the decoder?

dennybritz commented 7 years ago

The original issue of parallelizing training across multiple GPUs through data parallelism is very high on my priority list and I will add that ASAP.

However, that seems different from your issue, @vongruenigen. What you want is to split the model across multiple GPUs. A model that big is not going to fit on a single GPU: as a back-of-the-envelope calculation, with a ~30k vocab and 4096 units your softmax matrix alone comes to 4096 * 3 * 30,000 * 32 = 11.7GB. So that model will not fit on a single GPU no matter what code you use. To make this work you'd need to modify the model code and use something like sampled softmax, or implement a sharded softmax yourself.

Given the nature of seq2seq models, this could be done, for example, by putting the encoder on the first GPU and the decoder on another, right? But I also see some problems: for example, does the attention mechanism still work "out of the box" if the encoder is placed on a different GPU than the decoder?

It will still work, but it's not going to help you. The vast majority of parameters/memory is usually in the softmax and the embeddings/inputs. That's what you need to split (or replace with an alternative), and there is no "obvious" way to do that, other than maybe using sampled_softmax_loss in TensorFlow. I haven't used it myself though, and it will only help you during training, not inference.
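
For reference, a minimal sketch of what using tf.nn.sampled_softmax_loss looks like in general. The shapes, sizes, and tensor names are illustrative only and not tied to this framework's code:

import tensorflow as tf

vocab_size, hidden_size, num_sampled = 30000, 4096, 512  # illustrative sizes

# decoder_outputs: [batch, hidden_size], target_ids: [batch]
decoder_outputs = tf.placeholder(tf.float32, [None, hidden_size])
target_ids = tf.placeholder(tf.int64, [None])

softmax_w = tf.get_variable("softmax_w", [vocab_size, hidden_size])
softmax_b = tf.get_variable("softmax_b", [vocab_size])

# Training-time loss: only num_sampled negative classes are evaluated per step,
# so the full vocab-sized softmax is never materialized during training.
train_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_w, biases=softmax_b,
    labels=tf.expand_dims(target_ids, -1), inputs=decoder_outputs,
    num_sampled=num_sampled, num_classes=vocab_size))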

vongruenigen commented 7 years ago

@dennybritz, I was aware that a large fraction of the parameters lives in the softmax, but I didn't realize it was that huge. I'm going to look into using a sampled/sharded softmax and try to find a solution. Thanks a lot for the quick response and the clarifying explanation!

skyw commented 7 years ago

Distributed Training is supported out of the box using tf.learn. Cluster Configurations can be specified using the TF_CONFIG environment variable, which is parsed by the RunConfig. Refer to the Distributed Tensorflow Guide for more information.

Any example of how this works?

dennybritz commented 7 years ago

Any example of how this works?

For a general introduction to distributed training settings check out the Tensorflow tutorial: https://www.tensorflow.org/deploy/distributed

I haven't seen any example of using TF_CONFIG, but check out the documentation in this file: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/learn/python/learn/estimators/run_config.py

So instead of changing the code, I believe you should be able to set all required options via that environment variable.
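
As an illustration of the idea, TF_CONFIG is just a JSON blob set in the environment before launching each process. The addresses and cluster layout below are made up; the accepted keys are documented in the run_config.py file linked above:

import json
import os

# Hypothetical cluster: each process gets the same "cluster" dict but a
# different "task" entry, and tf.learn's RunConfig parses it at startup.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["machine-a:2222"],
        "worker": ["machine-b:2222"],
        "ps": ["machine-c:2222"],
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
})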

davidecaroselli commented 7 years ago

Hi @dennybritz

Any news on this topic? I was trying to train an nmt_large model on an 8-GPU machine, but I can confirm that only one GPU was actually used.

Here's the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   70C    P0    75W / 149W |  10417MiB / 11439MiB |     71%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   52C    P0    81W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   63C    P0    65W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   55C    P0    79W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   65C    P0    64W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   50C    P0    77W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   66C    P0    67W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   54C    P0    81W / 149W |  10378MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2316    C   python                                       10407MiB |
|    1      2316    C   python                                       10368MiB |
|    2      2316    C   python                                       10368MiB |
|    3      2316    C   python                                       10368MiB |
|    4      2316    C   python                                       10368MiB |
|    5      2316    C   python                                       10368MiB |
|    6      2316    C   python                                       10368MiB |
|    7      2316    C   python                                       10368MiB |
+-----------------------------------------------------------------------------+

By the way, it seems that TensorFlow allocates memory on all of the GPUs, but only one of them does any computation. Is this expected?

Boggartfly commented 7 years ago

Interesting...

wolfshow commented 7 years ago

@davidecaroselli I have the same problem.

hrishikeshvganu commented 7 years ago

@dennybritz : wanted to know if there are any updates on this.

kirtisbhandari commented 7 years ago

I have the same issue.

liyi193328 commented 7 years ago

Are there any updates or ideas? I also want to train a model on multiple GPUs. It seems @dennybritz is busy with other things.

liyi193328 commented 7 years ago

@davidecaroselli @wolfshow I face the same problems. How did you smart guys solve them? Many thanks.

NingHongKe commented 7 years ago

waiting

stefan-it commented 7 years ago

I would recommend the tensor2tensor library; its multi-GPU support works pretty well: https://github.com/tensorflow/tensor2tensor

nptdat commented 7 years ago

@davidecaroselli Regarding the problem of all GPU memory being used: TF provides the gpu_options.allow_growth option on the session config. If it's True, TF starts with a small allocation and requests more memory as it needs it. If it's False (the default), TF allocates all of the memory at the beginning. That's why you see all of your GPU memory allocated. Ref: https://www.tensorflow.org/tutorials/using_gpu

I haven't used seq2seq yet, but looking at its bin/train.py I found a flag named gpu_allow_growth, which provides the value for the underlying gpu_options.allow_growth option. It is clearly set to False by default. I guess you can set this flag to True to ask TF to allocate memory on demand.
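
For anyone setting this in their own TensorFlow code rather than through that flag, the underlying session option looks like this (standard TF API, not specific to this repo):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
sess = tf.Session(config=config)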

yanghoonkim commented 7 years ago

@nptdat In fact, that doesn't solve these problems. I think the only ways to make full use of the GPUs are 1. data parallelism, or 2. manually assigning each layer (or group of layers) to its own GPU. However, this library seems to be abandoned...

yanghoonkim commented 7 years ago

I see. I think data parallelism on a single machine is not too common for seq2seq models, but people have found that putting different RNN layers on separate devices speeds things up, and we should do that when more than 1 GPU is available.

But the results page says that @dennybritz used 8 GPUs: https://google.github.io/seq2seq/results/

nptdat commented 7 years ago

@ad26kt Yeah, I was only talking about the memory-allocation problem, not about how to make all the GPUs do work.

imranshaikmuma commented 7 years ago

Guys, the answer to all your problems is cuDNN (https://developer.nvidia.com/cudnn). Install cuDNN from that link; install instructions are here: https://stackoverflow.com/questions/42013316/after-building-tensorflow-from-source-seeing-libcudart-so-and-libcudnn-errors

My output after I use this:

2017-07-12 15:42:33.111509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 2 3 
2017-07-12 15:42:33.111533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y Y Y 
2017-07-12 15:42:33.111539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y Y Y 
2017-07-12 15:42:33.111543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 2:   Y Y Y Y 
2017-07-12 15:42:33.111547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 3:   Y Y Y Y 

But after starting the training, I can see utilization on only one GPU. That would mean it uses 4 GPUs while training but falls back to one GPU afterwards. I feel we should query nvidia-smi from a separate connection while training. I will try and update.

UPDATE: No, I am wrong, only one GPU is being used, I guess:

netsvs123@instance-1:~$ nvidia-smi
Wed Jul 12 16:47:27 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
| N/A   57C    P0    76W / 149W |  10915MiB / 11439MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:05.0     Off |                    0 |
| N/A   71C    P0    76W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:06.0     Off |                    0 |
| N/A   49C    P0    59W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:07.0     Off |                    0 |
| N/A   68C    P0    72W / 149W |  10871MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2006    G   /usr/lib/xorg/Xorg                              15MiB |
|    0      2398    C   python3                                      10894MiB |
|    1      2398    C   python3                                      10867MiB |
|    2      2398    C   python3                                      10867MiB |
|    3      2398    C   python3                                      10865MiB |
+-----------------------------------------------------------------------------+

netsvs123@instance-1:~$ nvidia-smi
Wed Jul 12 16:47:44 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
| N/A   60C    P0    98W / 149W |  10915MiB / 11439MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:05.0     Off |                    0 |
| N/A   73C    P0    76W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:06.0     Off |                    0 |
| N/A   50C    P0    60W / 149W |  10873MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:07.0     Off |                    0 |
| N/A   69C    P0    72W / 149W |  10871MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2006    G   /usr/lib/xorg/Xorg                              15MiB |
|    0      2398    C   python3                                      10894MiB |
|    1      2398    C   python3                                      10867MiB |
|    2      2398    C   python3                                      10867MiB |
|    3      2398    C   python3                                      10865MiB |
+-----------------------------------------------------------------------------+

imranshaikmuma commented 7 years ago

I used MXNet and solved the issue. [screenshot attachment]

sampathchanda commented 7 years ago

@ad26kt No, you can use data parallelism in TensorFlow too. Refer to the CIFAR-10 multi-GPU example provided with TensorFlow.

As @nptdat mentioned, I also suspect that allow_growth being off is the reason all available memory is used. Even if you are running a single-GPU model, TensorFlow by default allocates the full memory of every GPU it can see.

In case you weren't already aware, the set of GPUs visible to an application can be controlled by prepending the run command with 'CUDA_VISIBLE_DEVICES='.
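
For example (the same restriction can also be applied from inside a script, as long as the variable is set before TensorFlow initializes CUDA):

import os

# Equivalent to prefixing the command with CUDA_VISIBLE_DEVICES=0,1;
# must run before TensorFlow is imported and initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf  # only GPUs 0 and 1 are now visible as /gpu:0 and /gpu:1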

DucVuMinh commented 7 years ago

@sampathchanda I run my model with multiple GPUs and data parallelism, but the memory of all GPUs is allocated while only one GPU is used for computation. I also ran the CIFAR-10 example code, and it behaves the same way. Can you explain why?

imranshaikmuma commented 7 years ago

@DucVuMinh By default TensorFlow grabs the memory of all GPUs, because it allocates the maximum memory for your job, but that does not use their processing power. To utilize the compute of all GPUs you need to add tf.device statements wherever you want parallel processing in your code. In TensorFlow you have to assign devices manually and also combine the gradients from all devices yourself. MXNet does this automatically: you just specify a context listing the available GPUs, and you don't have to average the losses of your model yourself. Let me know if you have any more questions.

DucVuMinh commented 7 years ago

@imranshaikmuma In my model I also use tf.device statements for parallel processing. I implemented the scenario described in the CIFAR-10 multi-GPU training example, but during training I see that only one GPU is used. And when I run the CIFAR-10 multi-GPU example itself, it still uses only one GPU while the memory of all GPUs is allocated.

imranshaikmuma commented 7 years ago

@DucVuMinh can you show me your code?

nptdat commented 7 years ago

@DucVuMinh When running cifar10_multi_gpu_train.py, did you set the num_gpus flag to a value > 1 (e.g. the number of GPUs you have)? The setting is on line 59.

For the memory problem, you can try adding one more line to set gpu_allow_growth to True. I guess this setting asks TF to allocate memory on demand rather than all of it at the beginning.

DucVuMinh commented 7 years ago

@imranshaikmuma This is my code:

import numpy as np
import tensorflow as tf

# Create an optimizer that performs gradient descent.
opt = tf.train.GradientDescentOptimizer(lr)

# Create the filename input queue.
filename_queue = tf.train.string_input_producer(arr_file_data, num_epochs=None)

# Read and decode data nodes.
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = np.ones([14491, 1])
example_vec = tf.decode_csv(value, record_defaults=record_defaults.tolist())
min_after_dequeue = 100
capacity = min_after_dequeue + 3 * option_variables.batch_size

# Create a batch node.
batch = tf.train.shuffle_batch(
    [example_vec], batch_size=option_variables.batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue)

# Split the batch into data, label, mask, weight_loss, length.
data, label, mask, weight_loss, length = tf.split(batch, [14280, 70, 70, 70, 1], 1)

# Split data, label, mask, weight_loss, length across the towers.
datas_splits = tf.split(axis=0, num_or_size_splits=FLAGS.num_gpus, value=data)
masks_splits = tf.split(axis=0, num_or_size_splits=FLAGS.num_gpus, value=mask)
weights_losss_splits = tf.split(axis=0, num_or_size_splits=FLAGS.num_gpus, value=weight_loss)
lengths_splits = tf.split(axis=0, num_or_size_splits=FLAGS.num_gpus, value=length)
label_splits = tf.split(axis=0, num_or_size_splits=FLAGS.num_gpus, value=label)

with tf.variable_scope("train") as scope:
    # Calculate the gradients for each model tower.
    tower_grads = []
    for i in range(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('/gpu_cal%d' % i):
                # Create the variables for the model.
                weights, biases, w_fw, b_fw, w_bw, b_bw = create_model_variable()
                # Get the loss and the nodes of the model to run.
                loss_ = \
                    loss(input=datas_splits[i], input_length=lengths_splits[i], lable=label_splits[i],
                         masks=masks_splits[i], W=weights, bias=biases, W_fw=w_fw, bias_fw=b_fw,
                         W_bw=w_bw, bias_bw=b_bw, weight_loss=weights_losss_splits[i])
                # Reuse variables for the next tower.
                scope.reuse_variables()
                # Calculate the gradients for the batch of data on this tower.
                grads = opt.compute_gradients(loss_)
                # Keep track of the gradients across all towers.
                tower_grads.append(grads)

    # Calculate the mean of each gradient across towers.
    grads = average_gradients(tower_grads)
    # Apply the gradients to adjust the shared variables.
    apply_gradient_op = opt.apply_gradients(grads)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = 1
    config.log_device_placement = True
    config.allow_soft_placement = True
    # Build an initialization operation to run below.
    init = tf.global_variables_initializer()
    sess = tf.Session(config=config)
    # Initialize the variables.
    sess.run(init)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord, sess=sess)
    # Loop over the epochs for training.
    for i in range(0, num_epochs):
        print("epoch ", i)
        for j in range(number_batch):
            _, _loss = sess.run([apply_gradient_op, loss_])
    coord.request_stop()
    coord.join(threads)

I'm using three layers of LSTM. Can you look it over for me? Thank you very much.
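
For reference, the average_gradients helper called above is not shown here; a minimal version along the lines of the one in TensorFlow's CIFAR-10 multi-GPU tutorial would be:

import tensorflow as tf

def average_gradients(tower_grads):
    # tower_grads: list (one entry per tower) of lists of (gradient, variable)
    # pairs as returned by optimizer.compute_gradients. Returns one averaged
    # (gradient, variable) pair per shared variable.
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, 0), 0)
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads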

kdavis-mozilla commented 7 years ago

@papajohn On one machine, I'd guess (though I have not yet tested it) that one can use multiple GPUs by creating a cluster on a single node with one worker per GPU, using the environment variables TF_CONFIG and CUDA_VISIBLE_DEVICES. You can find info on setting TF_CONFIG here[1] and info on setting CUDA_VISIBLE_DEVICES here[2].

Amitayus commented 6 years ago

I would recommend tf.learn. It is a good tool: much distributed training can be done with tf.contrib.learn.Experiment. Once an Experiment instance is created, it knows how to invoke training and eval loops in a sensible fashion for distributed training.

Benz-Tracxpoint commented 6 years ago

I am encountering the same issue. If anyone finds a solution, please keep us posted.

DucVuMinh commented 6 years ago

@Benz-Tracxpoint I also encountered this problem before, but I solved it. Check the following steps to make sure your setup and code are correct. Step 1: set the number of GPUs to > 1. Step 2: use tf.device to assign a computation job to each GPU.

mzlr commented 6 years ago

@Benz-Tracxpoint

I found a solution using slim.model_deploy. It implements in-graph-replication synchronous training for a single machine with multiple GPUs (averaging gradients as in the CIFAR-10 multi-GPU trainer).

Usage from slim.model_deploy:

with tf.Graph().as_default():
  # Set up DeploymentConfig; num_clones should not be more than the number of GPUs
  config = model_deploy.DeploymentConfig(num_clones=num_GPUs_you_want_to_use)

  # Create the global step on the device storing the variables.
  with tf.device(config.variables_device()):
    global_step = slim.create_global_step()

  # Define the inputs for each clone
  with tf.device(config.inputs_device()):
    images, labels = LoadData(...)
    inputs_queue = slim.data.prefetch_queue((images, labels))

  # Define the optimizer.
  with tf.device(config.optimizer_device()):
    optimizer = tf.train.MomentumOptimizer(FLAGS.learning_rate, FLAGS.momentum)

  # Define the model including the loss.
  def model_fn(inputs_queue):
    images, labels = inputs_queue.dequeue()
    predictions = CreateNetwork(images)
    slim.losses.log_loss(predictions, labels)
  model_dp = model_deploy.deploy(config, model_fn, [inputs_queue], optimizer=optimizer)

  # Run training.
  slim.learning.train(model_dp.train_op, my_log_dir,
                      summary_op=model_dp.summary_op)