tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

*bug* Error when using an RNN with multi-GPU #823

Open xiang-deng opened 6 years ago

xiang-deng commented 6 years ago

I added a recurrent layer to the model body, because in my model the current output depends on the previous one. It runs well on a single GPU but fails with multiple GPUs. The log looks like this:

INFO:tensorflow:Cannot use 'Identity_264' as input to 'Identity_129' because they are in different while loops.

Identity_264 while context: hi_trans/parallel_1_9/hi_trans/body/rnn/while/while_context
Identity_129 while context: hi_trans/parallel_0_9/hi_trans/body/rnn/while/while_context

Traceback for Identity_264:
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/t2t_train.py", line 58, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/t2t_train.py", line 54, in main
    t2t_trainer.main(flags_passthrough)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 337, in main
    execute_schedule(exp)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 287, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 388, in train
    saving_listeners=self._saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 868, in _call_train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 874, in wrapping_model_fn
    use_tpu=use_tpu)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 925, in estimator_model_fn
    logits, losses_dict = model(features)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 696, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 139, in call
    sharded_logits, losses = self.model_fn_sharded(sharded_features)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 191, in model_fn_sharded
    sharded_logits, sharded_losses = dp(self.model_fn, datashard_to_features)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 254, in __call__
    outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 219, in model_fn
    body_out = self.body(transformed_features)
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/sentence_selection.py", line 863, in body
    dtype=tf.float32)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 632, in dynamic_rnn
    dtype=dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 829, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3096, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2874, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2814, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3075, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 800, in _time_step
    (output, new_state) = call_cell()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 786, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/sentence_selection.py", line 585, in __call__
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 680, in __call__
    self.build(input_shapes)
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/sentence_selection.py", line 651, in build
    shape=[input_depth, 1])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 533, in add_variable
    partitioner=partitioner)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1297, in get_variable
    constraint=constraint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1093, in get_variable
    constraint=constraint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 431, in get_variable
    return custom_getter(**custom_getter_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 206, in daisy_chain_getter
    v = tf.identity(last_device_v)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 131, in identity
    return gen_array_ops.identity(input, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2051, in identity
    "Identity", input=input, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

Traceback for Identity_129:
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/t2t_train.py", line 58, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/t2t_train.py", line 54, in main
    t2t_trainer.main(flags_passthrough)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 337, in main
    execute_schedule(exp)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 287, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 388, in train
    saving_listeners=self._saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 868, in _call_train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 812, in _train_model
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 793, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 874, in wrapping_model_fn
    use_tpu=use_tpu)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 925, in estimator_model_fn
    logits, losses_dict = model(features)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 696, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 139, in call
    sharded_logits, losses = self.model_fn_sharded(sharded_features)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 191, in model_fn_sharded
    sharded_logits, sharded_losses = dp(self.model_fn, datashard_to_features)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 254, in __call__
    outputs.append(fns[i](*my_args[i], **my_kwargs[i]))
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/t2t_model.py", line 219, in model_fn
    body_out = self.body(transformed_features)
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/sentence_selection.py", line 863, in body
    dtype=tf.float32)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 632, in dynamic_rnn
    dtype=dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 829, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3096, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2874, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2814, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3075, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 800, in _time_step
    (output, new_state) = call_cell()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/rnn.py", line 786, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/sentence_selection.py", line 585, in __call__
    *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 680, in __call__
    self.build(input_shapes)
  File "/hdfs/pnrsy/v-xiden/philly_workspace/sentence_selection/sentence_selection.py", line 651, in build
    shape=[input_depth, 1])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/layers/base.py", line 533, in add_variable
    partitioner=partitioner)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1297, in get_variable
    constraint=constraint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 1093, in get_variable
    constraint=constraint)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py", line 431, in get_variable
    return custom_getter(**custom_getter_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensor2tensor/utils/expert_utils.py", line 209, in daisy_chain_getter
    v = tf.identity(var._ref())  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 131, in identity
    return gen_array_ops.identity(input, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2051, in identity
    "Identity", input=input, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

I use the raw tf.nn.dynamic_rnn. Is any extra modification needed for it to work with multiple GPUs? Thanks in advance.

mapingshuo commented 6 years ago

In my case, I am training an lstm_seq2seq model with the lstm_seq2seq hparams, and I get the following error when I try to use multiple GPUs: "ValueError: Cannot use 'lstm_seq2seq/parallel_1_5/lstm_seq2seq/body/lstm_seq2seq/encoder/rnn/while/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/MatMul' as input to 'Identity_2' because they are in different while loops. See info log for more details."

colmantse commented 5 years ago

I want to follow up on this issue. I recently modified body and introduced a while loop into it, which gives an error very similar to the one from @mapingshuo. The code works fine on a single GPU but crashes with multiple GPUs, raising a ValueError that involves two Identity ops from different name scopes (parallel_0_5 and parallel_1_5, for instance). Disabling daisy_chain_variables gets past this error but leads to an even worse mess.

As I understand it, this happens because the parallel scope in data_parallelism is only a name_scope, so it does not prevent a variable from parallel_1 being fetched as an input in parallel_0, and that is what triggers the while_loop check error.

On the other hand, fetching variables across the different copies might not be a bad idea, since I believe that is how multi-GPU models are trained. It just unfortunately doesn't work when a while_loop is involved.

Maybe modifying daisy_chain_getter to add a regex check that skips while-loop variables would work? It would be great to get a pointer on how to tackle this issue.
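The reasoning above can be illustrated with a toy model in plain Python. This is only a sketch, not TensorFlow or tensor2tensor code: the Op class, daisy_chain_get, and build_shards are hypothetical names, and the loop check is a simplified version of the rule quoted in the log. It assumes, as the two tracebacks suggest, that the variable is first requested inside each shard's while loop, so the daisy chain tries to copy a tensor created in one shard's loop into the other shard's loop.

```python
class Op:
    """Toy stand-in for a graph op: remembers its inputs and enclosing loop."""

    def __init__(self, name, inputs=(), loop=None):
        for src in inputs:
            # Simplified form of the check in the log: a tensor produced
            # inside one while loop cannot feed an op in a different loop.
            if src.loop is not None and src.loop != loop:
                raise ValueError(
                    "Cannot use %r as input to %r because they are in "
                    "different while loops." % (src.name, name))
        self.name = name
        self.inputs = tuple(inputs)
        self.loop = loop


def daisy_chain_get(cache, var_name, device, loop):
    """Hypothetical, simplified daisy-chain getter (not the t2t code)."""
    key = (device, var_name)
    if key in cache:
        return cache[key]
    if var_name in cache:
        # Another device already has a copy: chain from that tensor.
        v = Op("copy_on_" + device, [cache[var_name]], loop)
    else:
        # First request: create the variable (outside any loop) and read it.
        v = Op("read_on_" + device, [Op(var_name)], loop)
    cache[var_name] = v
    cache[key] = v
    return v


def build_shards(daisy_chain, in_loop, num_shards=2):
    """Requests one variable from every shard, as data parallelism would."""
    cache = {}
    variable = Op("kernel")
    ops = []
    for i in range(num_shards):
        device = "gpu%d" % i
        loop = "parallel_%d/rnn/while" % i if in_loop else None
        if daisy_chain:
            ops.append(daisy_chain_get(cache, "kernel", device, loop))
        else:
            # Every shard reads the variable directly, with no chaining.
            ops.append(Op("read_on_" + device, [variable], loop))
    return ops


if __name__ == "__main__":
    build_shards(daisy_chain=True, in_loop=False)   # no loops: builds fine
    build_shards(daisy_chain=False, in_loop=True)   # no chaining: builds fine
    try:
        build_shards(daisy_chain=True, in_loop=True)
    except ValueError as err:
        print(err)
```

In this toy model the failure only appears when chaining and loops are combined, which is consistent with the single-GPU run succeeding. It says nothing about what else may break once chaining is disabled in a real model.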

mzaidi59 commented 4 years ago

https://github.com/tensorflow/tensor2tensor/blob/543293cfb490121f67c6d87287b9b0c7c3670286/tensor2tensor/layers/common_hparams.py#L256

The linked setting, daisy_chain_variables, controls whether to copy variables around in a daisy chain (if true) or leave their placement to TensorFlow. It only affects multi-device training and should mostly be turned on for performance. The one exception is recurrent models: with dynamic loops it must be off.

Kindly refer to this; it worked for me!
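One way to apply this without editing the library might be to pass the setting as an hparams override on the trainer command line. The sketch below only builds the argument list; t2t_trainer_command is a hypothetical helper, the model and hparams set names are taken from the lstm_seq2seq comment above, and it assumes that --hparams accepts comma-separated name=value pairs with the boolean spelled False. The data directory, output directory, and problem flags are left out.

```python
def t2t_trainer_command(model, hparams_set, worker_gpu, overrides):
    """Builds an argument list for t2t-trainer with hparams overrides."""
    # Assumption: overrides are passed as comma-separated name=value pairs.
    hparams = ",".join(
        "%s=%s" % (name, value) for name, value in sorted(overrides.items()))
    return [
        "t2t-trainer",
        "--model=%s" % model,
        "--hparams_set=%s" % hparams_set,
        "--worker_gpu=%d" % worker_gpu,
        "--hparams=%s" % hparams,
    ]


# Turn the daisy chain off for a recurrent model trained on two GPUs.
command = t2t_trainer_command(
    "lstm_seq2seq", "lstm_seq2seq", 2, {"daisy_chain_variables": False})
print(" ".join(command))
```

The resulting list can be handed to subprocess.run once the remaining required flags are appended.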