tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Error running Transformer on TPU - official tutorial #1026

Open shriramsb opened 6 years ago

shriramsb commented 6 years ago

Description

I followed the steps exactly as given in the official tutorial for running Transformer on Cloud TPU (https://cloud.google.com/tpu/docs/tutorials/transformer), except that I used PROBLEM=translate_enfr_wmt_small8k. Running step 3 of 'Train an English-German translation model', modified for my PROBLEM, failed with: AttributeError: 'RunConfig' object has no attribute 'data_parallelism'.

Command:

t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problem=$PROBLEM \
  --train_steps=10 \
  --eval_steps=3 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --master=$TPU_MASTER

Environment information

OS:
Distributor ID: Debian
Description:    Debian GNU/Linux 9.5 (stretch)
Release:        9.5
Codename:       stretch

$ pip3 freeze | grep tensor
tensor2tensor==1.8.0
tensorboard==1.9.0
tensorflow==1.9.0

$ python3 -V
Python 3.5.3

For bugs: reproduction and error logs

Steps to reproduce:

This can be reproduced by following the tutorial (https://cloud.google.com/tpu/docs/tutorials/transformer) with PROBLEM=translate_enfr_wmt_small8k instead of 'translate_ende_wmt32k_packed'. The error occurs when running step 3 of 'Train an English-German translation model', modified for this problem.

t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_tpu \
  --problem=$PROBLEM \
  --train_steps=10 \
  --eval_steps=3 \
  --data_dir=$DATA_DIR \
  --output_dir=$OUT_DIR \
  --use_tpu=True \
  --master=$TPU_MASTER

Error logs:

WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f9a78f56c80>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_cluster': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9aa2003ba8>, 'use_tpu': True, '_save_checkpoints_steps': 1000, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=8, computation_shape=None, per_host_input_for_training=2, tpu_job_name=None, initial_infeed_sleep_secs=None), '_train_distribute': None, '_task_type': 'worker', '_service': None, '_is_chief': True, '_evaluation_master': 'grpc://10.240.1.2:8470', '_num_worker_replicas': 1, '_save_checkpoints_secs': None, '_save_summary_steps': 100, '_keep_checkpoint_max': 20, '_model_dir': 'gs://tpu_test1/training/transformer_ende_1', '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
}
, '_task_id': 0, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_global_id_in_cluster': 0, '_device_fn': None, '_master': 'grpc://10.240.1.2:8470', '_tf_random_seed': None, '_num_ps_replicas': 0}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 600 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Querying Tensorflow master (grpc://10.240.1.2:8470) for TPU system metadata.
2018-08-28 20:33:28.565636: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:7, TPU, 17179869184)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184)
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:num_partitions = 1 partition_id = 0
INFO:tensorflow:Reading data files from gs://tpu_test1/data/translate_enfr_wmt_small8k-train*
INFO:tensorflow:partition: 0 num_data_files: 100
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 32, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 385, in main
    execute_schedule(exp)
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 326, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/trainer_lib.py", line 331, in continuous_train_and_eval
    self._eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 447, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 531, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 669, in run_local
    hooks=train_hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 366, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1119, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1132, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1992, in _call_model_fn
    features, labels, mode, config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1107, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2223, in _model_fn
    _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2537, in _train_on_tpu_system
    device_assignment=ctx.device_assignment)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 733, in shard
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 394, in replicate
    device_assignment, name)[1]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu.py", line 546, in split_compile_and_replicate
    outputs = computation(*computation_inputs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2530, in multi_tpu_train_steps_on_single_shard
    single_tpu_train_step, [_INITIAL_LOSS])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 207, in repeat
    cond, body_wrapper, inputs=inputs, infeed_queue=infeed_queue, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 169, in while_loop
    name="")
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3209, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2941, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2878, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 120, in body_wrapper
    outputs = body(*(inputs + dequeue_ops))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/training_loop.py", line 203, in body_wrapper
    return [i + 1] + _convert_to_list(body(*args))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1166, in train_step
    self._call_model_fn(features, labels))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1337, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/t2t_model.py", line 1184, in wrapping_model_fn
    decode_hparams=decode_hparams)
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/t2t_model.py", line 1219, in estimator_model_fn
    data_parallelism = config.data_parallelism
AttributeError: 'RunConfig' object has no attribute 'data_parallelism'

tlatkowski commented 6 years ago

I get the same error when running the Transformer on TPU with the Librispeech problem.

prasastoadi commented 6 years ago

Try using tensor2tensor==1.7.0.

mikeymezher commented 6 years ago

The problem seems to stem from the definition of estimator_model_fn in t2t_model.py. After t2t v1.7.0 they removed the explicitly passed "use_tpu" parameter and now try to read it from the "params" dictionary parameter. The only problem is that I don't know where params is passed from. Does anyone know how to set params?
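
A minimal sketch of the failure mode described in this comment, not the actual tensor2tensor code. It assumes that use_tpu is looked up in params and that the non-TPU branch is the one that reads config.data_parallelism. FakeRunConfig, resolve_use_tpu and pick_data_parallelism are hypothetical names used only for illustration.

```python
class FakeRunConfig:
    """Hypothetical stand-in for the TPU RunConfig: has use_tpu, lacks data_parallelism."""
    use_tpu = True


def resolve_use_tpu(params=None):
    # Assumed lookup: use_tpu comes from the params dict, not from an explicit argument.
    return bool(params and params.get("use_tpu", False))


def pick_data_parallelism(config, params=None):
    if resolve_use_tpu(params):
        return None  # assumed TPU branch: data_parallelism is never read
    # Assumed non-TPU branch: fails if the config has no such attribute.
    return config.data_parallelism


try:
    pick_data_parallelism(FakeRunConfig())  # params is None, as in the log warning
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)
```

Under these assumptions, a missing params dict makes the code take the non-TPU branch even though the config itself says use_tpu is True, which would match the traceback in the report.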

nicks165 commented 6 years ago

Is this fixed?

shriramsb commented 6 years ago

Tensor2tensor 1.7.0 works. 1.8.0 doesn't.

nicks165 commented 6 years ago

You mean Tensor2tensor 1.7.0? And will it work with TensorFlow 1.9?

nicks165 commented 6 years ago

If I install T2T 1.7 and the TPU's TensorFlow version is 1.8 or 1.9, I get "No TPU cores found". With T2T > 1.7 and TPU TensorFlow version 1.8 or 1.9, it doesn't work either: AttributeError: 'RunConfig' object has no attribute 'data_parallelism'

mikeymezher commented 6 years ago

@nicks165 I've gotten around this issue on T2T v1.9.0 by modifying t2t_model.py. On line 1248, in estimator_model_fn, add: "if config.use_tpu: params = {'use_tpu': True}"

I'm not sure what the proper way to set the params dict is (this is not hparams or hparams set, and there doesn't seem to be a flag to set params). It may be safer to make params default to an empty dict and, on line 1248, add "if config.use_tpu: params['use_tpu'] = True" instead, so that params is not completely overwritten if it IS passed. (I haven't tested this, but it should work.)
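
The safer variant can be sketched as a standalone helper. This is an illustration only, not a patch to t2t_model.py: ensure_use_tpu is a hypothetical name, and the config objects here are plain stand-ins built with types.SimpleNamespace rather than real RunConfig instances.

```python
from types import SimpleNamespace


def ensure_use_tpu(config, params=None):
    """Return a copy of params with use_tpu set when the config asks for TPU."""
    # Default to an empty dict and copy, so caller-supplied keys are preserved.
    params = dict(params or {})
    if getattr(config, "use_tpu", False):
        params["use_tpu"] = True
    return params


# Stand-in configs (assumption: the real config exposes a use_tpu attribute).
tpu_config = SimpleNamespace(use_tpu=True)
cpu_config = SimpleNamespace(use_tpu=False)

print(ensure_use_tpu(tpu_config, {"batch_size": 8}))
print(ensure_use_tpu(cpu_config))
```

Copying the dict rather than assigning a new one keeps any keys the caller already passed, which is the difference between the two suggestions above.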

nicks165 commented 6 years ago

It still gives the above error when params is passed as a dictionary:

  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/utils/t2t_model.py", line 1219, in estimator_model_fn
    data_parallelism = config.data_parallelism
AttributeError: 'RunConfig' object has no attribute 'data_parallelism'