google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Not fine-tuning on GPU even with mesh_shape and mesh_devices correctly configured #137

Closed: mcompute closed this issue 4 years ago

mcompute commented 4 years ago

I have attempted to re-create the fine-tuning experiment from t5-trivia.ipynb on my Ubuntu machine, using a GPU.

I have followed the recommendation in issue #107 to configure `mesh_shape="model:1,batch:1"` and `mesh_devices=["gpu:0"]`, and observed that all GPU memory is allocated, but GPU utilization is below 10% and usually 0%. The fine-tuning step still runs on the CPU.

Is this the expected behavior, or is there something else I need to configure?
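
Before suspecting the mesh layout, it is worth confirming that the installed TensorFlow build can see the GPU at all. A minimal check, assuming a TF 1.x environment like the one used in this thread:

```python
# Sanity check: does this TensorFlow build see any GPU device?
from tensorflow.python.client import device_lib  # TF 1.x-era utility

print([d.name for d in device_lib.list_local_devices()])
# A GPU-enabled build lists '/device:GPU:0' alongside '/device:CPU:0';
# if no GPU appears here, no mesh_shape/mesh_devices setting can help.
```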

craffel commented 4 years ago

Do you mind posting a gist of your current code and I can play around with it?

mcompute commented 4 years ago

Using the notebook t5-trivia.ipynb, I have set TPU_ADDRESS = None and TPU_TOPOLOGY = None and modified the code below to support GPU training, as recommended.

```python
model = t5.models.MtfModel(
    model_dir=MODEL_DIR,
    tpu=TPU_ADDRESS,
    tpu_topology=TPU_TOPOLOGY,
    model_parallelism=model_parallelism,
    batch_size=train_batch_size,
    sequence_length={"inputs": 128, "targets": 32},
    mesh_shape="model:1,batch:1",
    mesh_devices=["gpu:0"],
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=None,
    iterations_per_loop=100,
)
```
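
The fine-tuning itself is then launched as in the notebook; a sketch of that call, where the task name and the two uppercase constants are placeholders standing in for the notebook's actual values:

```python
# Sketch: launch fine-tuning as in t5-trivia.ipynb. "trivia_all",
# PRETRAINED_DIR, and FINETUNE_STEPS are placeholders, not verified values.
model.finetune(
    mixture_or_task_name="trivia_all",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=FINETUNE_STEPS,
)
```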

mcompute commented 4 years ago

Also worth mentioning: if you are using the nightly versions of TensorFlow/TensorFlow-Text/Mesh-TensorFlow, you will need to modify MtfModel so that the model can run on CPU/GPU instead of TPU.

Specifically, I changed the following code. In the `train` function:

```python
utils.train_model(
    # Pass disable_tpu=True whenever no TPU address is configured
    # (`False if self._tpu else True` is just `not self._tpu`).
    self.estimator(vocabulary, init_checkpoint, False if self._tpu else True),
    vocabulary,
    self._sequence_length,
    self.batch_size,
    dataset_fn,
    steps,
    self._ensemble_inputs,
    dataset_split=split)
```

In the `estimator` function:

```python
utils.get_estimator(
    model_type=self._model_type,
    vocabulary=vocabulary,
    layout_rules=self._layout_rules,
    mesh_shape=self._mesh_shape if self._mesh_shape else mtf.Shape([]),
    mesh_devices=self._mesh_devices,
    model_dir=self._model_dir,
    batch_size=self.batch_size,
    sequence_length=self._sequence_length,
    autostack=self._autostack,
    learning_rate_schedule=self._learning_rate_schedule,
    keep_checkpoint_max=self._keep_checkpoint_max,
    save_checkpoints_steps=self._save_checkpoints_steps,
    optimizer=self._optimizer,
    predict_fn=self._predict_fn,
    variable_filter=self._variable_filter,
    ensemble_inputs=self._ensemble_inputs,
    # Fall back to CPU/GPU when the new disable_tpu flag is set.
    use_tpu=False if disable_tpu else self._tpu,
    tpu_job_name=self._tpu_job_name,
    iterations_per_loop=self._iterations_per_loop,
    cluster=self._cluster,
    init_checkpoint=init_checkpoint)
```

The output of fine-tuning is as follows:

```
INFO:tensorflow:Using config: {'_model_dir': './models/small', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
WARNING:tensorflow:From /local/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /local/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.

INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset trivia_qa (./datasets/trivia_qa/unfiltered.nocontext/1.1.0)
INFO:absl:Constructing tf.data.Dataset for split train, from ./datasets/trivia_qa/unfiltered.nocontext/1.1.0

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU/GPU
INFO:tensorflow:feature inputs : Tensor("Reshape:0", shape=(1, 256, 128), dtype=int32)
WARNING:tensorflow:From /workshop/text-to-text-transfer-transformer/mesh_tensorflow/transformer/utils.py:386: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:

INFO:tensorflow:feature inputs_position : Tensor("Reshape_1:0", shape=(1, 256, 128), dtype=int32)
INFO:tensorflow:feature targets : Tensor("Reshape_2:0", shape=(1, 256, 32), dtype=int32)
INFO:tensorflow:feature targets_position : Tensor("Reshape_3:0", shape=(1, 256, 32), dtype=int32)
INFO:tensorflow:feature inputs_segmentation : Tensor("Reshape_4:0", shape=(1, 256, 128), dtype=int32)
INFO:tensorflow:feature targets_segmentation : Tensor("Reshape_5:0", shape=(1, 256, 32), dtype=int32)
INFO:tensorflow:serialize_num_microbatches: tokens_per_microbatch_per_replica=8192 batch_dim=Dimension(name='batch', size=256) sequence_length={'inputs': 128, 'targets': 32} batch_per_replica=256 num_microbatches=4
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias
 The initialzer will guess the input and output dimensions based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias
 The initialzer will guess the input and output dimensions based on dimension order.
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias size 256 slice_size 256 Shape[heads=8, buckets=32]
INFO:tensorflow:Variable decoder/block_000/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_000/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_000/layer_001/EncDecAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_000/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_000/layer_002/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable decoder/block_000/layer_002/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable decoder/block_000/layer_002/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_001/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_001/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_001/layer_001/EncDecAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_001/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_001/layer_002/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable decoder/block_001/layer_002/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable decoder/block_001/layer_002/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_002/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_002/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]

INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_002/layer_001/EncDecAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_002/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_002/layer_002/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable decoder/block_002/layer_002/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable decoder/block_002/layer_002/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_003/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_003/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_003/layer_001/EncDecAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_003/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_003/layer_002/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable decoder/block_003/layer_002/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable decoder/block_003/layer_002/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_004/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_004/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_004/layer_001/EncDecAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_004/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_004/layer_002/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable decoder/block_004/layer_002/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable decoder/block_004/layer_002/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_005/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_005/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_005/layer_001/EncDecAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable decoder/block_005/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/block_005/layer_002/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]

INFO:tensorflow:Variable decoder/block_005/layer_002/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable decoder/block_005/layer_002/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable decoder/final_layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias size 256 slice_size 256 Shape[heads=8, buckets=32]
INFO:tensorflow:Variable encoder/block_000/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_000/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_000/layer_001/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable encoder/block_000/layer_001/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable encoder/block_000/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_001/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_001/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_001/layer_001/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable encoder/block_001/layer_001/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable encoder/block_001/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_002/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_002/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_002/layer_001/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable encoder/block_002/layer_001/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable encoder/block_002/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_003/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_003/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_003/layer_001/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable encoder/block_003/layer_001/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable encoder/block_003/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_004/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_004/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_004/layer_001/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable encoder/block_004/layer_001/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable encoder/block_004/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]

INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/k size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/o size 262144 slice_size 262144 Shape[heads=512, d_model=512]
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/q size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_005/layer_000/SelfAttention/v size 262144 slice_size 262144 Shape[d_model=512, heads=512]
INFO:tensorflow:Variable encoder/block_005/layer_000/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/block_005/layer_001/DenseReluDense/wi/kernel size 1048576 slice_size 1048576 Shape[d_model=512, d_ff=2048]
INFO:tensorflow:Variable encoder/block_005/layer_001/DenseReluDense/wo/kernel size 1048576 slice_size 1048576 Shape[d_ff=2048, d_model=512]
INFO:tensorflow:Variable encoder/block_005/layer_001/layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable encoder/final_layer_norm/scale size 512 slice_size 512 Shape[d_model=512]
INFO:tensorflow:Variable shared/embedding size 16449536 slice_size 16449536 Shape[vocab=32128, d_model=512]
INFO:tensorflow:Trainable Variables count: 131 Total size: 60506624 Total slice_size: 60506624
INFO:tensorflow:All Variables count: 359 Total size: 60691328 Total slice_size: 60691328
INFO:tensorflow:Counters:
  allconcat: 8.19e+03
  allconcat/1: 8.19e+03
  allconcat/1/reshape_op: 8.19e+03
  allreduce: 2.1e+08
  allreduce/[0]: 1.49e+08
  allreduce/[0]/einsum_op: 1.49e+08
  allreduce/[0]/reduce_op: 5.81e+04
  allreduce/[1]: 6.06e+07
  allreduce/[1]/einsum_op: 6.05e+07
  allreduce/[1]/reduce_op: 1.39e+05
  einsum: 1.31e+12
  einsum_unique: 1.31e+12
  output: 7.28e+08
  output/AddOperation: 1.85e+05
  output/Constant: 1
  output/EinsumOperation: 2.42e+08
  output/ImportOperation: 1.23e+05
  output/MinMaxOperation: 262
  output/ReduceOperation: 1.68e+05
  output/ReshapeOperation: 1.31e+05
  output/ScalarAddOperation: 6.05e+07
  output/ScalarMultiplyOperation: 5.38e+05
  output/SlicewiseOperation: 3.03e+08
  output/Variable: 6.07e+07
  output/WhileLoopOperation: 6.05e+07
  output_unique: 7.28e+08
  output_unique/AddOperation: 1.85e+05
  output_unique/Constant: 1
  output_unique/EinsumOperation: 2.42e+08
  output_unique/ImportOperation: 1.23e+05
  output_unique/MinMaxOperation: 262
  output_unique/ReduceOperation: 1.68e+05
  output_unique/ReshapeOperation: 1.31e+05
  output_unique/ScalarAddOperation: 6.05e+07
  output_unique/ScalarMultiplyOperation: 5.38e+05
  output_unique/SlicewiseOperation: 3.03e+08
  output_unique/Variable: 6.07e+07
  output_unique/WhileLoopOperation: 6.05e+07
  variables: 6.07e+07
  variables/trainable: 6.05e+07
  variables/untrainable: 1.85e+05
INFO:tensorflow:Initializing variables from ./pretrained_models/small/model.ckpt-1000000:
INFO:tensorflow:Variables in ./pretrained_models/small/model.ckpt-1000000 but not in graph:
INFO:tensorflow:
INFO:tensorflow:Variables in graph but not in ./pretrained_models/small/model.ckpt-1000000:
INFO:tensorflow:
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting the session.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./models/small/model.ckpt-1000000
WARNING:tensorflow:From /local/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1077: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Before copy master to slices.
INFO:tensorflow:Done with copy master to slices.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1000000...
INFO:tensorflow:Before Save.
INFO:tensorflow:About to write a checkpoint
INFO:tensorflow:Saving checkpoints for 1000000 into ./models/small/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1000000...
INFO:tensorflow:Done writing checkpoint.
```

Even with the pre-trained small model, the actual fine-tuning runs on the CPU. GPU memory is fully allocated, but GPU compute utilization stays near 0%.

masa-ita commented 4 years ago

I've experienced a similar issue while training a small model on the TFDS Japanese Wikipedia dataset. Moreover, after 4 steps, tf_mesh_transformer seems to hang.

My operative_config.gin is as follows:

```
import mesh_tensorflow.optimize
import mesh_tensorflow.transformer.learning_rate_schedules
import mesh_tensorflow.transformer.transformer_layers
import t5.data.mixtures
import t5.data.sentencepiece_vocabulary
import t5.models.mesh_transformer

# Macros:
# ==============================================================================
d_ff = 2048
d_kv = 64
d_model = 512
dropout_rate = 0.1
inputs_length = 512
mean_noise_span_length = 3.0
MIXTURE_NAME = 'wikipedia_20190301.ja_unsupervised'
noise_density = 0.15
num_heads = 8
num_layers = 6
targets_length = @preprocessors.random_spans_targets_length()

# Parameters for AdafactorOptimizer:
# ==============================================================================
AdafactorOptimizer.beta1 = 0.0
AdafactorOptimizer.clipping_threshold = 1.0
AdafactorOptimizer.decay_rate = None
AdafactorOptimizer.epsilon1 = 1e-30
AdafactorOptimizer.epsilon2 = 0.001
AdafactorOptimizer.factored = True
AdafactorOptimizer.min_dim_size_to_factor = 128
AdafactorOptimizer.multiply_by_parameter_scale = True

# Parameters for Bitransformer:
# ==============================================================================
Bitransformer.shared_embedding = True

# Parameters for denoise:
# ==============================================================================
denoise.inputs_fn = @preprocessors.noise_span_to_unique_sentinel
denoise.noise_density = %noise_density
denoise.noise_mask_fn = @preprocessors.random_spans_noise_mask
denoise.targets_fn = @preprocessors.nonnoise_span_to_unique_sentinel

# Parameters for decoder/DenseReluDense:
# ==============================================================================
decoder/DenseReluDense.activation = 'relu'
decoder/DenseReluDense.dropout_rate = %dropout_rate
decoder/DenseReluDense.hidden_size = %d_ff

# Parameters for encoder/DenseReluDense:
# ==============================================================================
encoder/DenseReluDense.activation = 'relu'
encoder/DenseReluDense.dropout_rate = %dropout_rate
encoder/DenseReluDense.hidden_size = %d_ff

# Parameters for decoder/EncDecAttention:
# ==============================================================================
# None.

# Parameters for get_sentencepiece_model_path:
# ==============================================================================
get_sentencepiece_model_path.mixture_or_task_name = %MIXTURE_NAME

# Parameters for get_variable_dtype:
# ==============================================================================
get_variable_dtype.activation_dtype = 'bfloat16'

# Parameters for get_vocab_embedding_cls:
# ==============================================================================
# None.

# Parameters for decoder/LayerStack:
# ==============================================================================
decoder/LayerStack.dropout_rate = %dropout_rate
decoder/LayerStack.norm_epsilon = 1e-06

# Parameters for encoder/LayerStack:
# ==============================================================================
encoder/LayerStack.dropout_rate = %dropout_rate
encoder/LayerStack.norm_epsilon = 1e-06

# Parameters for learning_rate_schedule_noam:
# ==============================================================================
learning_rate_schedule_noam.linear_decay_fraction = 0.0
learning_rate_schedule_noam.multiplier = 1.0
learning_rate_schedule_noam.offset = 0
learning_rate_schedule_noam.warmup_steps = 10000

# Parameters for make_bitransformer:
# ==============================================================================
make_bitransformer.decoder_name = 'decoder'
make_bitransformer.encoder_name = 'encoder'

# Parameters for decoder/make_layer_stack:
# ==============================================================================
decoder/make_layer_stack.block_scope = True
decoder/make_layer_stack.layers = \
    [@mesh_tensorflow.transformer.transformer_layers.SelfAttention,
     @mesh_tensorflow.transformer.transformer_layers.EncDecAttention,
     @mesh_tensorflow.transformer.transformer_layers.DenseReluDense]
decoder/make_layer_stack.num_layers = %num_layers
decoder/make_layer_stack.use_universal_transformer = False

# Parameters for encoder/make_layer_stack:
# ==============================================================================
encoder/make_layer_stack.block_scope = True
encoder/make_layer_stack.layers = \
    [@mesh_tensorflow.transformer.transformer_layers.SelfAttention,
     @mesh_tensorflow.transformer.transformer_layers.DenseReluDense]
encoder/make_layer_stack.num_layers = %num_layers
encoder/make_layer_stack.use_universal_transformer = False

# Parameters for mesh_train_dataset_fn:
# ==============================================================================
mesh_train_dataset_fn.mixture_or_task_name = %MIXTURE_NAME
mesh_train_dataset_fn.use_cached = False

# Parameters for noise_span_to_unique_sentinel:
# ==============================================================================
# None.

# Parameters for nonnoise_span_to_unique_sentinel:
# ==============================================================================
# None.

# Parameters for num_parallel_calls:
# ==============================================================================
num_parallel_calls.deterministic = False

# Parameters for pack_dataset:
# ==============================================================================
pack_dataset.use_custom_ops = False

# Parameters for pack_or_pad:
# ==============================================================================
# None.

# Parameters for random_spans_helper:
# ==============================================================================
random_spans_helper.extra_tokens_per_span_inputs = 1
random_spans_helper.extra_tokens_per_span_targets = 1
random_spans_helper.inputs_length = %inputs_length
random_spans_helper.mean_noise_span_length = %mean_noise_span_length
random_spans_helper.noise_density = %noise_density

# Parameters for targets_length/random_spans_helper:
# ==============================================================================
targets_length/random_spans_helper.extra_tokens_per_span_inputs = 1
targets_length/random_spans_helper.extra_tokens_per_span_targets = 1
targets_length/random_spans_helper.inputs_length = %inputs_length
targets_length/random_spans_helper.mean_noise_span_length = %mean_noise_span_length
targets_length/random_spans_helper.noise_density = %noise_density

# Parameters for random_spans_noise_mask:
# ==============================================================================
random_spans_noise_mask.mean_noise_span_length = %mean_noise_span_length

# Parameters for targets_length/random_spans_targets_length:
# ==============================================================================
# None.

# Parameters for random_spans_tokens_length:
# ==============================================================================
# None.

# Parameters for reduce_concat_tokens:
# ==============================================================================
reduce_concat_tokens.batch_size = 128
reduce_concat_tokens.feature_key = 'targets'

# Parameters for run:
# ==============================================================================
run.autostack = True
run.batch_size = ('tokens_per_batch', 65536)
run.dataset_split = 'train'
run.ensemble_inputs = None
run.eval_checkpoint_step = None
run.eval_dataset_fn = None
run.eval_summary_dir = None
run.export_path = ''
run.init_checkpoint = None
run.iterations_per_loop = 100
run.keep_checkpoint_max = None
run.layout_rules = \
    'ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch'
run.learning_rate_schedule = @learning_rate_schedules.learning_rate_schedule_noam
run.mesh_devices = ['gpu:0']
run.mesh_shape = 'model:1,batch:1'
run.mode = 'train'
run.model_type = 'bitransformer'
run.optimizer = @optimize.AdafactorOptimizer
run.perplexity_eval_steps = 100
run.predict_fn = None
run.save_checkpoints_steps = 5000
run.sequence_length = {'inputs': %inputs_length, 'targets': %targets_length}
run.train_dataset_fn = @t5.models.mesh_transformer.mesh_train_dataset_fn
run.train_steps = 524288
run.variable_filter = None
run.vocabulary = @t5.data.sentencepiece_vocabulary.SentencePieceVocabulary()

# Parameters for select_random_chunk:
# ==============================================================================
select_random_chunk.feature_key = 'targets'
select_random_chunk.max_length = 65536

# Parameters for decoder/SelfAttention:
# ==============================================================================
decoder/SelfAttention.attention_func = None
decoder/SelfAttention.attention_kwargs = None
decoder/SelfAttention.dropout_rate = %dropout_rate
decoder/SelfAttention.key_value_size = %d_kv
decoder/SelfAttention.num_heads = %num_heads
decoder/SelfAttention.num_memory_heads = 0
decoder/SelfAttention.relative_attention_num_buckets = 32
decoder/SelfAttention.relative_attention_type = 'bias_shared'
decoder/SelfAttention.shared_kv = False

# Parameters for encoder/SelfAttention:
# ==============================================================================
encoder/SelfAttention.attention_func = None
encoder/SelfAttention.attention_kwargs = None
encoder/SelfAttention.dropout_rate = %dropout_rate
encoder/SelfAttention.key_value_size = %d_kv
encoder/SelfAttention.num_heads = %num_heads
encoder/SelfAttention.num_memory_heads = 0
encoder/SelfAttention.relative_attention_num_buckets = 32
encoder/SelfAttention.relative_attention_type = 'bias_shared'
encoder/SelfAttention.shared_kv = False

# Parameters for SentencePieceVocabulary:
# ==============================================================================
SentencePieceVocabulary.extra_ids = 100
SentencePieceVocabulary.sentencepiece_model_file = \
    @t5.models.mesh_transformer.get_sentencepiece_model_path()

# Parameters for serialize_num_microbatches:
# ==============================================================================
serialize_num_microbatches.tokens_per_microbatch_per_replica = 2048

# Parameters for shift_targets:
# ==============================================================================
shift_targets.bos_id = 0
shift_targets.eos_id = 1

# Parameters for split_tokens:
# ==============================================================================
split_tokens.feature_key = 'targets'
split_tokens.max_tokens_per_segment = @preprocessors.random_spans_tokens_length()
split_tokens.min_tokens_per_segment = None

# Parameters for tpu_estimator_model_fn:
# ==============================================================================
tpu_estimator_model_fn.model_info_file = None
tpu_estimator_model_fn.outer_batch_size = 1
tpu_estimator_model_fn.tpu_summaries = False

# Parameters for decoder/Unitransformer:
# ==============================================================================
decoder/Unitransformer.d_model = %d_model
decoder/Unitransformer.ensemble = None
decoder/Unitransformer.input_full_attention = False
decoder/Unitransformer.label_smoothing = 0.0
decoder/Unitransformer.loss_denominator = None
decoder/Unitransformer.loss_fn = None
decoder/Unitransformer.loss_on_targets_only = False
decoder/Unitransformer.max_length = 512
decoder/Unitransformer.positional_embedding = False
decoder/Unitransformer.shared_embedding_and_softmax_weights = True
decoder/Unitransformer.token_dropout_rate = 0.0
decoder/Unitransformer.vocab_divisor = 128
decoder/Unitransformer.z_loss = 0.0001

# Parameters for encoder/Unitransformer:
# ==============================================================================
encoder/Unitransformer.d_model = %d_model
encoder/Unitransformer.ensemble = None
encoder/Unitransformer.input_full_attention = False
encoder/Unitransformer.label_smoothing = 0.0
encoder/Unitransformer.loss_denominator = None
encoder/Unitransformer.loss_fn = None
encoder/Unitransformer.loss_on_targets_only = False
encoder/Unitransformer.max_length = 512
encoder/Unitransformer.positional_embedding = False
encoder/Unitransformer.shared_embedding_and_softmax_weights = True
encoder/Unitransformer.token_dropout_rate = 0.0
encoder/Unitransformer.vocab_divisor = 128
encoder/Unitransformer.z_loss = 0.0001

# Parameters for unsupervised:
# ==============================================================================
unsupervised.preprocessors = \
    [@preprocessors.select_random_chunk,
     @preprocessors.reduce_concat_tokens,
     @preprocessors.split_tokens,
     @preprocessors.denoise]

# Parameters for VarianceScalingInitializer:
# ==============================================================================
VarianceScalingInitializer.distribution = 'normal'
VarianceScalingInitializer.mode = 'fan_in'
VarianceScalingInitializer.scale = 1.0

# Parameters for VocabEmbedding:
# ==============================================================================
# None.
```
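
As an aside for readers of this config: `targets_length` above is computed at run time by `@preprocessors.random_spans_targets_length()`. The following pure-Python sketch mirrors the length arithmetic of `t5.data.preprocessors.random_spans_helper` for these settings (`extra_tokens_per_span_inputs = extra_tokens_per_span_targets = 1`); the function name is mine, and it is a sketch of the helper, not a drop-in replacement:

```python
def span_corruption_lengths(inputs_length=512, noise_density=0.15,
                            mean_noise_span_length=3.0):
    """Sketch of random_spans_helper's length arithmetic."""
    def lengths(tokens_length):
        num_noise = int(round(tokens_length * noise_density))
        num_spans = int(round(num_noise / mean_noise_span_length))
        # Inputs keep the non-noise tokens, one sentinel per span, and EOS;
        # targets hold the noise tokens, one sentinel per span, and EOS.
        return (tokens_length - num_noise + num_spans + 1,
                num_noise + num_spans + 1)

    # Longest raw token length whose corrupted inputs still fit inputs_length.
    tokens_length = inputs_length
    while lengths(tokens_length + 1)[0] <= inputs_length:
        tokens_length += 1
    return lengths(tokens_length)

print(span_corruption_lengths())  # (512, 114) for the settings in this config
```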

mcompute commented 4 years ago

I share the same sentiment about the mesh transformer's performance in a CPU/GPU environment.

adarob commented 4 years ago

This is now fixed (for real!) in v0.3.2.
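
For later readers: after upgrading to the release mentioned above (`pip install -U t5`, v0.3.2 or newer), re-running the configuration quoted earlier in this thread should be sufficient. A condensed sketch, with the batch size and parallelism values as placeholders:

```python
# Sketch of the post-fix setup (t5 >= 0.3.2); mirrors the configuration
# quoted earlier in this thread. batch_size and model_parallelism here
# are placeholders, not recommended values.
import t5

model = t5.models.MtfModel(
    model_dir="./models/small",
    tpu=None,                      # no TPU address: run on CPU/GPU
    tpu_topology=None,
    model_parallelism=1,
    batch_size=16,
    sequence_length={"inputs": 128, "targets": 32},
    mesh_shape="model:1,batch:1",  # one model shard, one data shard
    mesh_devices=["gpu:0"],
    learning_rate_schedule=0.003,
    save_checkpoints_steps=5000,
    keep_checkpoint_max=None,
    iterations_per_loop=100,
)
```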