tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Tensor2Tensor doesn't train on Windows? #1141

Closed: hoang-ho closed this issue 6 years ago

hoang-ho commented 6 years ago

Description

I'm very new to Tensor2Tensor and am trying to use it for text summarization. I ran the following command on a Windows 10 machine:

set CUDA_VISIBLE_DEVICES=0 & python "C:\Users\tuj23380\AppData\Roaming\Python\Python35\site-packages\tensor2tensor-1.9.0-py3.5.egg\tensor2tensor\bin\t2t_trainer.py" --data_dir t2t_data --output_dir t2t_train --problem summarize_cnn_dailymail32k --model transformer --hparams_set transformer_prepend --batch_size 128 --train_steps 100000

and received the following output:

...

INFO:tensorflow:Found unparsed command-line arguments. Checking if any start with --hp_ and interpreting those as hparams settings.
WARNING:tensorflow:Found unknown flag: --batch_size
WARNING:tensorflow:Found unknown flag: 128
WARNING:tensorflow:From C:\Users\tuj23380\AppData\Roaming\Python\Python35\site-packages\tensor2tensor-1.9.0-py3.5.egg\tensor2tensor\utils\trainer_lib.py:199: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_keep_checkpoint_max': 20, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_train_distribute': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_save_summary_steps': 100, 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x0000026FEC78E4E0>, '_master': '', '_eval_distribute': None, '_num_ps_replicas': 0, 't2t_device_info': {'num_async_replicas': 1}, '_tf_random_seed': None, '_task_type': None, '_is_chief': True, '_save_checkpoints_steps': 1000, 'use_tpu': False, '_log_step_count_steps': 100, '_save_checkpoints_secs': None, '_protocol': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000026FEC78E518>, '_model_dir': 't2t_train'}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x0000026FEC83E6A8>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
INFO:tensorflow:Reading data files from t2t_data\summarize_cnn_dailymail32k-train*
INFO:tensorflow:partition: 0 num_data_files: 100
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_24_512.bottom
WARNING:tensorflow:From C:\Users\tuj23380\AppData\Roaming\Python\Python35\site-packages\tensorflow\python\framework\function.py:988: calling Graph.create_op (from tensorflow.python.framework.ops) with compute_shapes is deprecated and will be removed in a future version.
Instructions for updating:
Shapes are always computed; don't use the compute_shapes as it has no effect.
INFO:tensorflow:Transforming 'targets' with symbol_modality_24_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_24_512.top
INFO:tensorflow:Base learning rate: 0.200000
INFO:tensorflow:Trainable Variables Total size: 44132352
INFO:tensorflow:Using optimizer Adam
C:\Users\tuj23380\AppData\Roaming\Python\Python35\site-packages\tensorflow\python\ops\gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-10-12 17:16:49.875103: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-10-12 17:16:50.181600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:65:00.0
totalMemory: 8.00GiB freeMemory: 6.59GiB
2018-10-12 17:16:50.185250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-12 17:16:51.054320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 17:16:51.056696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-12 17:16:51.058678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-12 17:16:51.060683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7782 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
2018-10-12 17:16:51.067800: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.60G (8160437760 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2018-10-12 17:16:51.070589: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 6.84G (7344393728 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
INFO:tensorflow:Restoring parameters from t2t_train\model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into t2t_train\model.ckpt.
INFO:tensorflow:Reading data files from t2t_data\summarize_cnn_dailymail32k-dev*
INFO:tensorflow:partition: 0 num_data_files: 10
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'eval'
INFO:tensorflow:Setting hparams.relu_dropout to 0.0
INFO:tensorflow:Setting hparams.dropout to 0.0
INFO:tensorflow:Setting hparams.label_smoothing to 0.0
INFO:tensorflow:Setting hparams.layer_prepostprocess_dropout to 0.0
INFO:tensorflow:Setting hparams.symbol_dropout to 0.0
INFO:tensorflow:Setting hparams.attention_dropout to 0.0
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_24_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_24_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_24_512.top
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-12-21:17:26
INFO:tensorflow:Graph was finalized.
2018-10-12 17:17:27.186063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-12 17:17:27.190329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 17:17:27.195011: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-12 17:17:27.198137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-12 17:17:27.200000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7782 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from t2t_train\model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-10-12-21:17:30
INFO:tensorflow:Saving dict for global step 0: global_step = 0, loss = 0.0, metrics-summarize_cnn_dailymail32k/targets/accuracy = 0.0, metrics-summarize_cnn_dailymail32k/targets/accuracy_per_sequence = 0.0, metrics-summarize_cnn_dailymail32k/targets/accuracy_top5 = 0.0, metrics-summarize_cnn_dailymail32k/targets/approx_bleu_score = 0.0, metrics-summarize_cnn_dailymail32k/targets/neg_log_perplexity = 0.0, metrics-summarize_cnn_dailymail32k/targets/rouge_2_fscore = 0.0, metrics-summarize_cnn_dailymail32k/targets/rouge_L_fscore = 0.0
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 0: t2t_train\model.ckpt-0
INFO:tensorflow:Loss for final step: None.
INFO:tensorflow:Reading data files from t2t_data\summarize_cnn_dailymail32k-dev*
INFO:tensorflow:partition: 0 num_data_files: 10
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Setting T2TModel mode to 'eval'
INFO:tensorflow:Setting hparams.relu_dropout to 0.0
INFO:tensorflow:Setting hparams.dropout to 0.0
INFO:tensorflow:Setting hparams.label_smoothing to 0.0
INFO:tensorflow:Setting hparams.layer_prepostprocess_dropout to 0.0
INFO:tensorflow:Setting hparams.symbol_dropout to 0.0
INFO:tensorflow:Setting hparams.attention_dropout to 0.0
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with symbol_modality_24_512.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_24_512.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_24_512.top
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-10-12-21:17:41
INFO:tensorflow:Graph was finalized.
2018-10-12 17:17:41.385965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-12 17:17:41.389085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-12 17:17:41.391884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0
2018-10-12 17:17:41.393844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N
2018-10-12 17:17:41.395697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7782 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:65:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from t2t_train\model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-10-12-21:17:45
INFO:tensorflow:Saving dict for global step 0: global_step = 0, loss = 0.0, metrics-summarize_cnn_dailymail32k/targets/accuracy = 0.0, metrics-summarize_cnn_dailymail32k/targets/accuracy_per_sequence = 0.0, metrics-summarize_cnn_dailymail32k/targets/accuracy_top5 = 0.0, metrics-summarize_cnn_dailymail32k/targets/approx_bleu_score = 0.0, metrics-summarize_cnn_dailymail32k/targets/neg_log_perplexity = 0.0, metrics-summarize_cnn_dailymail32k/targets/rouge_2_fscore = 0.0, metrics-summarize_cnn_dailymail32k/targets/rouge_L_fscore = 0.0
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 0: t2t_train\model.ckpt-0

From here, training simply stops. I previously ran the same model on Google Colab, and there it kept running until the very end. One side note: when I use pip install to install tensor2tensor on Windows, no executables for t2t_datagen or t2t_trainer are installed. Hence, I have to use

python "C:\Users\tuj23380\AppData\Roaming\Python\Python35\site-packages\tensor2tensor-1.9.0-py3.5.egg\tensor2tensor\bin\t2t_trainer.py"

to run t2t_trainer. Is this the reason why tensor2tensor doesn't train? Would anyone here please help me with this error? Thank you very much.

stefan-falk commented 6 years ago

@kaihoang I'm definitely no expert on t2t-trainer yet, but I think just running the t2t_trainer.py file should work.

Here is an excerpt from one of my test scripts, where I "inject" command-line args and then just call t2t_trainer.main(None):

import sys
from tensor2tensor.bin import t2t_trainer
from tensor2tensor.utils.metrics import METRICS_FNS

# ...

METRICS_FNS['word_error_rate'] = word_error_rate

if __name__ == '__main__':
    argv = [
        '--generate_data',
        '--problem', 'librispeech',
        '--model', 'transformer',
        # ..
    ]
    sys.argv += argv

    t2t_trainer.main(None)

Have you tried to run other examples on your Windows machine? Maybe try to run the MNIST problem to see if things work.
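
For the command in the original post, the same approach might look roughly like the sketch below. The flag names are the ones from that command; `build_trainer_argv` and `run_trainer` are hypothetical helper names used only for illustration. Note that the log above reports `--batch_size` as an unknown flag, so this sketch assumes the intent was to set the `batch_size` hyperparameter and passes it through `--hparams` instead.

```python
import sys


def build_trainer_argv(data_dir, output_dir, batch_size=None):
    """Build t2t_trainer flags mirroring the command in the original post."""
    argv = [
        '--data_dir', data_dir,
        '--output_dir', output_dir,
        '--problem', 'summarize_cnn_dailymail32k',
        '--model', 'transformer',
        '--hparams_set', 'transformer_prepend',
        '--train_steps', '100000',
    ]
    if batch_size is not None:
        # Assumption: batch_size is meant as a hyperparameter override,
        # since the log shows "--batch_size" rejected as an unknown flag.
        argv.append('--hparams=batch_size=%d' % batch_size)
    return argv


def run_trainer(argv):
    """Inject the flags into sys.argv and start the trainer."""
    # Imported here so the helper above can be used without tensor2tensor.
    from tensor2tensor.bin import t2t_trainer
    sys.argv += argv
    t2t_trainer.main(None)
```

Calling `run_trainer(build_trainer_argv('t2t_data', 't2t_train', batch_size=128))` from a script would then start training without needing the installed t2t_trainer executable.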

guotong1988 commented 6 years ago

It can run on Windows. I tried MNIST. I set is_single_machine = True

stefan-falk commented 6 years ago

@guotong1988 Hmm.. no idea then. Sorry.

hoang-ho commented 6 years ago

Thank you for all the replies. I haven't yet figured out why the model didn't run on my system. It may be due to some issue with how my system is set up. I switched to a Linux system and rebuilt everything. The model works now. Thank you again.

stefan-falk commented 6 years ago

@kaihoang You're welcome. Glad it works now for you.

stefan-falk commented 6 years ago

@kaihoang I have a suspicion: when I generated my own dataset, an exception was thrown. However, under mydataset/data the shards were still created, although without any content. Each file had just 0 bytes. When I started t2t-trainer I saw the same behavior that you described: t2t-trainer starts, creates the model and then stops training without any errors.

If you're still able to do so, you might want to check your dataset files and whether they were generated correctly.
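
A quick way to check for empty shards is to list the files the trainer would read and flag any that are 0 bytes. The sketch below assumes the shard file names start with the problem name followed by a hyphen, as in the `t2t_data\summarize_cnn_dailymail32k-train*` pattern from the log; `find_empty_shards` is a hypothetical helper, not part of tensor2tensor.

```python
import glob
import os


def find_empty_shards(data_dir, problem):
    """Return (all_shards, empty_shards) for files named '<problem>-*'."""
    pattern = os.path.join(data_dir, problem + '-*')
    shards = sorted(glob.glob(pattern))
    # A shard with size 0 was created but never written to.
    empty = [path for path in shards if os.path.getsize(path) == 0]
    return shards, empty


if __name__ == '__main__':
    shards, empty = find_empty_shards('t2t_data', 'summarize_cnn_dailymail32k')
    print('%d shard(s) found, %d empty' % (len(shards), len(empty)))
    for path in empty:
        print('empty shard:', path)
```

If every shard shows up as empty, or no shards are found at all, regenerating the data is probably the first thing to try. A non-zero size does not prove the records are valid, only that something was written.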