tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

T2T stuck on 1st epoch while training mnist using docker on CPU #589

Closed saurabhvyas closed 6 years ago

saurabhvyas commented 6 years ago

Python version: 3.5, TensorFlow: 1.5

t2t-trainer \
  --generate_data \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/mnist \
  --problems=image_mnist \
  --model=shake_shake \
  --hparams_set=shake_shake_quick \
  --train_steps=1000 \
  --eval_steps=100

usr/local/lib/python3.5/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
INFO:tensorflow:Generating data for image_mnist
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/train-images-idx3-ubyte.gz
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/train-labels-idx1-ubyte.gz
INFO:tensorflow:Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /tmp/t2t_datagen/t10k-images-idx3-ubyte.gz
100% completed
INFO:tensorflow:Successfully downloaded t10k-images-idx3-ubyte.gz, 1648877 bytes.
INFO:tensorflow:Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /tmp/t2t_datagen/t10k-labels-idx1-ubyte.gz
100% completed
INFO:tensorflow:Successfully downloaded t10k-labels-idx1-ubyte.gz, 4542 bytes.
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/train-images-idx3-ubyte.gz
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/train-labels-idx1-ubyte.gz
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/t10k-images-idx3-ubyte.gz
INFO:tensorflow:Not downloading, file already found: /tmp/t2t_datagen/t10k-labels-idx1-ubyte.gz
2018-02-15 16:35:09.287963: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
INFO:tensorflow:Shuffling data...
INFO:tensorflow:Found unparsed command-line arguments. Checking if any start with --hp_ and interpreting those as hparams settings.
WARNING:tensorflow:Found unknown flag: 00
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, '_save_checkpoints_steps': 1000, '_evaluation_master': '', '_tf_random_seed': 1234, '_num_ps_replicas': 0, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, 't2t_device_info': {'num_async_replicas': 1}, '_model_dir': '/root/t2t_train/mnist', '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f19a88dfc88>, '_is_chief': True, 'use_tpu': False, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 0, '_log_step_count_steps': 100, '_keep_checkpoint_max': 20, '_environment': 'local', '_save_checkpoints_secs': None, '_master': '', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f19cab75f60>, '_task_id': 0, '_save_summary_steps': 100}
WARNING:tensorflow:Estimator's model_fn (<function T2TModel.make_estimator_model_fn.<locals>.wrapping_model_fn at 0x7f19a88de950>)includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
WARNING:tensorflow:Experiment.continuous_train_and_eval (from tensorflow.contrib.learn.python.learn.experiment) is experimental and may change or be removed at any time, and without warning.
INFO:tensorflow:Training model for 1000 steps
INFO:tensorflow:Reading data files from /root/t2t_data/image_mnist-train*
INFO:tensorflow:partition: 0 num_data_files: 10
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with image_modality.bottom
INFO:tensorflow:Transforming 'targets' with class_label_modality_10_32.targets_bottom
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with class_label_modality_10_32.top
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensor2tensor/layers/modalities.py:464: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensor2tensor/layers/common_layers.py:1707: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

INFO:tensorflow:Base learning rate: 0.500000
INFO:tensorflow:Applying exp learning rate warmup for 100 steps
INFO:tensorflow:Applying learning rate decay: cosine.
INFO:tensorflow:Applying weight decay, decay_rate: 0.00010
INFO:tensorflow:Trainable Variables Total size: 2926698
INFO:tensorflow:Using optimizer Adam
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /root/t2t_train/mnist/model.ckpt.
INFO:tensorflow:loss = 9.228172, step = 1
^CTraceback (most recent call last):

saurabhvyas commented 6 years ago

It turns out I just needed to change the batch size to something suited to my i5 CPU; for example, batch_size=2 works fine, thanks to @martinpopel. You can set it by adding the following argument when running from the terminal: --hparams="batch_size=2"
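
For reference, the full invocation with the batch-size override looks something like this (same flags and paths as my original command above; pick whatever batch_size suits your machine):

t2t-trainer \
  --generate_data \
  --data_dir=~/t2t_data \
  --output_dir=~/t2t_train/mnist \
  --problems=image_mnist \
  --model=shake_shake \
  --hparams_set=shake_shake_quick \
  --hparams="batch_size=2" \
  --train_steps=1000 \
  --eval_steps=100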

stefan-falk commented 5 years ago

@saurabhvyas How does this work? Can I set multiple hparams like that and override some of the values from an hparams_set?

Something like

t2t-trainer \
  --hparams="learning_rate=0.1337,learning_rate_decay_schem=rsqrt_decay"
  # ..
martinpopel commented 5 years ago

Yes, you can set multiple hparams on the command line and override values from the hparams_set this way.
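
For example (the hparams set name and the values below are just illustrative), anything passed via --hparams is applied on top of the named --hparams_set, so the command-line values win:

t2t-trainer \
  --model=transformer \
  --hparams_set=transformer_base \
  --hparams="learning_rate=0.1337,batch_size=2048" \
  --output_dir=~/t2t_train/my_run
  # learning_rate and batch_size here replace the values defined in
  # transformer_base; every other hparam keeps its value from that set.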

stefan-falk commented 5 years ago

Nvm, I used the wrong setting.


@martinpopel Thanks - I'm asking because I can't seem to set the following for the transformer model:

--hparams='learning_rate=0.15,learning_rate_decay_scheme=exp_decay,learning_rate_schedule=exp_decay'

For some reason the learning_rate starts with a value of 1 and remains constant over the training steps. :/