tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Transformer training stuck at the beginning of the run #1120

Open Esaada opened 6 years ago

Esaada commented 6 years ago

Description

I am trying to train a Transformer model. After following the instructions, I ran the training command, and training got stuck at the very beginning, in the "saving checkpoint" phase:

INFO:tensorflow:Saving checkpoints for 0 into ./t2t_train/librispeech/transformer-transformer_base_single_gpu/model.ckpt.

By "stuck" I mean it has been in that phase for 24 hours.

Environment information

OS: Linux Ubuntu 16.04

tensor2tensor==1.8.0
tensorboard==1.10.0
tensorflow==1.10.0
tensorflow-gpu==1.0.1
tensorpack==0.3.0

Python 2.7.12

Steps to reproduce

PROBLEM=librispeech
MODEL=transformer
HPARAMS=transformer_base_single_gpu
DATA_DIR=./t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=./t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# Generate data
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

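# Train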
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR

Error logs
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensor2tensor/utils/trainer_lib.py:198: __init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
INFO:tensorflow:schedule=continuous_train_and_eval
INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f3162074750>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.95
}
allow_soft_placement: true
graph_options {
  optimizer_options {
  }
}
, 'use_tpu': False, '_tf_random_seed': None, '_device_fn': None, '_num_worker_replicas': 0, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', 't2t_device_info': {'num_async_replicas': 1}, '_num_ps_replicas': 0, '_train_distribute': None, '_is_chief': True, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_save_checkpoints_steps': 1000, '_environment': 'local', '_master': '', '_model_dir': './t2t_train/librispeech/transformer-transformer_base_single_gpu', 'data_parallelism': <tensor2tensor.utils.expert_utils.Parallelism object at 0x7f3162074790>, '_save_summary_steps': 100}
WARNING:tensorflow:Estimator's model_fn (<function wrapping_model_fn at 0x7f3161ff3488>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:ValidationMonitor only works with --schedule=train_and_evaluate
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 1000 or save_checkpoints_secs None.
INFO:tensorflow:Reading data files from ./t2t_data/librispeech-train*
INFO:tensorflow:partition: 0 num_data_files: 100
WARNING:tensorflow:Shapes are not fully defined. Assuming batch_size means tokens.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Unsetting shared_embedding_and_softmax_weights.
INFO:tensorflow:Setting T2TModel mode to 'train'
INFO:tensorflow:Using variable initializer: uniform_unit_scaling
INFO:tensorflow:Transforming feature 'inputs' with speech_recognition_modality.bottom
INFO:tensorflow:Transforming 'targets' with symbol_modality_256_512.targets_bottom
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/function.py:986: calling create_op (from tensorflow.python.framework.ops) with compute_shapes is deprecated and will be removed in a future version.
Instructions for updating:
Shapes are always computed; don't use the compute_shapes as it has no effect.
INFO:tensorflow:Building model body
INFO:tensorflow:Transforming body output with symbol_modality_256_512.top
INFO:tensorflow:Base learning rate: 2.000000
INFO:tensorflow:Trainable Variables Total size: 48270208
INFO:tensorflow:Using optimizer Adam
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-10-09 08:47:30.831895: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Restoring parameters from ./t2t_train/librispeech/transformer-transformer_base_single_gpu/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./t2t_train/librispeech/transformer-transformer_base_single_gpu/model.ckpt.
maninet commented 6 years ago

See #1088.

HPARAMS should be one of these:
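One way to list the hparams sets registered in your installed version (a sketch, assuming the --registry_help flag that t2t-trainer had around 1.x, which dumps all registered models, problems, and hparams sets and exits):

# Dump the registry and filter for hparams set names; the grep pattern is illustrative.
t2t-trainer --registry_help 2>&1 | grep -i hparams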

Esaada commented 6 years ago

First of all, thanks, it worked. But it doesn't run on my GPU (and I'm sure I have CUDA installed), so what am I doing wrong?!

Even though I got these lines:

INFO:tensorflow:worker_gpu=1
INFO:tensorflow:sync=False
WARNING:tensorflow:Schedule=continuous_train_and_eval. Assuming that training is running on a single machine.
INFO:tensorflow:datashard_devices: ['gpu:0']
INFO:tensorflow:caching_devices: None
INFO:tensorflow:ps_devices: ['gpu:0']
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_keep_checkpoint_max': 20, '_task_type': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9f5dd6a750>, '_keep_checkpoint_every_n_hours': 10000, '_session_config': gpu_options { per_process_gpu_memory_fraction: 0.95

The reason I suspect the GPU is not being used is that when I run nvidia-smi I see almost no GPU memory in use (289MiB / 122285MiB), and CPU usage is very high.
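A quick way to confirm which devices TensorFlow actually sees (a sketch for TF 1.x; if only a CPU device shows up, the CPU-only tensorflow package is the one being imported, not tensorflow-gpu):

# Lists the local devices TF 1.x can use; a working GPU setup shows a /device:GPU:0 entry.
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"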

maninet commented 6 years ago

You can run "pip uninstall tensorflow" to remove the CPU-only 1.10.0 package, then update your tensorflow-gpu to the newest version.
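In shell terms, roughly (a sketch; package names as they were on PyPI for TF 1.x):

# Remove the CPU-only package so it no longer takes precedence over the GPU build...
pip uninstall -y tensorflow
# ...then upgrade to the latest GPU build.
pip install --upgrade tensorflow-gpu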

Esaada commented 6 years ago

Thanks, that worked too. I know because now I have new and bigger problems. First I got this:

E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 1.25G (1339080448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

The run didn't crash, and I got this:

E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

It still didn't crash, and in the end I got this:

E tensorflow/stream_executor/cuda/cuda_blas.cc:464] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

and this, right after:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:353] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

And then it crashed. I looked online and still can't find a helpful solution. I decreased my batch size to 1.
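For reference, with t2t-trainer the batch size can be overridden as an hparam, and the GPU memory fraction is a trainer flag (a sketch based on the t2t ~1.x flags visible in the log above; the values are illustrative, not recommendations):

# Lower batch_size and leave some GPU memory headroom for the cuBLAS/cuDNN
# handles, which can fail to initialize when the memory fraction is too high.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --hparams='batch_size=512' \
  --worker_gpu_memory_fraction=0.8 \
  --output_dir=$TRAIN_DIR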

DevLob-zz commented 4 years ago

i was suffering from the same issue for a many weeks and finally

i got that

i ran using window10 and windows server using powershell " .\nvidia-smi -q -i 0 -d SUPPORTED_CLOCKS" and see that NVidia driver is using Cuda 10.2 when try to downgrade my NVIDIA DRIVER to Cuda 10.1 or CUDA 10.0 it finally worked seem there is an issue with Supported CUDA10.2
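A quick way to see the mismatch (a sketch; on recent drivers the "CUDA Version" field in the nvidia-smi banner is the highest CUDA release the driver supports, not what is installed):

# Highest CUDA version the driver supports (shown in the top banner).
nvidia-smi
# CUDA toolkit version actually installed, if nvcc is on PATH.
nvcc --version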