tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

Segmentation fault (core dumped) while training Bodo (brx) to English (en) #320

Open sanjibnarzary opened 6 years ago

sanjibnarzary commented 6 years ago

My system configuration

CUDA 9.1, cuDNN 7.1, TensorFlow version 1.8.0-rc0

The system works with the default Vietnamese-English dataset, but while training on the Bodo (brx) to English (en) dataset it crashes with a segmentation fault (core dumped).

I tried various batch sizes (512, 128, 64, 16, 8, 4), but no luck. A quick sanity check I ran on the data files is sketched below, followed by my training log.
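As a basic sanity check on the corpus, I verified that the parallel files have matching line counts, contain no empty lines, and that the vocab files decode as UTF-8. The paths come straight from the command below; the checks themselves are just generic ones I wrote, not anything required by the nmt code:

import io

def count_lines(path):
    # Count total and empty lines in a UTF-8 text file.
    total, empty = 0, 0
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if not line.strip():
                empty += 1
    return total, empty

# Parallel files must line up one-to-one between source and target.
pairs = [
    ("/tmp/nmt_data/train.brx", "/tmp/nmt_data/train.en"),
    ("/tmp/nmt_data/tst2012.brx", "/tmp/nmt_data/tst2012.en"),
    ("/tmp/nmt_data/tst2013.brx", "/tmp/nmt_data/tst2013.en"),
]
for src_path, tgt_path in pairs:
    src_total, src_empty = count_lines(src_path)
    tgt_total, tgt_empty = count_lines(tgt_path)
    print("%s: %d lines, %d empty" % (src_path, src_total, src_empty))
    print("%s: %d lines, %d empty" % (tgt_path, tgt_total, tgt_empty))
    assert src_total == tgt_total, "source/target line counts differ"

# Vocab files should also decode cleanly and have no blank entries.
for vocab_path in ("/tmp/nmt_data/vocab.brx", "/tmp/nmt_data/vocab.en"):
    total, empty = count_lines(vocab_path)
    print("%s: %d entries, %d empty" % (vocab_path, total, empty))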

Log

python -m nmt.nmt --attention=scaled_luong --src=brx --tgt=en --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013 --out_dir=/tmp/nmt_attention_model --num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu --batch_size=64

Job id 0

# Loading hparams from /tmp/nmt_attention_model/hparams
  saving hparams to /tmp/nmt_attention_model/hparams
  saving hparams to /tmp/nmt_attention_model/best_bleu/hparams
  attention=scaled_luong
  attention_architecture=standard
  avg_ckpts=False
  batch_size=32
  beam_width=0
  best_bleu=0
  best_bleu_dir=/tmp/nmt_attention_model/best_bleu
  check_special_token=True
  colocate_gradients_with_ops=True
  decay_scheme=
  dev_prefix=/tmp/nmt_data/tst2012
  dropout=0.2
  embed_prefix=None
  encoder_type=uni
  eos=</s>
  epoch_step=0
  forget_bias=1.0
  infer_batch_size=32
  init_op=uniform
  init_weight=0.1
  learning_rate=1.0
  length_penalty_weight=0.0
  log_device_placement=False
  max_gradient_norm=5.0
  max_train=0
  metrics=['bleu']
  num_buckets=5
  num_decoder_layers=2
  num_decoder_residual_layers=0
  num_embeddings_partitions=0
  num_encoder_layers=2
  num_encoder_residual_layers=0
  num_gpus=1
  num_inter_threads=0
  num_intra_threads=0
  num_keep_ckpts=5
  num_layers=2
  num_train_steps=12000
  num_translations_per_input=1
  num_units=128
  optimizer=sgd
  out_dir=/tmp/nmt_attention_model
  output_attention=True
  override_loaded_hparams=False
  pass_hidden_state=True
  random_seed=None
  residual=False
  sampling_temperature=0.0
  share_vocab=False
  sos=<s>
  src=brx
  src_embed_file=
  src_max_len=50
  src_max_len_infer=None
  src_vocab_file=/tmp/nmt_attention_model/vocab.brx
  src_vocab_size=50003
  steps_per_external_eval=None
  steps_per_stats=100
  subword_option=
  test_prefix=/tmp/nmt_data/tst2013
  tgt=en
  tgt_embed_file=
  tgt_max_len=50
  tgt_max_len_infer=None
  tgt_vocab_file=/tmp/nmt_attention_model/vocab.en
  tgt_vocab_size=33366
  time_major=True
  train_prefix=/tmp/nmt_data/train
  unit_type=lstm
  vocab_prefix=/tmp/nmt_data/vocab
  warmup_scheme=t2t
  warmup_steps=0
# creating train graph ...
  num_layers = 2, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 1  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 0  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  cell 1  LSTM, forget_bias=1  DropoutWrapper, dropout=0.2   DeviceWrapper, device=/gpu:0
  learning_rate=1, warmup_steps=0, warmup_scheme=t2t
  decay_scheme=, start_decay_step=12000, decay_steps 0, decay_factor 1
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (50003, 128), /device:CPU:0
  embeddings/decoder/embedding_decoder:0, (33366, 128), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/memory_layer/kernel:0, (128, 128), 
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (384, 512), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
  dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (256, 128), /device:GPU:0
  dynamic_seq2seq/decoder/output_projection/kernel:0, (128, 33366), 
# creating eval graph ...
  num_layers = 2, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
  cell 1  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
  cell 0  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
  cell 1  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (50003, 128), /device:CPU:0
  embeddings/decoder/embedding_decoder:0, (33366, 128), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/memory_layer/kernel:0, (128, 128), 
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (384, 512), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
  dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (256, 128), /device:GPU:0
  dynamic_seq2seq/decoder/output_projection/kernel:0, (128, 33366), 
# creating infer graph ...
  num_layers = 2, num_residual_layers=0
  cell 0  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
  cell 1  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
  cell 0  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
  cell 1  LSTM, forget_bias=1  DeviceWrapper, device=/gpu:0
# Trainable variables
  embeddings/encoder/embedding_encoder:0, (50003, 128), /device:CPU:0
  embeddings/decoder/embedding_decoder:0, (33366, 128), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/memory_layer/kernel:0, (128, 128), 
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (384, 512), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
  dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
  dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
  dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (256, 128), /device:GPU:0
  dynamic_seq2seq/decoder/output_projection/kernel:0, (128, 33366), 
# log_file=/tmp/nmt_attention_model/log_1525010798
2018-04-29 19:36:40.116752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
totalMemory: 15.77GiB freeMemory: 15.35GiB
2018-04-29 19:36:40.116798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-29 19:36:40.373220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-29 19:36:40.373255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-04-29 19:36:40.373260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-04-29 19:36:40.373525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14866 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2018-04-29 19:36:40.381329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-29 19:36:40.381406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-29 19:36:40.381425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-04-29 19:36:40.381440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-04-29 19:36:40.381992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14866 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2018-04-29 19:36:40.382475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-29 19:36:40.382512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-29 19:36:40.382529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-04-29 19:36:40.382543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-04-29 19:36:40.383018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14866 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
  created train model with fresh parameters, time 0.31s
  created infer model with fresh parameters, time 0.18s
  # 965
    src: जाम्बानि न लुनाय लुथाया , बेनि गोदोनि खुंगिरिफोरनि थासारि रिफिखां होदों ।
    ref: the architecture of the buildings of Chamba reflects the aura of its former rulers .
    nmt: Valleswaram    1 Bhutanatha    1 Bhutanatha    1 'nendrankai   1 'nendrankai   1 say   38 say  38 say  38 say  38 catered      1 catered       1 place 683 Pataliputra 2 charm 92 charm        92 charm        92 charm        92 charm        92 charm92 Ghiyas-ud-din-Khilji 1 Ghiyas-ud-din-Khilji  1 Good  4 Good  4 Panchtarni    2
  created eval model with fresh parameters, time 0.23s
  eval dev: perplexity 33015.80, time 4s, Sun Apr 29 19:36:46 2018.
  eval test: perplexity 33013.43, time 4s, Sun Apr 29 19:36:51 2018.
2018-04-29 19:36:51.722019: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /tmp/nmt_attention_model/vocab.en is already initialized.
2018-04-29 19:36:51.722152: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /tmp/nmt_attention_model/vocab.en is already initialized.
2018-04-29 19:36:51.722826: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /tmp/nmt_attention_model/vocab.brx is already initialized.
  created infer model with fresh parameters, time 0.14s
# Start step 0, lr 1, Sun Apr 29 19:36:51 2018
# Init train iterator, skipping 0 elements
Segmentation fault (core dumped)
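If it helps, I can re-run the same command with CPython's faulthandler enabled (a generic interpreter option, nothing specific to nmt) to see whether a Python-level traceback is printed before the crash:

python -X faulthandler -m nmt.nmt --attention=scaled_luong --src=brx --tgt=en --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013 --out_dir=/tmp/nmt_attention_model --num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu --batch_size=64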
jigyasa06 commented 6 years ago

Did you get any solution? I am also facing the same problem.