Open sanjibnarzary opened 6 years ago
CUDA: 9.1, libcuDNN: 7.1, TensorFlow version: '1.8.0-rc0'
The system works with the default Vietnamese-to-English dataset, but training on a Bodo-English dataset crashes with a segmentation fault (core dumped).
I tried various batch sizes (512, 128, 64, 16, 8, 4) with no luck. Here is the training command and its log:
python -m nmt.nmt --attention=scaled_luong --src=brx --tgt=en --vocab_prefix=/tmp/nmt_data/vocab --train_prefix=/tmp/nmt_data/train --dev_prefix=/tmp/nmt_data/tst2012 --test_prefix=/tmp/nmt_data/tst2013 --out_dir=/tmp/nmt_attention_model --num_train_steps=12000 --steps_per_stats=100 --num_layers=2 --num_units=128 --dropout=0.2 --metrics=bleu --batch_size=64
# Loading hparams from /tmp/nmt_attention_model/hparams
saving hparams to /tmp/nmt_attention_model/hparams
saving hparams to /tmp/nmt_attention_model/best_bleu/hparams
attention=scaled_luong
attention_architecture=standard
avg_ckpts=False
batch_size=32
beam_width=0
best_bleu=0
best_bleu_dir=/tmp/nmt_attention_model/best_bleu
check_special_token=True
colocate_gradients_with_ops=True
decay_scheme=
dev_prefix=/tmp/nmt_data/tst2012
dropout=0.2
embed_prefix=None
encoder_type=uni
eos=</s>
epoch_step=0
forget_bias=1.0
infer_batch_size=32
init_op=uniform
init_weight=0.1
learning_rate=1.0
length_penalty_weight=0.0
log_device_placement=False
max_gradient_norm=5.0
max_train=0
metrics=['bleu']
num_buckets=5
num_decoder_layers=2
num_decoder_residual_layers=0
num_embeddings_partitions=0
num_encoder_layers=2
num_encoder_residual_layers=0
num_gpus=1
num_inter_threads=0
num_intra_threads=0
num_keep_ckpts=5
num_layers=2
num_train_steps=12000
num_translations_per_input=1
num_units=128
optimizer=sgd
out_dir=/tmp/nmt_attention_model
output_attention=True
override_loaded_hparams=False
pass_hidden_state=True
random_seed=None
residual=False
sampling_temperature=0.0
share_vocab=False
sos=<s>
src=brx
src_embed_file=
src_max_len=50
src_max_len_infer=None
src_vocab_file=/tmp/nmt_attention_model/vocab.brx
src_vocab_size=50003
steps_per_external_eval=None
steps_per_stats=100
subword_option=
test_prefix=/tmp/nmt_data/tst2013
tgt=en
tgt_embed_file=
tgt_max_len=50
tgt_max_len_infer=None
tgt_vocab_file=/tmp/nmt_attention_model/vocab.en
tgt_vocab_size=33366
time_major=True
train_prefix=/tmp/nmt_data/train
unit_type=lstm
vocab_prefix=/tmp/nmt_data/vocab
warmup_scheme=t2t
warmup_steps=0
# creating train graph ...
num_layers = 2, num_residual_layers=0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DropoutWrapper, dropout=0.2 DeviceWrapper, device=/gpu:0
learning_rate=1, warmup_steps=0, warmup_scheme=t2t
decay_scheme=, start_decay_step=12000, decay_steps 0, decay_factor 1
# Trainable variables
embeddings/encoder/embedding_encoder:0, (50003, 128), /device:CPU:0
embeddings/decoder/embedding_decoder:0, (33366, 128), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (128, 128),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (384, 512), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (256, 128), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (128, 33366),
# creating eval graph ...
num_layers = 2, num_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
# Trainable variables
embeddings/encoder/embedding_encoder:0, (50003, 128), /device:CPU:0
embeddings/decoder/embedding_decoder:0, (33366, 128), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (128, 128),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (384, 512), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (256, 128), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (128, 33366),
# creating infer graph ...
num_layers = 2, num_residual_layers=0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 0 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
cell 1 LSTM, forget_bias=1 DeviceWrapper, device=/gpu:0
# Trainable variables
embeddings/encoder/embedding_encoder:0, (50003, 128), /device:CPU:0
embeddings/decoder/embedding_decoder:0, (33366, 128), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/encoder/rnn/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/memory_layer/kernel:0, (128, 128),
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0, (384, 512), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_0/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/kernel:0, (256, 512), /device:GPU:0
dynamic_seq2seq/decoder/attention/multi_rnn_cell/cell_1/basic_lstm_cell/bias:0, (512,), /device:GPU:0
dynamic_seq2seq/decoder/attention/luong_attention/attention_g:0, (), /device:GPU:0
dynamic_seq2seq/decoder/attention/attention_layer/kernel:0, (256, 128), /device:GPU:0
dynamic_seq2seq/decoder/output_projection/kernel:0, (128, 33366),
# log_file=/tmp/nmt_attention_model/log_1525010798
2018-04-29 19:36:40.116752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
totalMemory: 15.77GiB freeMemory: 15.35GiB
2018-04-29 19:36:40.116798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-29 19:36:40.373220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-29 19:36:40.373255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-29 19:36:40.373260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-29 19:36:40.373525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14866 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2018-04-29 19:36:40.381329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-29 19:36:40.381406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-29 19:36:40.381425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-29 19:36:40.381440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-29 19:36:40.381992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14866 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
2018-04-29 19:36:40.382475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-04-29 19:36:40.382512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-29 19:36:40.382529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-04-29 19:36:40.382543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-04-29 19:36:40.383018: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14866 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)
created train model with fresh parameters, time 0.31s
created infer model with fresh parameters, time 0.18s
# 965
src: जाम्बानि न लुनाय लुथाया , बेनि गोदोनि खुंगिरिफोरनि थासारि रिफिखां होदों ।
ref: the architecture of the buildings of Chamba reflects the aura of its former rulers .
nmt: Valleswaram 1 Bhutanatha 1 Bhutanatha 1 'nendrankai 1 'nendrankai 1 say 38 say 38 say 38 say 38 catered 1 catered 1 place 683 Pataliputra 2 charm 92 charm 92 charm 92 charm 92 charm 92 charm92 Ghiyas-ud-din-Khilji 1 Ghiyas-ud-din-Khilji 1 Good 4 Good 4 Panchtarni 2
created eval model with fresh parameters, time 0.23s
eval dev: perplexity 33015.80, time 4s, Sun Apr 29 19:36:46 2018.
eval test: perplexity 33013.43, time 4s, Sun Apr 29 19:36:51 2018.
2018-04-29 19:36:51.722019: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /tmp/nmt_attention_model/vocab.en is already initialized.
2018-04-29 19:36:51.722152: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /tmp/nmt_attention_model/vocab.en is already initialized.
2018-04-29 19:36:51.722826: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /tmp/nmt_attention_model/vocab.brx is already initialized.
created infer model with fresh parameters, time 0.14s
# Start step 0, lr 1, Sun Apr 29 19:36:51 2018
# Init train iterator, skipping 0 elements
Segmentation fault (core dumped)
Did you find a solution? I am also facing the same problem.
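Since the same setup trains on the default dataset and crashes only with the Bodo-English data, one way to narrow things down is to sanity-check the data files before training. The following is a minimal sketch, not a confirmed fix: it assumes the files follow the `<prefix>.<lang>` naming implied by the command above (for example `/tmp/nmt_data/train.brx`), and the helper names and the script name are hypothetical. It looks for mismatched line counts, empty lines, and vocabulary lines that contain whitespace. The last check is motivated by the sample `nmt:` output in the log, where tokens are followed by numbers such as `place 683`; that may mean the vocab file holds a count next to each token, but this is only a guess.

```python
import sys
from collections import Counter


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def check_parallel(src_lines, tgt_lines):
    """Return a list of problems found in a pair of parallel text files."""
    problems = []
    if len(src_lines) != len(tgt_lines):
        problems.append(
            f"line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    for name, lines in (("source", src_lines), ("target", tgt_lines)):
        empty = [i for i, line in enumerate(lines, start=1) if not line.strip()]
        if empty:
            problems.append(f"{name}: {len(empty)} empty line(s), first at line {empty[0]}")
    return problems


def check_vocab(vocab_lines):
    """Return a list of problems found in a one-token-per-line vocab file."""
    problems = []
    empty = [i for i, tok in enumerate(vocab_lines, start=1) if not tok.strip()]
    if empty:
        problems.append(f"{len(empty)} empty vocab line(s), first at line {empty[0]}")
    # A line such as "place 683" (token plus count) splits into two fields.
    spaced = [i for i, tok in enumerate(vocab_lines, start=1) if len(tok.split()) > 1]
    if spaced:
        problems.append(
            f"{len(spaced)} vocab line(s) contain whitespace, first at line {spaced[0]}"
        )
    dupes = [tok for tok, n in Counter(vocab_lines).items() if n > 1]
    if dupes:
        problems.append(f"{len(dupes)} duplicated vocab entries, e.g. {dupes[0]!r}")
    return problems


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: python check_nmt_data.py /tmp/nmt_data brx en
    data_dir, src, tgt = sys.argv[1:4]
    for split in ("train", "tst2012", "tst2013"):
        found = check_parallel(
            read_lines(f"{data_dir}/{split}.{src}"),
            read_lines(f"{data_dir}/{split}.{tgt}"),
        )
        print(split, found or "ok")
    for lang in (src, tgt):
        print(f"vocab.{lang}", check_vocab(read_lines(f"{data_dir}/vocab.{lang}")) or "ok")
```

If every check prints `ok`, the data layout is probably not the culprit and the crash is more likely to sit in the TensorFlow, CUDA, or cuDNN combination listed at the top of the issue.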