Closed by pbalapra 6 years ago
Hi Prasanna,
Do you mean that the program exits with an error? If so, can you provide the error message?
Besides, log information is printed once per epoch, so you may have to wait a few minutes or longer (depending on your training resources) to see the next message.
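To tell a slow first epoch apart from a real hang, one option is to log progress inside the epoch as well. The snippet below is only a minimal sketch of that idea and is not code from dcrnn_train.py: `run_epoch`, `train_step`, and `log_every` are hypothetical names, and the toy `train_step` stands in for the real training call.

```python
import logging
import time

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(message)s", level=logging.INFO
)
logger = logging.getLogger("progress_demo")


def run_epoch(batches, train_step, log_every=50):
    """Run one epoch and log progress every `log_every` batches.

    `train_step` is a hypothetical callable that takes a batch and
    returns its loss as a float. Returns the mean loss of the epoch,
    or NaN if there were no batches.
    """
    start = time.time()
    losses = []
    for i, batch in enumerate(batches, start=1):
        losses.append(train_step(batch))
        if i % log_every == 0:
            # Intermediate message, so a long epoch is not silent.
            logger.info(
                "batch %d, running mean loss %.4f, %.1fs elapsed",
                i, sum(losses) / len(losses), time.time() - start,
            )
    if not losses:
        return float("nan")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy stand-in for real training: the "loss" shrinks with the batch index.
    mean_loss = run_epoch(range(200), lambda b: 1.0 / (b + 1), log_every=50)
    logger.info("Epoch 0 finished, mean loss %.4f", mean_loss)
```

If messages like these keep appearing, the job is still making progress; if nothing is printed for much longer than one logging interval, it is more likely a real hang.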
Hi Yaguang, you are right. It takes 4 minutes after that step to start Epoch 0 on a Tesla P100. Thanks a lot for your quick response.
Yaguang,
Training starts
python dcrnn_train.py --config_filename=data/model/dcrnn_config.yaml
but it hangs after that. We also tried it on a GPU and found the same issue.
2018-07-26 12:08:16,158 - INFO - Log directory: data/model
2018-07-26 12:08:16,158 - INFO - Loading graph from: data/sensor_graph/adj_mx.pkl
2018-07-26 12:08:16,160 - INFO - Loading traffic data from: data/df_highway_2012_4mon_sample.h5
2018-07-26 12:08:16.407358: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16.407392: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16.407399: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16.407405: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16,409 - INFO - Log directory: data/model/dcrnn_DR_2_h_12_64-64_lr_0.01_bs_64_d_0.00_sl_12_MAE_0726120816/
2018-07-26 12:08:16,410 - INFO - {'base_dir': 'data/model', 'batch_size': 64, 'cl_decay_steps': 2000, 'data_type': 'ALL', 'dropout': 0, 'epoch': 0, 'epochs': 100, 'filter_type': 'dual_random_walk', 'global_step': 0, 'graph_pkl_filename': 'data/sensor_graph/adj_mx.pkl', 'horizon': 12, 'l1_decay': 0, 'learning_rate': 0.01, 'loss_func': 'MAE', 'lr_decay': 0.1, 'lr_decay_epoch': 20, 'lr_decay_interval': 10, 'max_diffusion_step': 2, 'max_grad_norm': 5, 'min_learning_rate': 2e-06, 'null_val': 0, 'num_rnn_layers': 2, 'output_dim': 1, 'patience': 50, 'rnn_units': 64, 'seq_len': 12, 'test_every_n_epochs': 10, 'test_ratio': 0.2, 'use_cpu_only': False, 'use_curriculum_learning': True, 'validation_ratio': 0.1, 'verbose': 0, 'write_db': False}
2018-07-26 12:08:37,392 - INFO - Total number of trainable parameters: 373312
2018-07-26 12:08:37,392 - DEBUG - DCRNN/learning_rate:0, ()
2018-07-26 12:08:37,392 - DEBUG - DCRNN/global_step:0, ()
2018-07-26 12:08:37,392 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/weights:0, (330, 128)
2018-07-26 12:08:37,392 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,392 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights:0, (330, 64)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/weights:0, (640, 128)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights:0, (640, 64)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/weights:0, (330, 128)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights:0, (330, 64)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/weights:0, (640, 128)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights:0, (640, 64)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/projection/w:0, (64, 1)
2018-07-26 12:08:37,395 - DEBUG - Train/DCRNN/beta1_power:0, ()
2018-07-26 12:08:37,395 - DEBUG - Train/DCRNN/beta2_power:0, ()
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam:0, (330, 128)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam_1:0, (330, 128)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam:0, (330, 64)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam_1:0, (330, 64)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam:0, (640, 128)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam_1:0, (640, 128)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam:0, (640, 64)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam_1:0, (640, 64)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam:0, (330, 128)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam_1:0, (330, 128)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam:0, (330, 64)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam_1:0, (330, 64)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam:0, (640, 128)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam_1:0, (640, 128)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam:0, (640, 64)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam_1:0, (640, 64)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/projection/w/Adam:0, (64, 1)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/projection/w/Adam_1:0, (64, 1)
The training hangs here.