liyaguang / DCRNN

Implementation of Diffusion Convolutional Recurrent Neural Network in Tensorflow
MIT License

Training hangs #5

Closed pbalapra closed 6 years ago

pbalapra commented 6 years ago

Yaguang,

Training starts when we run
python dcrnn_train.py --config_filename=data/model/dcrnn_config.yaml

but it hangs after that. We also tried it on a GPU and saw the same issue.

2018-07-26 12:08:16,158 - INFO - Log directory: data/model
2018-07-26 12:08:16,158 - INFO - Loading graph from: data/sensor_graph/adj_mx.pkl
2018-07-26 12:08:16,160 - INFO - Loading traffic data from: data/df_highway_2012_4mon_sample.h5
2018-07-26 12:08:16.407358: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16.407392: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16.407399: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16.407405: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-07-26 12:08:16,409 - INFO - Log directory: data/model/dcrnn_DR_2_h_12_64-64_lr_0.01_bs_64_d_0.00_sl_12_MAE_0726120816/
2018-07-26 12:08:16,410 - INFO - {'base_dir': 'data/model', 'batch_size': 64, 'cl_decay_steps': 2000, 'data_type': 'ALL', 'dropout': 0, 'epoch': 0, 'epochs': 100, 'filter_type': 'dual_random_walk', 'global_step': 0, 'graph_pkl_filename': 'data/sensor_graph/adj_mx.pkl', 'horizon': 12, 'l1_decay': 0, 'learning_rate': 0.01, 'loss_func': 'MAE', 'lr_decay': 0.1, 'lr_decay_epoch': 20, 'lr_decay_interval': 10, 'max_diffusion_step': 2, 'max_grad_norm': 5, 'min_learning_rate': 2e-06, 'null_val': 0, 'num_rnn_layers': 2, 'output_dim': 1, 'patience': 50, 'rnn_units': 64, 'seq_len': 12, 'test_every_n_epochs': 10, 'test_ratio': 0.2, 'use_cpu_only': False, 'use_curriculum_learning': True, 'validation_ratio': 0.1, 'verbose': 0, 'write_db': False}
2018-07-26 12:08:37,392 - INFO - Total number of trainable parameters: 373312
2018-07-26 12:08:37,392 - DEBUG - DCRNN/learning_rate:0, ()
2018-07-26 12:08:37,392 - DEBUG - DCRNN/global_step:0, ()
2018-07-26 12:08:37,392 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/weights:0, (330, 128)
2018-07-26 12:08:37,392 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,392 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights:0, (330, 64)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/weights:0, (640, 128)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights:0, (640, 64)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,393 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/weights:0, (330, 128)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights:0, (330, 64)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/weights:0, (640, 128)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/biases:0, (128,)
2018-07-26 12:08:37,394 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights:0, (640, 64)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases:0, (64,)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/projection/w:0, (64, 1)
2018-07-26 12:08:37,395 - DEBUG - Train/DCRNN/beta1_power:0, ()
2018-07-26 12:08:37,395 - DEBUG - Train/DCRNN/beta2_power:0, ()
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam:0, (330, 128)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam_1:0, (330, 128)
2018-07-26 12:08:37,395 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam:0, (330, 64)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam_1:0, (330, 64)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam:0, (640, 128)
2018-07-26 12:08:37,396 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam_1:0, (640, 128)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam:0, (640, 64)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam_1:0, (640, 64)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,397 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam:0, (330, 128)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/weights/Adam_1:0, (330, 128)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam:0, (330, 64)
2018-07-26 12:08:37,398 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/weights/Adam_1:0, (330, 64)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_0/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam:0, (640, 128)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/weights/Adam_1:0, (640, 128)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam:0, (128,)
2018-07-26 12:08:37,399 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/gates/biases/Adam_1:0, (128,)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam:0, (640, 64)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/weights/Adam_1:0, (640, 64)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam:0, (64,)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/candidate/biases/Adam_1:0, (64,)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/projection/w/Adam:0, (64, 1)
2018-07-26 12:08:37,400 - DEBUG - DCRNN/DCRNN/DCRNN_SEQ/rnn_decoder/multi_rnn_cell/cell_1/dcgru_cell/projection/w/Adam_1:0, (64, 1)

The training hangs here.
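
One way to check that the graph was built as expected before concluding the run is stuck is to re-add the shapes from the DEBUG lines and compare the sum with the reported total. The sketch below is only an arithmetic check on the shapes copied from the log above; the dictionary layout and variable names are illustrative and are not taken from the DCRNN code. It assumes the reported total covers the weight, bias, and projection variables only, which is what the sum suggests.

```python
from math import prod  # requires Python 3.8+

# Shapes copied from the DEBUG lines in the log. The encoder ("rnn") and
# decoder ("rnn_decoder") stacks list the same eight variables.
stack_shapes = [
    (330, 128), (128,),  # cell_0 gates: weights, biases
    (330, 64), (64,),    # cell_0 candidate: weights, biases
    (640, 128), (128,),  # cell_1 gates: weights, biases
    (640, 64), (64,),    # cell_1 candidate: weights, biases
]
projection_shape = (64, 1)  # decoder cell_1 projection/w

per_stack = sum(prod(shape) for shape in stack_shapes)
total = 2 * per_stack + prod(projection_shape)

print(per_stack)  # parameters in one stack
print(total)      # compare with "Total number of trainable parameters"
```

The sum comes to 373312, the same figure the log reports, so the variables listed with `Adam`, `Adam_1`, `beta1_power`, and `beta2_power` in their names do not appear to be included in that count.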

liyaguang commented 6 years ago

Hi Prasanna,

Do you mean that the program exits with an error? If so, can you provide the error message?

Also, log information is printed only once per epoch, so you may have to wait a few minutes or longer (depending on your training resources) to see the next message.
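
If the gap between epoch-level messages makes a run look stuck, intermediate progress lines inside the epoch loop can help. The following is a minimal, generic sketch and not the DCRNN training code: `run_epoch`, `train_step`, and `log_every` are hypothetical names, and `train_step` stands in for whatever function runs one batch and returns its loss.

```python
import logging
import time


def run_epoch(batches, train_step, log_every=50, logger=None):
    """Run one epoch and log progress every `log_every` batches.

    `train_step` is a placeholder callable: it takes one batch and
    returns that batch's loss as a float.
    """
    logger = logger or logging.getLogger(__name__)
    start = time.time()
    total_loss = 0.0
    num_batches = 0
    for step, batch in enumerate(batches, start=1):
        total_loss += train_step(batch)
        num_batches = step
        if step % log_every == 0:
            # Intermediate message so a long epoch does not look like a hang.
            logger.info("step %d, mean loss %.4f, %.1fs elapsed",
                        step, total_loss / step, time.time() - start)
    # Mean loss over the epoch (0.0 if there were no batches).
    return total_loss / max(num_batches, 1)
```

For example, `run_epoch(batches, train_step, log_every=10)` would emit a message every ten batches once `logging.basicConfig(level=logging.INFO)` has been called.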

pbalapra commented 6 years ago

Hi Yaguang, you are right. It takes 4 minutes after that step to start Epoch 0 on a Tesla P100. Thanks a lot for your quick response.