Fail to reproduce best performance

kyungeuuun commented 5 years ago

Hello, Li. I ran 'dcrnn-train.py' with the parameter setup described in the paper, but I could not reproduce the best reported performance. Could you point out what I did wrong, or share the detailed options you used? My training log is below.

2019-02-22 16:58:40,796 - INFO - Log directory: data/model
2019-02-22 16:58:40,797 - INFO - {'data': {'val_batch_size': 64, 'test_batch_size': 64, 'batch_size': 64, 'graph_pkl_filename': 'data/sensor_graph/dcrnn/adj_mx.pkl', 'dataset_dir': 'data/METR-LA'}, 'model': {'cl_decay_steps': 2000, 'input_dim': 2, 'l1_decay': 0, 'num_rnn_layers': 2, 'num_nodes': 207, 'filter_type': 'dual_random_walk', 'horizon': 12, 'use_curriculum_learning': True, 'seq_len': 12, 'rnn_units': 64, 'output_dim': 1, 'max_diffusion_step': 3}, 'train': {'optimizer': 'adam', 'epsilon': 0.001, 'dropout': 0, 'model_filename': None, 'epochs': 100, 'patience': 50, 'base_lr': 0.01, 'max_grad_norm': 5, 'min_learning_rate': 2e-06, 'global_step': 0, 'max_to_keep': 100, 'lr_decay_ratio': 0.1, 'epoch': 0, 'test_every_n_epochs': 10, 'steps': [20, 30, 40, 50], 'log_dir': 'data/model'}, 'log_level': 'INFO', 'base_dir': 'data/model'}
2019-02-22 16:58:49,720 - INFO - ('x_val', (3425, 12, 207, 2))
2019-02-22 16:58:49,720 - INFO - ('x_train', (23974, 12, 207, 2))
2019-02-22 16:58:49,720 - INFO - ('x_test', (6850, 12, 207, 2))
2019-02-22 16:58:49,720 - INFO - ('y_val', (3425, 12, 207, 2))
2019-02-22 16:58:49,720 - INFO - ('y_train', (23974, 12, 207, 2))
2019-02-22 16:58:49,720 - INFO - ('y_test', (6850, 12, 207, 2))
2019-02-22 16:59:06,917 - INFO - Total number of trainable parameters: 520960
2019-02-22 16:59:09,019 - INFO - Start training ...
...
2019-02-23 04:12:31,358 - INFO - Epoch [89/100] (0) train_mae: 9.8364, val_mae: 12.8458 lr:0.000002 431.2s
2019-02-23 04:13:29,147 - INFO - Horizon 01, MAE: 13.55, MAPE: 0.3397, RMSE: 16.15
2019-02-23 04:13:29,213 - INFO - Horizon 02, MAE: 12.81, MAPE: 0.3336, RMSE: 15.54
2019-02-23 04:13:29,277 - INFO - Horizon 03, MAE: 12.34, MAPE: 0.3307, RMSE: 15.23
2019-02-23 04:13:29,340 - INFO - Horizon 04, MAE: 12.15, MAPE: 0.3311, RMSE: 15.21
2019-02-23 04:13:29,405 - INFO - Horizon 05, MAE: 12.20, MAPE: 0.3341, RMSE: 15.41
2019-02-23 04:13:29,467 - INFO - Horizon 06, MAE: 12.41, MAPE: 0.3385, RMSE: 15.74
2019-02-23 04:13:29,529 - INFO - Horizon 07, MAE: 12.70, MAPE: 0.3432, RMSE: 16.11
2019-02-23 04:13:29,591 - INFO - Horizon 08, MAE: 13.00, MAPE: 0.3476, RMSE: 16.47
2019-02-23 04:13:29,652 - INFO - Horizon 09, MAE: 13.28, MAPE: 0.3512, RMSE: 16.78
2019-02-23 04:13:29,714 - INFO - Horizon 10, MAE: 13.53, MAPE: 0.3540, RMSE: 17.06
2019-02-23 04:13:29,775 - INFO - Horizon 11, MAE: 13.75, MAPE: 0.3562, RMSE: 17.30
2019-02-23 04:13:29,837 - INFO - Horizon 12, MAE: 13.95, MAPE: 0.3582, RMSE: 17.54
2019-02-23 04:20:40,879 - INFO - Epoch [90/100] (0) train_mae: 9.9064, val_mae: 10.6131 lr:0.000002 431.0s
2019-02-23 04:20:40,879 - WARNING - Early stopping at epoch: 90

liyaguang commented 5 years ago

Hi kyungeuuun, as mentioned in the README, there is a chance that the training loss will explode. The temporary workaround is either to restart from the last model saved before the explosion, or to decrease the learning rate earlier in the learning rate schedule.
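
To make "decrease the learning rate earlier" concrete, here is a minimal sketch of how the schedule in the posted config might behave. It assumes a plain multi-step decay: the rate starts at `base_lr`, is multiplied by `lr_decay_ratio` once for every entry of `steps` that the current epoch has reached, and is floored at `min_learning_rate`. The function name `multistep_lr`, the exact boundary handling (`epoch >= step`), and the `earlier_steps` values are illustrative assumptions, not taken from the repository. The assumption is at least consistent with the log above, which shows `lr:0.000002` at epoch 89.

```python
def multistep_lr(epoch, base_lr=0.01, steps=(20, 30, 40, 50),
                 decay_ratio=0.1, min_lr=2e-6):
    """Learning rate at `epoch` under an ASSUMED multi-step decay.

    Defaults mirror base_lr, steps, lr_decay_ratio and min_learning_rate
    from the config printed in the log; the decay rule itself is a guess.
    """
    # Count how many decay boundaries have been reached so far.
    n_decays = sum(1 for s in steps if epoch >= s)
    # Apply the decay once per boundary, but never go below the floor.
    return max(min_lr, base_lr * decay_ratio ** n_decays)


if __name__ == "__main__":
    default_steps = (20, 30, 40, 50)   # from the posted config
    earlier_steps = (10, 20, 30, 40)   # hypothetical: decay 10 epochs sooner

    print("epoch  default_lr  earlier_lr")
    for epoch in (0, 10, 20, 30, 40, 50, 89):
        lr_default = multistep_lr(epoch, steps=default_steps)
        lr_earlier = multistep_lr(epoch, steps=earlier_steps)
        print(f"{epoch:5d}  {lr_default:10.6f}  {lr_earlier:10.6f}")
```

Under these assumptions, the default schedule keeps the rate at 0.01 for the first 20 epochs, while the shifted one drops it by a factor of ten after 10 epochs. Shortening that high-rate phase is what "earlier" would mean here.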

kyungeuuun commented 5 years ago

Thanks! By the way, are the model parameters set correctly?

liyaguang commented 5 years ago

The default settings in the config file should be okay.

kyungeuuun commented 5 years ago

Thank you for your kindness. I'll try again.

liyaguang / DCRNN

Fail to reproduce best performance #22