hclhkbu / dlbench

Benchmarking State-of-the-Art Deep Learning Software Tools
http://dlbench.comp.hkbu.edu.hk/
MIT License

[BUG] Learning rate is not passed to network scripts #22

Open shishaochen opened 6 years ago

shishaochen commented 6 years ago

From benchmark.py and configs/*.config, we know dlbench provides the capability of changing the learning rate.
However, only Caffe, Torch, and MXNet accept the learning-rate argument, while CNTK and TensorFlow ignore it.

# tools/cntk/cntkbm.py has no lr argument defined.
# tools/tensorflow/tensorflow.py has no lr argument defined.
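
For illustration, wiring the parameter through could look like the sketch below (the option name, default, and flow are assumptions for illustration, not the repo's actual code):

import argparse

parser = argparse.ArgumentParser(description='dlbench tool wrapper')
parser.add_argument('--lr', type=float, default=0.01,
                    help='initial learning rate forwarded from benchmark.py')
args, _ = parser.parse_known_args()
# ... then pass args.lr into the training script instead of a hard-coded constant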

Furthermore, the learning rate does not behave the same across tools when running the benchmark. For example, TensorFlow uses a constant value while MXNet's learning rate changes during training.

# From tools/mxnet/common/fit.py
steps = [epoch_size * (x-begin_epoch) for x in step_epochs if x-begin_epoch > 0] # Default value of step_epochs is '200,250' in tools/mxnet/train_cifar10.py
return (lr, mx.lr_scheduler.MultiFactorScheduler(step=steps, factor=args.lr_factor))
......
optimizer_params = {'learning_rate': lr,
            'momentum' : args.mom,
            'wd' : args.wd,
            'lr_scheduler': lr_scheduler} # This scheduler will change learning rate during training
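
To see concretely what that scheduler does: MultiFactorScheduler multiplies the base learning rate by factor each time the update count passes an entry in step. A minimal sketch (using the default step values '200,250' directly as update counts, which is a simplification):

import mxnet as mx

sched = mx.lr_scheduler.MultiFactorScheduler(step=[200, 250], factor=0.1)
sched.base_lr = 0.05
print(sched(100))  # 0.05   -- before the first boundary, lr is unchanged
print(sched(201))  # 0.005  -- past update 200, lr is multiplied by factor
print(sched(251))  # 0.0005 -- past update 250, multiplied again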

Please make all tools support the learning-rate parameter, or just delete the learning rate from the config.

shyhuai commented 6 years ago

The learning-rate schedule of MXNet is not used. Please check the code: https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L8. The lr_factor parameter is set to None. For CNTK and TF, we set the learning rate to be fixed.
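
For reference, the intent is roughly the following guard (paraphrased from memory, not the verbatim source): when lr_factor is unset or at least 1, no scheduler is returned and the learning rate stays fixed.

# Paraphrased sketch of the check in fit.py (an assumption, not verbatim)
def _get_lr_scheduler(args, epoch_size):
    if args.lr_factor is None or args.lr_factor >= 1:
        return (args.lr, None)  # no scheduler: constant learning rate
    # ... otherwise build a MultiFactorScheduler as shown above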

shishaochen commented 6 years ago

@shyhuai But you set the default value of lr_factor to 0.1 at https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/common/fit.py#L63.

train.add_argument('--lr-factor', type=float, default=0.1, help='the ratio to reduce lr on each step')

So, whether or not we explicitly pass lr_factor on the command line, argparse.ArgumentParser will always set it. Check the log of the MXNet MNIST run:

INFO:root:start with arguments Namespace(batch_size=1024, data_dir='/home/shaocs/dlbench/dataset/mxnet/mnist', disp_batches=100, gpus='0', kv_store='device', load_epoch=None, lr=0.05, lr_factor=0.1, lr_step_epochs='10', model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=60000, num_layers=None, num_nodes=1, optimizer='sgd', test_io=0, top_k=0, wd=1e-05)
......
INFO:root:Update[586]: Change learning rate to 5.00000e-03 # This is printed after 8 epochs
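
The default-filling behavior is plain Python standard-library behavior and is easy to verify outside the repo:

import argparse

p = argparse.ArgumentParser()
p.add_argument('--lr-factor', type=float, default=0.1)
print(p.parse_args([]))  # Namespace(lr_factor=0.1) -- set even with no flags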

shyhuai commented 6 years ago

@shishaochen Thanks for your feedback. Since we set lr_factor=1 in the mxnetbm.py script, the learning rate will not be changed during training. If you use the mxnetbm.py script, there should be no problem. Here is a log for your reference: http://dlbench.comp.hkbu.edu.hk/logs/?f=mxnet-fc-fcn5-gpu0-K80-b512-Tue_Mar__7_10:52:06_2017-gpu20.log. To avoid misunderstanding, I have revised the code to set the default value to None. Thanks again for your report.
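
That is consistent with how the scheduler behaves: with factor=1 every boundary multiplies the learning rate by 1, so it never changes. A quick check (step values chosen arbitrarily for illustration):

import mxnet as mx

sched = mx.lr_scheduler.MultiFactorScheduler(step=[200, 250], factor=1)
sched.base_lr = 0.05
print(sched(300))  # 0.05 -- factor=1 leaves the learning rate untouched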

shishaochen commented 6 years ago

@shyhuai Sorry, I cannot find "factor" being set in https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/mxnetbm.py. Maybe you set it locally but the change has not been committed yet.