hclhkbu / dlbench

Benchmarking State-of-the-Art Deep Learning Software Tools
http://dlbench.comp.hkbu.edu.hk/
MIT License
170 stars 47 forks

[BUG] Epoch size's change results in wrong performance result #21

Open shishaochen opened 6 years ago

shishaochen commented 6 years ago

From benchmark.py and configs/*.config, we can see that dlbench provides the capability of changing the epoch size if we want a quick test without going through the full dataset.

# From configs/tensorflowbm6_gpu21.config
fc; fcn5; 1; 1; 4096; 40; 60000; 0.05 # If we change 60000, the speed result for MXNet will be wrong
cnn; resnet; 1; 1; 128; 40; 50000; 0.01 # No matter what the epoch size is, the experiments always run on the full dataset

However, the LSTM scripts of MXNet, CNTK, and TensorFlow do not accept the epoch size as a command-line argument. That is to say, no matter what epoch size we set in the config, these scripts still train on the full dataset.

# From tools/tensorflow/rnn/lstm/reader.py
raw_data = reader.ptb_raw_data(FLAGS.data_path) # No epoch size is used
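A minimal sketch of the kind of fix being requested: truncate the training data to a configured epoch size before building batches. The function name and the assumption that the PTB training data is a flat list of word ids are illustrative, not the actual dlbench API.

```python
def truncate_epoch(train_data, epoch_size):
    """Keep at most `epoch_size` samples so a quick test does not
    have to iterate over the full dataset.

    If `epoch_size` is None or not smaller than the dataset,
    the data is returned unchanged.
    """
    if epoch_size is not None and epoch_size < len(train_data):
        return train_data[:epoch_size]
    return train_data
```

With this, setting the epoch size in the config to 10000 while the PTB training set holds 50000 ids would actually shrink the per-epoch workload instead of being silently ignored.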

While the FCN5, AlexNet, and ResNet scripts of CNTK and TensorFlow accept the epoch size and will truncate the data if it is smaller than the actual size, the MXNet scripts do not use it to control the training data size.

# From tools/mxnet/common/data.py
train = mx.io.ImageRecordIter(path_imgrec = args.data_dir + args.data_train, ...) # No epoch size used
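One way to honor `num-examples` here would be to cap the number of batches drawn from the record iterator, which is what MXNet's `mx.io.ResizeIter` wrapper does for data iterators. The following is a generic, framework-free sketch of that idea; the function name is hypothetical.

```python
def cap_batches(batch_iter, num_examples, batch_size):
    """Yield at most num_examples // batch_size batches from batch_iter,
    so one epoch covers only the configured number of samples."""
    max_batches = num_examples // batch_size
    for i, batch in enumerate(batch_iter):
        if i >= max_batches:
            break
        yield batch
```

Wrapping the `ImageRecordIter` this way (or using `mx.io.ResizeIter` directly) would make the configured epoch size actually bound the training data per epoch.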

What's worse, the speed-calculation code in tools/mxnet/mxnetbm.py uses the epoch size we set, together with the batch size, to compute the batch count, which yields a wrong "seconds per batch" value whenever the configured epoch size differs from the real one.

# from tools/mxnet/mxnetbm.py
numSamples = args.epochSize
... ...
avgBatch = (avgEpoch/int(numSamples))*float(batchSize)
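To illustrate the discrepancy: training actually runs over the full dataset (so the measured epoch time reflects the real size), but the formula above divides by the configured epoch size. A small sketch of the same formula, with illustrative numbers:

```python
def seconds_per_batch(avg_epoch_seconds, num_samples, batch_size):
    # batches per epoch = num_samples / batch_size, so
    # seconds per batch = epoch time / batches per epoch
    return avg_epoch_seconds / (num_samples / batch_size)

# Suppose one epoch over the full 50000-sample dataset takes 100 seconds
# at batch size 128, but the config declares an epoch size of 10000.
real = seconds_per_batch(100.0, 50000, 128)   # uses the true dataset size
wrong = seconds_per_batch(100.0, 10000, 128)  # uses the configured epoch size
# wrong / real == 5.0: the reported time per batch is inflated 5x
```

This matches the complaint: with a configured epoch size 5x smaller than the real one, the reported seconds per batch comes out 5x too large.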

Please correct the network code as soon as possible.

shyhuai commented 6 years ago

Please note that in the MXNet RNN, numSamples is obtained from the log file, not from the input parameter. The code: https://github.com/hclhkbu/dlbench/blob/master/tools/mxnet/mxnetbm.py#L116. We are refactoring our code and will release it soon to support more flexible extension of testing. Any comments are welcome. Thank you!

shishaochen commented 6 years ago

@shyhuai Yes, numSamples for LSTM is parsed from the log. What about AlexNet, ResNet, and FCN?

shyhuai commented 6 years ago

@shishaochen Because it was a little difficult to configure the epoch size of RNN in the older version of MXNet, we just let it read all the data for convenience. This could be improved for newer versions. For CNN and FCN, the epoch size is configurable in the .config file.

shishaochen commented 6 years ago

@shyhuai I mean that in the CNN and FCN scripts of MXNet, although there is an argument named "num-examples", it is never used to truncate the training data. For example, if you set the epoch size to 10000 (while the real size is 50000), then the average seconds per mini-batch will come out five times larger than it should be.
If you think it is configurable, could you point me to the code? MXNet's RecordIter has no way to configure the epoch size the way CNTK does.