unisound-ail / atlas

mxnet benchmark #2

Open · pineking opened this issue 7 years ago

pineking commented 7 years ago

Conclusions

Configuration               Batch size   Seconds/epoch   GPU utilization
Single machine, 1 GPU       128          51               90%
Single machine, 1 GPU       256          44               95%
Single machine, 2 GPUs      256          24               90%
Single machine, 4 GPUs      512          12               90%
Single machine, 8 GPUs      1024         7.0              80%
2 machines, 4 GPUs each     512          6.5              87%
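
For reference, seconds/epoch is roughly the 50000 training examples divided by the logged throughput: the single-GPU batch-128 run below averages about 990 samples/sec, i.e. 50000 / 990 ≈ 50.5 s, in line with the logged Time cost of 51.785 s.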

Machine configuration

Data and network
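
From the training logs below: the dataset is CIFAR-10 in RecordIO format (data/cifar10_train.rec and data/cifar10_val.rec, 50000 training examples, image_shape='3,28,28' with pad_size=4, random crop and mirror); the network is a 110-layer ResNet (network='resnet', num_layers=110), trained with SGD, lr=0.1, momentum 0.9 and weight decay 0.0001.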

Single machine, single GPU

python train_cifar10.py --gpus=0
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_val='data/cifar10_val.rec', disp_batches=20, gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, network='resnet', num_classes=10, num_epochs=300, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[18:20:17] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: data/cifar10_train.rec, use 4 threads for decoding..
[18:20:17] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: data/cifar10_val.rec, use 4 threads for decoding..
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [20]   Speed: 939.26 samples/sec       Train-accuracy=0.128125
INFO:root:Epoch[0] Batch [40]   Speed: 984.92 samples/sec       Train-accuracy=0.205859
INFO:root:Epoch[0] Batch [60]   Speed: 983.00 samples/sec       Train-accuracy=0.219922
INFO:root:Epoch[0] Batch [80]   Speed: 978.03 samples/sec       Train-accuracy=0.229687
INFO:root:Epoch[0] Batch [100]  Speed: 982.58 samples/sec       Train-accuracy=0.253125
INFO:root:Epoch[0] Batch [120]  Speed: 963.30 samples/sec       Train-accuracy=0.263281
INFO:root:Epoch[0] Batch [140]  Speed: 998.05 samples/sec       Train-accuracy=0.297266
INFO:root:Epoch[0] Batch [160]  Speed: 978.65 samples/sec       Train-accuracy=0.308203
INFO:root:Epoch[0] Batch [180]  Speed: 984.20 samples/sec       Train-accuracy=0.331250
INFO:root:Epoch[0] Batch [200]  Speed: 995.25 samples/sec       Train-accuracy=0.327344
INFO:root:Epoch[0] Batch [220]  Speed: 988.98 samples/sec       Train-accuracy=0.332813
INFO:root:Epoch[0] Batch [240]  Speed: 993.27 samples/sec       Train-accuracy=0.353906
INFO:root:Epoch[0] Batch [260]  Speed: 991.30 samples/sec       Train-accuracy=0.363672
INFO:root:Epoch[0] Batch [280]  Speed: 987.99 samples/sec       Train-accuracy=0.367969
INFO:root:Epoch[0] Batch [300]  Speed: 990.45 samples/sec       Train-accuracy=0.380469
INFO:root:Epoch[0] Batch [320]  Speed: 984.57 samples/sec       Train-accuracy=0.397656
INFO:root:Epoch[0] Batch [340]  Speed: 999.65 samples/sec       Train-accuracy=0.429297
INFO:root:Epoch[0] Batch [360]  Speed: 995.40 samples/sec       Train-accuracy=0.425000
INFO:root:Epoch[0] Batch [380]  Speed: 987.50 samples/sec       Train-accuracy=0.421094
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=51.785
INFO:root:Epoch[0] Validation-accuracy=0.443236
INFO:root:Epoch[1] Batch [20]   Speed: 1041.97 samples/sec      Train-accuracy=0.443359
INFO:root:Epoch[1] Batch [40]   Speed: 991.77 samples/sec       Train-accuracy=0.483203
INFO:root:Epoch[1] Batch [60]   Speed: 1008.41 samples/sec      Train-accuracy=0.446094
INFO:root:Epoch[1] Batch [80]   Speed: 1004.69 samples/sec      Train-accuracy=0.471484
INFO:root:Epoch[1] Batch [100]  Speed: 1001.01 samples/sec      Train-accuracy=0.484375
INFO:root:Epoch[1] Batch [120]  Speed: 998.18 samples/sec       Train-accuracy=0.501953
INFO:root:Epoch[1] Batch [140]  Speed: 980.64 samples/sec       Train-accuracy=0.498828
INFO:root:Epoch[1] Batch [160]  Speed: 1003.91 samples/sec      Train-accuracy=0.535547
INFO:root:Epoch[1] Batch [180]  Speed: 1000.82 samples/sec      Train-accuracy=0.544531
INFO:root:Epoch[1] Batch [200]  Speed: 992.89 samples/sec       Train-accuracy=0.530078
INFO:root:Epoch[1] Batch [220]  Speed: 1010.66 samples/sec      Train-accuracy=0.514453
INFO:root:Epoch[1] Batch [240]  Speed: 1005.17 samples/sec      Train-accuracy=0.535547
INFO:root:Epoch[1] Batch [260]  Speed: 988.56 samples/sec       Train-accuracy=0.534766
INFO:root:Epoch[1] Batch [280]  Speed: 989.83 samples/sec       Train-accuracy=0.545703
INFO:root:Epoch[1] Batch [300]  Speed: 1003.98 samples/sec      Train-accuracy=0.557031
python train_cifar10.py --gpus=0 --batch-size=256
INFO:root:start with arguments Namespace(batch_size=256, benchmark=0, data_nthreads=4, data_train='data/cifar10_train.rec', data_val='data/cifar10_val.rec', disp_batches=20, gpus='0', image_shape='3,28,28', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='200,250', max_random_aspect_ratio=0, max_random_h=36, max_random_l=50, max_random_rotate_angle=0, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0, min_random_scale=1, model_prefix=None, mom=0.9, network='resnet', num_classes=10, num_epochs=300, num_examples=50000, num_layers=110, optimizer='sgd', pad_size=4, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[18:24:29] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: data/cifar10_train.rec, use 4 threads for decoding..
[18:24:29] src/io/iter_image_recordio.cc:221: ImageRecordIOParser: data/cifar10_val.rec, use 4 threads for decoding..
INFO:root:Start training with [gpu(0)]
INFO:root:Epoch[0] Batch [20]   Speed: 1232.52 samples/sec      Train-accuracy=0.153320
INFO:root:Epoch[0] Batch [40]   Speed: 1151.11 samples/sec      Train-accuracy=0.232422
INFO:root:Epoch[0] Batch [60]   Speed: 1158.12 samples/sec      Train-accuracy=0.259766
INFO:root:Epoch[0] Batch [80]   Speed: 1141.59 samples/sec      Train-accuracy=0.278906
INFO:root:Epoch[0] Batch [100]  Speed: 1147.25 samples/sec      Train-accuracy=0.301758
INFO:root:Epoch[0] Batch [120]  Speed: 1150.68 samples/sec      Train-accuracy=0.317969
INFO:root:Epoch[0] Batch [140]  Speed: 1148.73 samples/sec      Train-accuracy=0.349414
INFO:root:Epoch[0] Batch [160]  Speed: 1144.52 samples/sec      Train-accuracy=0.352539
INFO:root:Epoch[0] Batch [180]  Speed: 1141.29 samples/sec      Train-accuracy=0.379102
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=44.563
INFO:root:Epoch[0] Validation-accuracy=0.433105
INFO:root:Epoch[1] Batch [20]   Speed: 1206.70 samples/sec      Train-accuracy=0.398438
INFO:root:Epoch[1] Batch [40]   Speed: 1145.39 samples/sec      Train-accuracy=0.412109
INFO:root:Epoch[1] Batch [60]   Speed: 1144.15 samples/sec      Train-accuracy=0.430273
INFO:root:Epoch[1] Batch [80]   Speed: 1142.66 samples/sec      Train-accuracy=0.441211
INFO:root:Epoch[1] Batch [100]  Speed: 1141.73 samples/sec      Train-accuracy=0.458203
INFO:root:Epoch[1] Batch [120]  Speed: 1143.65 samples/sec      Train-accuracy=0.470508
INFO:root:Epoch[1] Batch [140]  Speed: 1142.52 samples/sec      Train-accuracy=0.489258
INFO:root:Epoch[1] Batch [160]  Speed: 1140.67 samples/sec      Train-accuracy=0.513672
INFO:root:Epoch[1] Batch [180]  Speed: 1138.96 samples/sec      Train-accuracy=0.531445
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=43.542
INFO:root:Epoch[1] Validation-accuracy=0.547070
INFO:root:Epoch[2] Batch [20]   Speed: 1212.84 samples/sec      Train-accuracy=0.547461
INFO:root:Epoch[2] Batch [40]   Speed: 1140.71 samples/sec      Train-accuracy=0.553906

Single machine, multiple GPUs

Single machine, two GPUs
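
The launch command for the two-GPU run is not preserved in this log; judging from the single-GPU invocations above and the batch sizes in the conclusions table, it was presumably along these lines (the comma-separated --gpus list is how train_cifar10.py selects multiple devices; the four- and eight-GPU runs would follow the same pattern with --batch-size=512 and 1024):

python train_cifar10.py --gpus=0,1 --batch-size=256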

Multiple machines, multiple GPUs

Two machines, four GPUs each
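
No log is shown for the two-machine run either; a typical ssh-based launch with MXNet's dmlc launcher would look roughly like the sketch below, where the relative path to launch.py and the hostfile name hosts are placeholders, and only the worker count, GPU list, batch size and dist_sync kvstore come from the runs described above:

python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_cifar10.py --gpus=0,1,2,3 --batch-size=512 --kv-store dist_sync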

xuerq commented 7 years ago

As a follow-up, I tested MXNet's distributed performance on a Kubernetes (k8s) cluster.

batch_size=128: [screenshots of the k8s run, not preserved in this export]

batch_size=256: [screenshots of the k8s run, not preserved in this export]

pineking commented 7 years ago

@xuerq Regarding the multi-machine, multi-GPU experiments with distributed MXNet

pineking commented 7 years ago

@xuerq Some conclusions; please check whether they are correct:

xuerq commented 7 years ago

I haven't measured the ssh-based distributed speed separately. Comparing with the previous post, for distributed training on 2 machines with 4 GPUs each: ssh: 8010.492 images/sec, k8s: 7586.296786 images/sec, so k8s reaches about 94.7% of the ssh speed.
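
As a cross-check, 7586.3 / 8010.5 ≈ 0.947, and 8010 images/sec over the 50000 training images works out to about 6.2 s/epoch, roughly consistent with the 6.5 s/epoch reported for the two-machine, four-GPUs-each configuration in the conclusions table above.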

Could the 5% difference come from the Docker layer? Docker itself should only cost around 1% in performance.

wangkuiyi commented 7 years ago

Could this 5% difference come from the pod-to-pod communication going through an overlay network, i.e. a software layer on top of the physical network?

In principle, Docker by itself shouldn't introduce much of a performance difference.

xuerq commented 7 years ago

Possibly, but it's hard to say. The 5% gap could also just be normal run-to-run variation; we probably need a few more rounds of comparison tests to verify.