merrymercy / tvm-mali

Optimizing Mobile Deep Learning on ARM GPU with TVM
http://tvmlang.org/2018/01/16/opt-mali-gpu.html
MIT License
179 stars 28 forks

about tune #4

Closed janboeye closed 6 years ago

janboeye commented 6 years ago

@merrymercy

Is there any tune guide?

Which parameters could be tuned? Why set num_thread = 8?

Thanks

merrymercy commented 6 years ago

The lines below tune_config contain the logic for setting the tunable parameters.

https://github.com/dmlc/tvm/blob/1d6df5a187127f514ea48f6a8a74c77ff59c89f5/topi/python/topi/mali/conv2d.py#L195-L211
https://github.com/dmlc/tvm/blob/1d6df5a187127f514ea48f6a8a74c77ff59c89f5/topi/python/topi/mali/conv2d.py#L258-L282

For spatial_pack, the tunable parameters are VH, VW, VC, and num_thread; I use grid search to set them. You can also use grid search for other workloads. Some sample code can be found here. We feed in our config at L99.

https://github.com/merrymercy/tvm-mali/blob/90395b509e01e6de6b83f5f2f8c4715dd5a024e9/layer-test/test_conv2d.py#L87-L100
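The grid-search approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual tvm-mali code: `benchmark` is a hypothetical stand-in for compiling the conv2d schedule with a given (VH, VW, VC, num_thread) configuration and timing it on the device, and the candidate value lists are assumed, not taken from the repository.

```python
import itertools

def benchmark(vh, vw, vc, num_thread):
    # Placeholder cost model for illustration only.  In practice this
    # would build the conv2d schedule with these parameters, run it on
    # the target board, and return the measured latency.
    return abs(vh - 2) + abs(vw - 2) + abs(vc - 4) + abs(num_thread - 8)

def grid_search():
    # Exhaustively try every combination of candidate values and keep
    # the configuration with the lowest measured cost.
    best_cfg, best_cost = None, float("inf")
    for vh, vw, vc, nt in itertools.product([1, 2, 4],    # VH candidates
                                            [1, 2, 4],    # VW candidates
                                            [2, 4, 8],    # VC candidates
                                            [4, 8, 16]):  # num_thread candidates
        cost = benchmark(vh, vw, vc, nt)
        if cost < best_cost:
            best_cfg, best_cost = (vh, vw, vc, nt), cost
    return best_cfg

print(grid_search())  # with this toy cost model: (2, 2, 4, 8)
```

The winning configuration would then be fed into the schedule the same way the linked test script feeds its config in at L99.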

janboeye commented 6 years ago

@merrymercy If it runs on the Bifrost architecture, which parameters need to be tuned?

merrymercy commented 6 years ago

I do not have a Bifrost device, so I cannot help you with this problem. But maybe I will tune for a Bifrost device in the near future.

janboeye commented 6 years ago

@merrymercy It looks like the conv2d implementation is not well suited to the Bifrost architecture. If I comment out @conv2d.register(["mali"]) and @generic.schedule_conv2d_nchw.register(["mali"]) and let conv2d fall back to the OpenCL implementation, it achieves better performance.

janboeye commented 6 years ago

@merrymercy Why does _schedule_im2col_conv2d use __local memory on the Mali architecture?

merrymercy commented 6 years ago

Sorry, I am traveling these days and my laptop is broken. It seems you have solved your issue.

If you remove the registration for Mali, it will use CUDA's schedule. I am interested in the Bifrost results. Which GPU do you use? Could you post more details about the performance?

janboeye commented 6 years ago

@merrymercy With CUDA's schedule, the runtime drops to about 100 ms, compared with 180 ms using Mali's schedule. The unroll factor is too large for the Bifrost architecture.
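One common way a schedule guards against this is to cap the unroll factor per architecture. The sketch below illustrates that idea only; the function name, the cap value, and the divisor-picking heuristic are all assumptions for illustration, not the actual logic in TVM's Mali or CUDA schedules.

```python
def pick_unroll_factor(extent, max_unroll=8):
    """Pick the largest divisor of the loop extent that does not
    exceed an architecture-specific cap.

    A smaller cap (e.g. for Bifrost) limits code-size blowup from
    over-aggressive unrolling; a larger cap suits GPUs that tolerate
    bigger kernels.
    """
    for f in range(min(extent, max_unroll), 0, -1):
        if extent % f == 0:
            return f
    return 1

print(pick_unroll_factor(16, max_unroll=8))   # 8
print(pick_unroll_factor(18, max_unroll=8))   # 6
```

With a per-target cap like this, the same schedule template could unroll aggressively on one GPU while staying conservative on another, which is the kind of tuning knob the Bifrost results above suggest is needed.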