moskewcz / boda

Boda: A C++ Framework for Efficient Experiments in Computer Vision

Tuning of a CNN #22

Open dinhv opened 7 years ago

dinhv commented 7 years ago

The existing auto-tuner in Boda is only able to tune one operation per call, i.e. we have to pass the desired operation and some additional information. What if we want to tune a whole CNN? Basically we just have to tune the convolution of each layer. We can omit wisdom files, and the tuning parameters MNb, MNt, Kb, vw are known. Is it possible to get the other required tuning information just from a .prototxt input, i.e. the content of ops-fn? Maybe Boda could create it when reading a .prototxt file to set up a cnet? Example: --ops-fn="%(boda_test_dir)/conv-ops-debug.txt" with content (str_vals=(type=Convolution),nda_vals=(biases=(dims=(out_chan=96)),filts=(dims=(out_chan=96,in_chan=3,y=11,x=11)),in=(dims=(img=20,chan=3,y=227,x=227)),in_pad=(tn=none,dims=(y=0,x=0)),kern_sz=(tn=none,dims=(y=11,x=11)),out=(dims=(img=20,chan=96,y=55,x=55)),stride=(tn=none,dims=(y=4,x=4)),out_chans=(tn=uint32_t,v=96)))
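For illustration, here is a minimal, hypothetical sketch (not actual Boda code) of what generating such an ops-fn could look like: given per-layer convolution parameters extracted from a .prototxt (parsing not shown), it emits one Convolution line per layer in the format of the example above. All struct and function names here are made up; only the output format follows the example.

```cpp
// hypothetical sketch: emit one ops-fn-style Convolution line per layer,
// given per-layer parameters extracted from a .prototxt (parsing not shown).
#include <cstdio>

struct conv_params_t {
  int imgs, in_chans, in_y, in_x;     // input blob dims
  int out_chans, k_y, k_x;            // filter dims
  int stride_y, stride_x, pad_y, pad_x;
};

void emit_ops_fn_line( conv_params_t const & p ) {
  int const out_y = ( p.in_y + 2*p.pad_y - p.k_y ) / p.stride_y + 1;
  int const out_x = ( p.in_x + 2*p.pad_x - p.k_x ) / p.stride_x + 1;
  printf( "(str_vals=(type=Convolution),nda_vals=("
          "biases=(dims=(out_chan=%d)),"
          "filts=(dims=(out_chan=%d,in_chan=%d,y=%d,x=%d)),"
          "in=(dims=(img=%d,chan=%d,y=%d,x=%d)),"
          "in_pad=(tn=none,dims=(y=%d,x=%d)),"
          "kern_sz=(tn=none,dims=(y=%d,x=%d)),"
          "out=(dims=(img=%d,chan=%d,y=%d,x=%d)),"
          "stride=(tn=none,dims=(y=%d,x=%d)),"
          "out_chans=(tn=uint32_t,v=%d)))\n",
          p.out_chans,
          p.out_chans, p.in_chans, p.k_y, p.k_x,
          p.imgs, p.in_chans, p.in_y, p.in_x,
          p.pad_y, p.pad_x,
          p.k_y, p.k_x,
          p.imgs, p.out_chans, out_y, out_x,
          p.stride_y, p.stride_x,
          p.out_chans );
}

int main( void ) {
  // AlexNet-conv1-like example matching the line above
  conv_params_t conv1 = { 20, 3, 227, 227,  96, 11, 11,  4, 4, 0, 0 };
  emit_ops_fn_line( conv1 );
  return 0;
}
```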

moskewcz commented 7 years ago

for running NNs in boda, see #14 for an overview of the flow. with respect to autotuning, this flow doesn't do any: it just heuristically sets tuning parameters. more to the point, the tuning space itself is currently designed so that a single operating point can be heuristically set/chosen, rather than exploiting the ability to try multiple points.

first, let's review the current autotuning support. it is at the 'results-table-generation' level; the current flow is only semi-automated, and has several phases.

the profiling/autotuning data is a mapping from each operation to, per platform, the measured runtimes of the candidate functions (and their tuning parameters) that implement it.

for example, if the operation is: (str_vals=(type=Convolution),nda_vals=(biases=(dims=(out_chan=16)),filts=(dims=(out_chan=16,in_chan=192,y=1,x=1)),in=(dims=(img=1,chan=192,y=28,x=28)),in_pad=(tn=none,dims=(y=0,x=0)),kern_sz=(tn=none,dims=(y=1,x=1)),out=(dims=(img=1,chan=16,y=28,x=28)),out_chans=(tn=uint32_t,v=16),stride=(tn=none,dims=(y=1,x=1))))

then the wisdom file might contain (per-platform) runtimes for various functions that implement that operation, such as: (str_vals=(func_name=tconv,type=Convolution),nda_vals=(biases=(dims=(out_chan=16)),conv_has_relu=(tn=uint32_t,v=1),filts=(dims=(out_chan_blk=1,in_chan=192,y=1,x=1,out_chan_reg=4,out_chan_tile=4)),flags=(tn=uint32_t),in=(dims=(blk_bline=2,blk_bx=7,blk_in_chan=192,blk_y=16,blk_x=4)),in_pad=(tn=none,dims=(y=0,x=0)),in_ref=(dims=(img=1,chan=192,y=28,x=28)),out=(dims=(img=1,chan=16,y=28,x=28)),stride=(tn=none,dims=(y=1,x=1)),work=(tn=none,dims=(blk_bline=2,blk_bx=7,out_chan_blk=1,blk_y=16,out_chan_tile=4,pels=4,out_chan=4)))) /op_tune_wisdom_t op_tune_wisdom_t (use_be=ocl,MNt=4 4,MNb=8 8,tconv=1,tconv_max_ksz=11 11) op_run_t ocl:Fiji 0.00087184
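as a rough sketch (illustrative types only, not boda's actual wisdom classes), that mapping looks something like this in c++:

```cpp
// rough sketch of the mapping the wisdom files encode (illustrative names,
// not boda's actual wisdom classes): an operation signature maps to a set of
// candidate implementations, each with its tuning point and per-platform runtimes.
#include <map>
#include <string>
#include <vector>

struct wisdom_entry_t {
  std::string func_desc;  // annotated function description, e.g. the tconv entry above
  std::string op_tune;    // tuning point, e.g. (use_be=ocl,MNt=4 4,MNb=8 8,tconv=1,tconv_max_ksz=11 11)
  std::map< std::string, double > runtime_by_plat;  // e.g. { "ocl:Fiji" : 0.00087184 }
};

// keyed by the canonical operation description (the (str_vals=...,nda_vals=...) string)
typedef std::map< std::string, std::vector< wisdom_entry_t > > wisdom_map_t;
```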

this entry shows the annotated function description for the chosen implementation (func_name=tconv plus its blocked dims and work dims), the tuning parameters used (the op_tune_wisdom_t part), the platform (op_run_t ocl:Fiji), and the measured runtime (0.00087184). the (annotated) function description contains all information needed to generate the function that was used to achieve the given runtime for the given operation.

so, how does one autotune a whole NN? ideally, one would pick the best version of each operation in the NN by considering not just each operation in isolation, but also the overhead of combining different operations, based on any data-format conversions needed between them. however, it seems like a reasonable first step to choose good implementations/functions per-operation first, and worry about net-level optimization later (although some important net-level optimizations may need to be dealt with earlier as special cases).
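a minimal sketch of that per-operation-only first step, assuming a map of measured candidates per operation on a given platform (illustrative names, not boda's API):

```cpp
// sketch of the 'per-operation only' first step: for each op in the net,
// pick the fastest known implementation on the target platform, ignoring
// cross-op format-conversion costs for now. names are illustrative.
#include <limits>
#include <map>
#include <string>
#include <vector>

struct candidate_t { std::string func_desc; std::string op_tune; double runtime; };

// best candidate per op description, for one platform's measurements
std::map< std::string, candidate_t >
choose_per_op( std::vector< std::string > const & net_op_descs,
               std::map< std::string, std::vector< candidate_t > > const & measured )
{
  std::map< std::string, candidate_t > best;
  for( std::string const & op : net_op_descs ) {
    auto it = measured.find( op );
    if( it == measured.end() ) { continue; }  // op never profiled; would need a fallback/heuristic
    candidate_t cur{ "", "", std::numeric_limits<double>::infinity() };
    for( candidate_t const & c : it->second ) { if( c.runtime < cur.runtime ) { cur = c; } }
    best[op] = cur;
  }
  return best;
}
```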

now, in theory, you can determine the best version/tuning-parameters/annotations for each operation in any manner, online or offline. however, for both research and practice, i think it's impractical to use a purely online approach, due to the time required to autotune each operation, and the need to do many experiments while tweaking/changing different parts of the flow. a caching-based approach would be nicer than a pure-offline approach, but perhaps is an unneeded complexity -- i don't have a strong opinion there. but in any event, some repository/library that maps operations to some set of possible implementations seems needed.

right now, that's close to what the wisdom files are, but not quite. the wisdom files store runtimes for all points in the tuning space that were tried. however, for autotuning, we really only need the pareto-best points for each operation; in general, that means one 'best' function for each unique combination of input format, output format, and platform. currently, for the most part, operations have only a single input/output format, but k1conv is a notable exception due to having two output formats. in general, handling the needed format conversions is one of the tricky aspects of getting full-net autotuning to work (both in terms of correctness and speed).
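a sketch of that reduction, assuming each measured point is tagged with its op, formats, and platform (again, illustrative names only, not boda's actual types):

```cpp
// sketch of reducing the full wisdom data to only the 'best' points:
// keep one winner per (operation, input format, output format, platform),
// which is what per-op autotuning actually needs.
#include <map>
#include <string>
#include <tuple>

struct measured_func_t {
  std::string op_desc;          // canonical operation description
  std::string in_fmt, out_fmt;  // data formats; e.g. k1conv has two possible output formats
  std::string plat;             // e.g. "ocl:Fiji"
  std::string func_desc, op_tune;
  double runtime;
};

typedef std::tuple< std::string, std::string, std::string, std::string > best_key_t;

void keep_if_best( std::map< best_key_t, measured_func_t > & best, measured_func_t const & mf ) {
  best_key_t const k( mf.op_desc, mf.in_fmt, mf.out_fmt, mf.plat );
  auto it = best.find( k );
  if( it == best.end() || mf.runtime < it->second.runtime ) { best[k] = mf; }
}
```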