moskewcz / boda

Boda: A C++ Framework for Efficient Experiments in Computer Vision

Tuning of a CNN #22

Open dinhv opened 7 years ago

dinhv commented 7 years ago

The existing auto-tuner in Boda is only able to tune one operation per call, i.e. we have to pass the desired operation and some additional information. What if we want to tune a whole CNN? Basically we just have to tune the convolution of each layer. We can omit wisdom files, and the tuning parameters MNb, MNt, Kb, vw are known. Is it possible to get the other required tuning information just from a .prototxt input, i.e. the content of ops-fn? Maybe Boda could create it when reading a .prototxt file to set up a cnet? Example: --ops-fn="%(boda_test_dir)/conv-ops-debug.txt" with content (str_vals=(type=Convolution),nda_vals=(biases=(dims=(out_chan=96)),filts=(dims=(out_chan=96,in_chan=3,y=11,x=11)),in=(dims=(img=20,chan=3,y=227,x=227)),in_pad=(tn=none,dims=(y=0,x=0)),kern_sz=(tn=none,dims=(y=11,x=11)),out=(dims=(img=20,chan=96,y=55,x=55)),stride=(tn=none,dims=(y=4,x=4)),out_chans=(tn=uint32_t,v=96)))
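For illustration, here is a minimal, hypothetical sketch (not actual Boda code) of what generating such an ops-fn could look like: given per-layer convolution parameters extracted from a .prototxt (parsing not shown), it emits one Convolution line per layer in the format of the example above. All struct and function names here are made up; only the output format follows the example.

```cpp
// hypothetical sketch: emit one ops-fn-style Convolution line per layer,
// given per-layer parameters extracted from a .prototxt (parsing not shown).
#include <cstdio>

struct conv_params_t {
  int imgs, in_chans, in_y, in_x;     // input blob dims
  int out_chans, k_y, k_x;            // filter dims
  int stride_y, stride_x, pad_y, pad_x;
};

void emit_ops_fn_line( conv_params_t const & p ) {
  int const out_y = ( p.in_y + 2*p.pad_y - p.k_y ) / p.stride_y + 1;
  int const out_x = ( p.in_x + 2*p.pad_x - p.k_x ) / p.stride_x + 1;
  printf( "(str_vals=(type=Convolution),nda_vals=("
          "biases=(dims=(out_chan=%d)),"
          "filts=(dims=(out_chan=%d,in_chan=%d,y=%d,x=%d)),"
          "in=(dims=(img=%d,chan=%d,y=%d,x=%d)),"
          "in_pad=(tn=none,dims=(y=%d,x=%d)),"
          "kern_sz=(tn=none,dims=(y=%d,x=%d)),"
          "out=(dims=(img=%d,chan=%d,y=%d,x=%d)),"
          "stride=(tn=none,dims=(y=%d,x=%d)),"
          "out_chans=(tn=uint32_t,v=%d)))\n",
          p.out_chans,
          p.out_chans, p.in_chans, p.k_y, p.k_x,
          p.imgs, p.in_chans, p.in_y, p.in_x,
          p.pad_y, p.pad_x,
          p.k_y, p.k_x,
          p.imgs, p.out_chans, out_y, out_x,
          p.stride_y, p.stride_x,
          p.out_chans );
}

int main( void ) {
  // AlexNet-conv1-like example matching the line above
  conv_params_t conv1 = { 20, 3, 227, 227,  96, 11, 11,  4, 4, 0, 0 };
  emit_ops_fn_line( conv1 );
  return 0;
}
```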

moskewcz commented 7 years ago

for running NNs in boda, see #14 for an overview of the flow. with respect to autotuning, this flow doesn't do any: it just heuristically sets tuning parameters. more to the point, the tuning space itself is currently designed so that a single operating point can be heuristically set/chosen, rather than exploiting the ability to try multiple points.

first, let's review the current autotuning support. it is at the 'results-table-generation' level; the current flow is only semi-automated, and has several phases.

the profiling/autotuning data is a mapping from each operation to, per platform, the measured runtimes of the candidate functions (and their tuning parameters) that implement it.

for example, if the operation is: (str_vals=(type=Convolution),nda_vals=(biases=(dims=(out_chan=16)),filts=(dims=(out_chan=16,in_chan=192,y=1,x=1)),in=(dims=(img=1,chan=192,y=28,x=28)),in_pad=(tn=none,dims=(y=0,x=0)),kern_sz=(tn=none,dims=(y=1,x=1)),out=(dims=(img=1,chan=16,y=28,x=28)),out_chans=(tn=uint32_t,v=16),stride=(tn=none,dims=(y=1,x=1))))

then the wisdom file might contain (per-platform) runtimes for various functions that implement that operation, such as: (str_vals=(func_name=tconv,type=Convolution),nda_vals=(biases=(dims=(out_chan=16)),conv_has_relu=(tn=uint32_t,v=1),filts=(dims=(out_chan_blk=1,in_chan=192,y=1,x=1,out_chan_reg=4,out_chan_tile=4)),flags=(tn=uint32_t),in=(dims=(blk_bline=2,blk_bx=7,blk_in_chan=192,blk_y=16,blk_x=4)),in_pad=(tn=none,dims=(y=0,x=0)),in_ref=(dims=(img=1,chan=192,y=28,x=28)),out=(dims=(img=1,chan=16,y=28,x=28)),stride=(tn=none,dims=(y=1,x=1)),work=(tn=none,dims=(blk_bline=2,blk_bx=7,out_chan_blk=1,blk_y=16,out_chan_tile=4,pels=4,out_chan=4)))) /op_tune_wisdom_t op_tune_wisdom_t (use_be=ocl,MNt=4 4,MNb=8 8,tconv=1,tconv_max_ksz=11 11) op_run_t ocl:Fiji 0.00087184
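as a rough sketch (illustrative types only, not boda's actual wisdom classes), that mapping looks something like this in c++:

```cpp
// rough sketch of the mapping the wisdom files encode (illustrative names,
// not boda's actual wisdom classes): an operation signature maps to a set of
// candidate implementations, each with its tuning point and per-platform runtimes.
#include <map>
#include <string>
#include <vector>

struct wisdom_entry_t {
  std::string func_desc;  // annotated function description, e.g. the tconv entry above
  std::string op_tune;    // tuning point, e.g. (use_be=ocl,MNt=4 4,MNb=8 8,tconv=1,tconv_max_ksz=11 11)
  std::map< std::string, double > runtime_by_plat;  // e.g. { "ocl:Fiji" : 0.00087184 }
};

// keyed by the canonical operation description (the (str_vals=...,nda_vals=...) string)
typedef std::map< std::string, std::vector< wisdom_entry_t > > wisdom_map_t;
```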

this entry shows the annotated function description for the chosen implementation (func_name=tconv plus its blocked dims and work dims), the tuning parameters used (the op_tune_wisdom_t part), the platform (op_run_t ocl:Fiji), and the measured runtime (0.00087184). the (annotated) function description contains all information needed to generate the function that was used to achieve the given runtime for the given operation.

so, how does one autotune a whole NN? ideally, one would pick the best version of each operation in the NN by considering not just each operation in isolation, but also the overhead of combining different operations, based on any data-format conversions needed between them. however, it seems like a reasonable first step to choose good implementations/functions per-operation first, and worry about net-level optimization later (although some important net-level optimizations may need to be dealt with earlier as special cases).
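a minimal sketch of that per-operation-only first step, assuming a map of measured candidates per operation on a given platform (illustrative names, not boda's API):

```cpp
// sketch of the 'per-operation only' first step: for each op in the net,
// pick the fastest known implementation on the target platform, ignoring
// cross-op format-conversion costs for now. names are illustrative.
#include <limits>
#include <map>
#include <string>
#include <vector>

struct candidate_t { std::string func_desc; std::string op_tune; double runtime; };

// best candidate per op description, for one platform's measurements
std::map< std::string, candidate_t >
choose_per_op( std::vector< std::string > const & net_op_descs,
               std::map< std::string, std::vector< candidate_t > > const & measured )
{
  std::map< std::string, candidate_t > best;
  for( std::string const & op : net_op_descs ) {
    auto it = measured.find( op );
    if( it == measured.end() ) { continue; }  // op never profiled; would need a fallback/heuristic
    candidate_t cur{ "", "", std::numeric_limits<double>::infinity() };
    for( candidate_t const & c : it->second ) { if( c.runtime < cur.runtime ) { cur = c; } }
    best[op] = cur;
  }
  return best;
}
```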

now, in theory, you can determine the best version/tuning-parameters/annotations for each operation in any manner, online or offline. however, for both research and practice, i think it's impractical to use a purely online approach, due to the time required to autotune each operation, and the need to do many experiments while tweaking/changing different parts of the flow. a caching-based approach would be nicer than a pure-offline approach, but perhaps is an unneeded complexity -- i don't have a strong opinion there. but in any event, some repository/library that maps operations to some set of possible implementations seems needed.

right now, that's close to what the wisdom files are, but not quite. the wisdom files store runtimes for all points in the tuning space that were tried. however, for autotuning, we really only need the pareto-best points for each operation; in general, that means one 'best' function for each unique combination of input format, output format, and platform. currently, for the most part, operations have only a single input/output format, but k1conv is a notable exception due to having two output formats. in general, handling the needed format conversions is one of the tricky aspects of getting full-net autotuning to work (both in terms of correctness and speed).
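a sketch of that reduction, assuming each measured point is tagged with its op, formats, and platform (again, illustrative names only, not boda's actual types):

```cpp
// sketch of reducing the full wisdom data to only the 'best' points:
// keep one winner per (operation, input format, output format, platform),
// which is what per-op autotuning actually needs.
#include <map>
#include <string>
#include <tuple>

struct measured_func_t {
  std::string op_desc;          // canonical operation description
  std::string in_fmt, out_fmt;  // data formats; e.g. k1conv has two possible output formats
  std::string plat;             // e.g. "ocl:Fiji"
  std::string func_desc, op_tune;
  double runtime;
};

typedef std::tuple< std::string, std::string, std::string, std::string > best_key_t;

void keep_if_best( std::map< best_key_t, measured_func_t > & best, measured_func_t const & mf ) {
  best_key_t const k( mf.op_desc, mf.in_fmt, mf.out_fmt, mf.plat );
  auto it = best.find( k );
  if( it == best.end() || mf.runtime < it->second.runtime ) { best[k] = mf; }
}
```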