Usage of shared memory for tiled convolution

I tried to auto tune a convolution with following input parameters: ops-prof --out-fn="%(boda_output_dir)/cnn_op_info.txt" --kg-tune-tag=ocl-def --func-mrd-toler="(cudnn_conv=4e-4)" --wisdom-out-fn="%(boda_output_dir)/wisdom.wis" --ops-fn="%(boda_test_dir)/conv-ops-debug.txt" --gen-data="(str_vals=(type=gen_data),nda_vals=(vi=(tn=float,v=0.0),mode=(tn=uint32_t,v=5)))" --wisdom-in-fn="%(boda_test_dir)/good_tr/conv-gen5/wisdom.wis" --op-tunes="(ocl-def=(use_be=ocl,),ocl-default=(use_be=ocl,MNt=32:16,MNb=8:16,Kb=8,k1conv=0,tconv=1,ipconv=0,tconv_max_ksz=19:19))"

This leads to a failure because the tuned kernel uses too much shared data. When I looked at the generated OpenCL code I found the line responsible for this error: LOCSHAR_MEM float all_smem[1056+13230]; // note: filts + in (or in detail: filts_smem_sz + in_blk_in_chan_stride)

My questions are:

How is in_blk_in_chan_stride being calculated and where is it used during the code generation process?
How is filts_smem_sz being calculated? As far as I know this value is set in the cnn_codegen.cc file in the function gen_op_tconv .

as normal for GPU programming, shared memory holds data that is reused multiple times per block (workgroup) across multiple threads, in order to reduce global memory loads/stores. to understand the per-block shared memory and registers requirements, one must understand the blocking strategy used.

each work-block for tconv computes some range of output channels across some range of input-windows/output-points. for tconv, the inputs/output regions are small rectangles. for a ksz=1x1 convolution with stride=1, the input and output regions will be the same size. however, for larger kernel sizes and/or strides>1, the input window will be larger than the output window: it will increase by (ksz - 1) in each dimension, and will also increase roughly proportional to the stride (i.e. a stride of 2 will need about a 2*2=4 times larger input area). in general, the work-blocking is calculated in the below file, with various special-cases depending on the variant: https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc

for tconv, the total work is organized using 7 dimensions. the innermost 4 dimensions of 'work' define how much work is done per work block. the innermost 2 dimensions, 'pels' and 'out_chan', define (respectively) how many output points (spatially, adjacent in x) and how many output channels each thread will compute. hence, the number of output registers needed per thread is the product of these two dimensions:

float out_tile[%(work_pels_dim)*%(work_out_chan_dim)] = {0}; // tile of output for this thread to compute, stored in registers

the next two dimensions, 'blk_y' and 'out_chan_tile', define (respectively) how many adjacent rows of input and how many groups of output channels (where each group is of size 'out_chan') the entire work block will handle. since each thread handles one group of 'out_chan' output channels for one row of output points (of length 'pels'), we will need the product of 'blk_y' and 'out_chan_tile' threads for the block (and thus this defines the number of threads in the block).

combining these 4 per-work-block dimensions, we can see that each work-block will handle (blk_y*pels) output points (spatially organized as blk_y rows and pels columns of the output, with rows merged/concatenated across all images) for (out_chan_tile*out_chan) output channels. thus, the output tile for each tconv work-block is a rectangle of size YX == blk_y\pels. the input tile size is then determined from this depending on the kernel size and stride, and is defined by the innermost 2 dimensions of the 'in' argument, 'blk_y' and 'blk_x'. the next dimension of 'in' is 'blk_in_chan'. the stride of this dimension (in_blk_in_chan_stride) is thus the product of the blk_y and blk_x dimensions. as with all NDA stride template variables, this is calculated using the following function: https://github.com/moskewcz/boda/blob/master/src/rtc_func_gen.cc#L207 which is in turn called for all function arguments here: https://github.com/moskewcz/boda/blob/master/src/rtc_func_gen.cc#L400

(see also the incomplete/maybe-out-of-date CUCL guide: https://github.com/moskewcz/boda/blob/master/doc/cucl_guide.md )

the actual calculation of the sum for each output value is done per-input-channel. that is, all threads in the entire work-block cooperate to load one input channel of the work-block's input tile into shared memory for each iteration of the work-block-function-outer-loop. thus, the needed shared memory for the input is exactly this amount of data: the stride of the input array for the input channel dimension, which in CUCL is exposed as the in_blk_in_chan_stride template variable.

the other component of shared memory usage is for the filters. rather than loading all filter values needed for one iteration of the outer loop, tconv instead loads filter values row-by-row, inside the inner loop over the kernel y size. thus, the total shared memory needed for the filters is kernel_x_size*output_channels_per_block. the innermost 2 dimensions (out_chan_reg, out_chan_tile) of the the filters array define the number of output channels per thread and per the number of groups per block, so the product of these two dimensions is the number of output channels per block. note that filts.out_chan_reg==work.out_chan, and filts.out_chan_tile==work.out_chan_tile, as can be seen where they are set: https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc#L304

the next dimension out of filts is 'x', the kernel x size. after than, the next dim is 'y', the kernel y size. so, the size of one input channel of filters, for one work-block worth of output channels, for all x values, is the stride of the 'y' dimension, so this is set as the filts_smem_sz: https://github.com/moskewcz/boda/blob/master/src/cnn_codegen.cc#L782

the remaining 3 outermost dimensions of 'work' correspond to the different independent work-blocks:

out_chan_blk is ceil( total_output_channels / output_channels_per_block )
blk_bx is ceil( total_output_columns / output_columns_per_block )
blk_bline is ceil( total_output_lines / output_lines_per_block )

thus, the total number of work blocks needed for a given case is the product of these three dimensions. but, each work-block is independent, so the values of these dims won't effect per-block shared-memory or register usage -- they only effect which parts of the input are used and which parts of the output are computed for each block.

for additional reference, here are links to the tconv CUCL template and the matching code generation:

https://github.com/moskewcz/boda/blob/master/test/rtc/tconv.cucl#L14

https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc#L304

a separate, but related note: the current autotuning and code generation were co-designed, and also have various properties that predate autotuning entirely. one key idea for improving the autotuning is that the interface between these two stages can/should change once autotuning becomes an integral and reliable part of the normal flow.

for example, consider the case of tuning paramters such as tconv_max_ksz. this parameter exists only to hueristically control in which cases the tconv variant will be used. this is useful/sensible when the code must use a single set of tuning parameters across all cases.

however, for autotuning, it makes no sense to sweep over this parameter for each convolution, since there are only two possible outcomes for any setting: either tconv is enabled or disabled. further, even trying those two cases makes little sense; presumably the autotuner should instead simply try separate cases for each variant (tconv/k1conv/conv/etc...), rather than sweeping over heuristic parameters than enable/disable them for each case.

Thank you for the answers.

So filts_smem_sz = filts.dstride("y"). Using the same input for Boda as I described above, filts_smem_sz = filts.dstride("y") = rcg.op.nda_vals["filts"].stride = 1056, but where is this 1056 is being calculated exactly and how.

the stride of this dimension (in_blk_in_chan_stride) is thus the product of the blk_y and blk_x dimensions. as with all NDA stride template variables, this is calculated using the following function: https://github.com/moskewcz/boda/blob/master/src/rtc_func_gen.cc#L207

As far as I understand this function, it just set the stride value, but doesn't actually calculate the stride.

Thank you for the answers.

So filts_smem_sz = filts.dstride("y"). Using the same input for Boda as I described above, filts_smem_sz = filts.dstride("y") = rcg.op.nda_vals["filts"].stride = 1056, but where is this 1056 is being calculated exactly and how.
the stride of this dimension (in_blk_in_chan_stride) is thus the product of the blk_y and blk_x dimensions. as with all NDA stride template variables, this is calculated using the following function:
https://github.com/moskewcz/boda/blob/master/src/rtc_func_gen.cc#L207
As far as I understand this function, it just set the stride value, but doesn't actually calculate the stride.

ah, yep, fair enough. i'll correct/enhance my answer above later, but for now a quick reply:

the insert_nda_dims_sz() function i referenced is in charge of exporting ND-Array metadata (including dimension sizes and strides) to the CUCL level as template-variables/values. however, yes, it doesn't set the sizes/strides, those are properties of the ND-Arrays themselves.
currently, boda ND-Array don't have padding, so the strides are derived from the sizes. the stride at each dimension is the product of the sizes of all 'faster' (more inner) dimensions.
the strides are (explicitly) cached when the dimensions for each ND-Array are set/modified, in the function nda_t::calc_strides(), which in particular is called from the nda_t::init() functions. see the following search: https://github.com/moskewcz/boda/search?utf8=%E2%9C%93&q=calc_strides&type=
for any given ND-Array, the sizes (and thus the strides) can be set in different ways, depending on the semantics/usage of the ND-Array. some are set heuristically, some are related to input values, some are a mix, and so on. in particular, lots are set in cnn_op.cc -- for the case of the filters, see this line: https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc#L304 (as linked in my original answer, but without an explanation of why i linked that particular line ...).

to break it down: the "y" stride will be the product of the:

"x" size (11 in this case)
"out_chan_reg" ( out chans per thread )
"out_chan_tile" ( # of blocks of out_chan_reg out chans per thread block )

where in this case, the product of "out_chan_reg" and "out_chan_tile" is 96 -- again, these are set in cnn_op.cc, and are derived heuristically from a combination of the tuning parameters and the specific geometry of the convolution to be performed. i can add more details on that if you want, but i'll need to run the specific example and trace the code in more detail than i can do at the moment. note that in this case, there are only 96 out chans total, so each work-block will handle all output channels. this means that even though the tuning parameters might suggest that each work-block handle more than 96 output chans, this won't happen, as it isn't sensible for this particular convolution. but for starters, here are some relevant lines:

https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc#L77 // fetch tune params https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc#L151 // determine tiling for input windows and chans https://github.com/moskewcz/boda/blob/master/src/cnn_op.cc#L203 // setting per-thread work dims

Ok, so currently I'm trying to derive a simple constraint for the usage of shared memory in tiled convolution. Do you have any hints, which parameters I have to consider e.g. input size, kernel size, vw, MNb, MNt, Kb and so on?

moskewcz / boda

Usage of shared memory for tiled convolution #21