dobkeratops opened this issue 9 years ago
I'm interested. I think it would be good to first have a test case or proof-of-concept for the Epiphany. Given the limitations of the core, I would like to see a proof-of-concept parallel implementation that can efficiently handle the inter-core communication to do something useful (even though all PAL functions are currently serial).
There's no doubt in my mind it could do it - anything a cache can do, inter-core communication can; you just sometimes need to decide manually what is kept. (In the case of convolutions, I think most of the time you'd want to assume the kernel is kept local, but for completeness one might also have to offer the option of streaming the kernels?)
Another question is how far to go with inter-core communication: a single function to do multiple passes would give you the opportunity to run a whole pipeline on chip (where a GPU would be serial between layers, communicating through shared L2); however, the APIs would begin to explode in complexity, and I got the impression PAL wants to stay quite generic & straightforward. (You might need further details, like describing streams of data.)
e.g. something like this:

```c
/* describe a convolution step */
struct Convolution4dDesc
{
    float* kernel;
    int    kernel_size[4];  /* 4d tensor; supply trailing 1's for 2d or 3d */
    float* bias;            /* one bias per slice of the 4th axis */
    float  min;
    float  max;             /* output values clamped to [min, max] */
    int    xy_pooling_reduction_factor; /* optional reduction in the first 2 axes; 1 for no effect */
};

/* Apply a series of convolutions (may use on-chip communication between passes).
 * 3d data with a 4d kernel; any dimension being '1' can be skipped,
 * e.g. a 32x32x64x1 image with a 32x32x64x48 kernel produces an output
 * that can be treated as 32x32x48x1 (or 32x32x1x48) for the next pass. */
void multi_conv4d(
    const float* input, const int input_size[4], /* 4d tensor, e.g. start with width x height x 3 channels x 1 */
    float* output,                 /* output size is computed from the convolutions & reductions */
    const Convolution4dDesc* conv, int num_conv,
    float* workspace               /* buffer for temporary layer evaluation; contents undefined afterwards; may be NULL for a single convolution */
);

void calc_conv4d_size(
    const int input_size[4],       /* the output size depends on the input size as well as the kernels */
    const Convolution4dDesc* conv, int num_conv,
    int output_size[4], int* workspace_size);
```

The implementation could divide the input into strips, pipelining a traversal and beginning to process the next layer as soon as enough scanlines are available from the previous one; or, if insufficient on-chip memory is available to make this worthwhile, it could work serially between layers, using the 'workspace' buffer.
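For illustration, a minimal call-sequence sketch for the API above. Everything here is invented: the layer sizes, and the kernel0/bias0/kernel1/bias1 and input buffers, which are assumed to point at prepared data.

```c
#include <float.h>   /* FLT_MAX */
#include <stdlib.h>  /* malloc  */

/* Two hypothetical layers: 32x32 RGB input through two 5x5 convolutions,
 * each with ReLU (clamp at 0) and a 2x2 pooling reduction. */
Convolution4dDesc layers[2] = {
    /* kernel   kernel_size     bias   min   max      pooling */
    {  kernel0, {5, 5, 3, 16},  bias0, 0.0f, FLT_MAX, 2 },
    {  kernel1, {5, 5, 16, 32}, bias1, 0.0f, FLT_MAX, 2 },
};
int in_size[4] = {32, 32, 3, 1};
int out_size[4], ws_size;

calc_conv4d_size(in_size, layers, 2, out_size, &ws_size);
float* out = malloc(sizeof(float) * out_size[0] * out_size[1] * out_size[2] * out_size[3]);
float* ws  = malloc(sizeof(float) * ws_size);
multi_conv4d(input, in_size, out, layers, 2, ws);
```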
If you're streaming the kernel weights from off-chip memory, it seems to me that the Epiphany is no longer a very good choice. Copying in each weight to perform a single multiply-accumulate with it is not an effective use of the architecture: at one MAC per 4-byte weight streamed in, the computation quickly becomes bandwidth-bound. The most effective arrangement would keep preconfigured weights on chip and use inter-core communication while streaming the input through. An effective inter-core communication scheme has to be worked out for pretty much any network size that does something useful.
Currently PAL can address the computation part, but the communication here is just not as straightforward as moving edges in image processing. The communication buffers may be asymmetric in size and do not necessarily translate to nearest neighbors on the on-chip network. It's complicated.
Still, I agree that the prospect of doing it on the Epiphany is interesting since it does allow you to do the entire NN without explicit synchronization to global memory between layers like on a GPU.
"Copying the weights to perform a single multiply-accumulate per weight"
For convolutions, you'd load a weightmap which is then invoked many times across the image (e.g. a 512x512 float image against a 16x16x64 kernel is a 1MB image vs a 64KB kernel), so there should still be some value.
But perhaps this can be improved further by a single call that applies multiple layers to a batch of images (upload kernels; upload image 1; classify; upload image 2; ...).
Or by providing a function to upload weights, then another function to apply them (some GL-esque interface):

```c
int  load_weights(weights*, layout_descriptor);  /* returns a weight-map ID */
void convolute(weightmap_id, data, output);
void unload_weights(weightmap_id);               /* release workspace */
```
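To make the batching point concrete, a minimal usage sketch assuming the hypothetical interface above (none of these names exist in PAL today):

```c
/* upload the weights once, then reuse them across a batch of images */
int wm = load_weights(&net_weights, &net_layout);
for (int i = 0; i < num_images; ++i)
    convolute(wm, images[i], outputs[i]);  /* weights stay resident on chip */
unload_weights(wm);                        /* release the on-chip workspace */
```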
For image recognition, that's typically ~60MB of weights? It's going to be a while before there's an Epiphany chip capable of holding all that; but even then, I thought holding the intermediate layer values on chip would still be a win. I guess if they do actually make a 1024-core chip with 128KB per core (128MB of local store in total), that could do it amazingly well.
Yes, thinking toward the future, a larger chip should do quite well. For now, it would be best to focus on toy codes and proofs of concept. This is architecture and algorithm research, since it's not as if performance or energy-efficiency records will be smashed.
I think inverting the data loading so that it's working on a batch of images doesn't solve the bandwidth limitation. It still uses the data just once. I could be wrong or misunderstanding.
"working on a batch of images doesn't solve the bandwidth limitation. "
It depends on the size of the local store. If the filter weights fit on the chip, it helps; if they don't, it doesn't. (For scale: a 16-core Epiphany-III has 32KB of local memory per core, 512KB across the chip, so the 64KB kernel from the earlier example fits, while ~60MB of recognition-network weights clearly does not.) However, if the API is there to support loading the weights (on any device that can), it would allow, say, an OpenCL implementation to transfer them to device memory (or whatever the equivalent is).
Somewhere between a fully-fledged neural-net library and the existing convolution functions in pal/image - would there be any elements that are a good fit for the PAL library?
imagine the following function:
a 3D x 3D -> 2D convolution, with bias, clamped output for 'ReLU' (supply a minimum, e.g. zero; or -1; or -FLT_MAX for no effect), and optional max-pooling (N=1,2,3,...; N=1 for no pooling) to reduce the output image size.
This would be a big chunk of the basic layer evaluation of the deep-learning image-recognition algorithms. You'd invoke multiple 3D x 3D -> 2D convolutions for a 3D result.
It would be important to include the ReLU & max-pooling, since this would avoid significant memory traffic. You could provide a helper function for a 3D x 3D -> 2D convolution without those steps that just calls it with (min=-FLT_MAX, pooling=1)... or have an outright separate function if needed.
An input could be (width x height x channels) image planes, or a true 3D image (volume data).
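As a concrete sketch of that signature, loosely following PAL's p_*_f32 naming style. The name p_conv3d_f32 and its parameters are invented here, not an existing PAL call:

```c
#include <float.h>  /* FLT_MAX */

/* Hypothetical: 3D input convolved with a 3D kernel -> one 2D output plane,
 * with fused bias, clamping (ReLU when min=0) and optional max-pooling. */
void p_conv3d_f32(const float *x,        /* input, cols x rows x chans            */
                  const float *h,        /* kernel, kcols x krows x chans         */
                  float bias,            /* bias added to this output plane       */
                  float min, float max,  /* clamp; -FLT_MAX / FLT_MAX = no effect */
                  int pool,              /* max-pool factor; 1 = no pooling       */
                  float *r,              /* output 2D plane                       */
                  int cols, int rows, int chans,
                  int kcols, int krows);
```

Invoking it once per kernel slice would then build up the 3D result mentioned above.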
Then imagine functions for training such a thing (backpropagation, and accumulating error deltas through the weights).
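A hedged sketch of what those training entry points might look like, mirroring the hypothetical forward call above (again, invented names, not existing PAL functions):

```c
/* propagate the output-plane error back to the input volume */
void p_conv3d_backprop_data_f32(const float *h,   /* kernel                      */
                                const float *dr,  /* error from the output plane */
                                float *dx,        /* accumulated input error     */
                                int cols, int rows, int chans,
                                int kcols, int krows);

/* accumulate error deltas into the weight gradients */
void p_conv3d_backprop_weights_f32(const float *x,   /* forward-pass input        */
                                   const float *dr,  /* error from the output     */
                                   float *dh,        /* accumulated weight deltas */
                                   int cols, int rows, int chans,
                                   int kcols, int krows);
```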
I think a single step like that would go a long way toward leveraging the Epiphany hardware; you'd have a lot of data reuse, perhaps uploading an entire 3D filter across multiple cores, then streaming an image through it.
This would be a stepping stone to a full neural-net library which could implement pipelines between net layers. Getting some capability into the PAL library might make the Epiphany chip more appealing to neural-net/deep-learning researchers.
Short of that, are there other ways to generalize 2D convolutions to be more useful?
e.g. if the 3rd dimension was interleaved (e.g. [row0 [r0,g0,b0, r1,g1,b1, ...] row1 [r0,g0,b0, r1,g1,b1, ...] ...]), could you treat it as a 2D convolution with strided input (merely adding 'col_step' / 'row_step' parameters, e.g. col_step=3 for r,g,b input)? This would still require the insertion of a clamping & max-pool stage into your 2D convolution, and again, if worried about parameter explosion, a simple helper could provide a streamlined interface. Thresholding/clamping is fairly common in image processing, I think (e.g. extracting certain edges from an image, blurring highlights, keeping results in an output range for bit reduction, etc.). Stepped inputs/outputs would also allow using this function for filtered image downscaling, or perhaps colour-space conversions. Such a '2d convolution, extended' is sketched below.
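A minimal sketch, again with an invented name (p_conv2d_ext_f32 is not an existing PAL function):

```c
#include <float.h>  /* FLT_MAX */

/* Hypothetical "2d convolution, extended": strided input/output plus a fused
 * clamp and max-pool stage. */
void p_conv2d_ext_f32(const float *x, int cols, int rows,
                      int col_step, int row_step,  /* e.g. col_step=3 for interleaved r,g,b */
                      const float *h, int kcols, int krows,
                      float *r, int out_col_step, int out_row_step,
                      float min, float max,        /* clamp; -FLT_MAX / FLT_MAX = no effect */
                      int pool);                   /* max-pool factor; 1 = no pooling */
```

With col_step / out_col_step set appropriately, this one entry point would cover interleaved-channel filtering, filtered downscaling, and the colour-space-conversion case above.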
(Also a minor point: I would have thought it more logical to order 'cols, rows' as per width/height for images stored in memory; you can still label them rows/cols for people thinking about it as a matrix in the linear-algebra sense.)