aolofsson opened this issue 9 years ago
I am interested. Will look at it for a while (not sure if I have the expertise required). Also, I have one doubt: now that Brown Deer has implemented (or is implementing) the MPI library, wouldn't it be possible to port a standard MPI implementation of BLAS? (Sounds easy too, doesn't it? :) )
Great!! The BDT implementation of MPI does not offer full support for the standard, so that would depend. Definitely worth a try though to see what breaks:-)
I'm interested in giving the BLIS route a shot as well, though I'm similarly inexperienced at these things. Writing a kernel seems relatively straightforward, but we'd want to spread out the work over the cores as well, I'd think. Aside from that, I imagine it might be tricky to get BLIS to use the Epiphany cores to their full potential (keeping in mind the grid network with its variable latency to different areas of memory). If nothing else, though, BLIS seems like a better base than starting entirely from scratch.
I noticed BLIS also supports OpenMP parallelisation, which might be an easy avenue for that, using the OMPi compiler for Epiphany.
So far I've mostly looked at a basic strategy for going about implementing linear algebra subroutines, and I thought I'd share the ideas I've had so far:
I've looked into how BLIS is put together and what the matrix kernels should do. There's a fair number of kernels that may be useful to implement, but they claim (and I trust them :)) that the gemm micro-kernel (a small matrix multiply-accumulate) and possibly the gemmtrsm micro-kernel (a fused multiply and triangular solve) are the interesting ones. The documentation is quite good, and this shouldn't be a huge hassle to get running.
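To make that concrete, here's a minimal sketch of what a BLIS-style gemm micro-kernel computes. The function name and exact signature are illustrative, not BLIS's real interface: the idea is just that A and B arrive as packed panels and the kernel does a sequence of rank-1 updates into a small register block, then scales and writes back to C.

```c
#include <stddef.h>

/* Illustrative sketch (not BLIS's actual signature): computes
 * C := beta*C + alpha*A*B, where A is an MR x k packed panel and B is
 * a k x NR packed panel. MR = NR = 4 would match a 4x4 Epiphany kernel. */
enum { MR = 4, NR = 4 };

void gemm_ukernel_ref(size_t k, float alpha, const float *a, const float *b,
                      float beta, float *c, size_t rs_c, size_t cs_c)
{
    float ab[MR][NR] = {{0.0f}};

    /* Rank-1 update per iteration: each step consumes one column of A
     * and one row of B, accumulating into the MR x NR block. An
     * optimized kernel would keep ab[][] entirely in registers. */
    for (size_t l = 0; l < k; ++l)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                ab[i][j] += a[l*MR + i] * b[l*NR + j];

    /* Scale and write back to C with arbitrary row/column strides,
     * so the same kernel serves row- and column-major storage. */
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i*rs_c + j*cs_c] = beta * c[i*rs_c + j*cs_c] + alpha * ab[i][j];
}
```

The macro-kernels then just loop this over cache-sized blocks of the full matrices, which is why optimizing this one routine buys so much.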
Above these kernels there are so-called macro-kernels that basically form outer for loops around the kernel. While there is support for multiprocessing via either pthreads or OpenMP, this doesn't seem like a great model for the Parallella, because the way it's put together kind of assumes the kernels run on the same arch as the bookkeeping code, which isn't going to be the case here.
I've compiled BLIS as a whole for the Epiphany arch, and this was fairly trivial to do (I used the reference implementation config and typed 'make CC=e-gcc', basically). However, the resulting library, for the Epiphany arch only, clocks in at 12 megabytes, which doesn't seem in line with the 'small code footprint' goal here. :)
Based on this, I think our best option here would be to basically make a pretty radical fork of BLIS, with the goal of restructuring it in a manner more friendly (though not limited) to the Parallella set-up -- make the kernel (and possibly some additional bits) more able to run as a separate entity, possibly on a different architecture, then shuffle the data around in shared memory more explicitly. There's a lot of ideas we could use in BLIS, but they don't seem like they /quite/ fit our mold.
I'm curious to see what others think, particularly because I don't have a lot of experience with either bigger software projects, numerical codes or parallel processing.
P.S. While the kernel operations are tractable enough and could be useful for the rest of the PAL project, including parallelism in particular seems like a big enough (and different enough) project that it might make more sense to separate it from the rest of PAL. It seems like it'd work better as its own library, with its own API (like BLAS/BLIS, basically). I'm curious to see what you guys have to say about this as well.
Dear all,
I would be very interested in contributing. Is there any progress yet?
Miguel,
How long do we need to keep the pub open? :-)
Are you starting to create the config file and doing the 4x4 multiply routine?
From the BLIS FAQ: "Yes. In order to achieve high performance, BLIS requires that hand-coded kernels and micro-kernels be written and referenced in a valid BLIS configuration (https://code.google.com/p/blis/wiki/ConfigurationHowTo). These components are usually written by developers and then included within BLIS for use by others. If high performance is not important, then you can always build the reference implementation on any hardware platform. The reference implementation does not contain any machine-specific code and thus should be very portable."
From BLIS retreat: http://www.cs.utexas.edu/users/flame/BLISRetreat/BLISRetreatTalks/Fran_BLIS_Retreat.pdf
Goal is to have this ready by fall or bust!
Andreas
On Tue, Jul 14, 2015 at 8:10 AM, MiguelTasende notifications@github.com wrote:
Working on it...
Hope to update soon (don't close the pub yet, please :) ).
@roysmeding Sorry for not responding earlier! Although the 12MB is a concern, the real test is the size of each function (like SGEMM). It's a big library, but most applications only need a small fraction of the functions.
As a basic building block, we need a very fast optimized linear algebra call that runs single threaded. The parallel framework will be built on top of this basic building block.
A great starting point is BLIS from the University of Texas:
https://github.com/flame/blis
Major tasks:
- Create the optimized assembly macro needed at the base (basically a 4x4 matrix multiply)
- Run BLIS through the epiphany tool chain to create the library. (sounds easy doesn't it...)
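For reference, a plain-C version of that base 4x4 multiply might look like the sketch below. The function name is hypothetical; the point is that this is the semantics the hand-written Epiphany assembly would need to match, with the loops fully unrolled and all 16 accumulators held in registers to keep the fused multiply-add unit busy every cycle.

```c
/* Hypothetical reference for the assembly macro: a 4x4
 * multiply-accumulate, C += A*B, on contiguous row-major 4x4 blocks. */
void matmul4x4_acc(const float a[16], const float b[16], float c[16])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = c[4*i + j];     /* accumulate into existing C */
            for (int l = 0; l < 4; ++l)
                acc += a[4*i + l] * b[4*l + j];
            c[4*i + j] = acc;
        }
}
```

An optimized Epiphany version would replace the inner loops with 64 unrolled multiply-adds, which is exactly the kind of macro BLIS expects a per-architecture configuration to supply.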