gonum / blas

A BLAS implementation for Go [DEPRECATED]

blas: Existence of complex blas? #106

Closed mathDR closed 5 years ago

mathDR commented 9 years ago

Hi, is there a plan to implement complex versions of the level 1 through 3 BLAS functionality? I guess both a complex conjugate and a Hermitian operation would need to be implemented.

Thoughts on this?
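The distinction the question raises shows up already at level 1: the complex BLAS defines both an unconjugated dot product (zdotu) and a conjugated one (zdotc). A minimal sketch of the two in Go (function names here are illustrative, not gonum's API):

```go
package main

import (
	"fmt"
	"math/cmplx"
)

// dotu computes the unconjugated dot product sum(x[i]*y[i]),
// the operation BLAS calls zdotu.
func dotu(x, y []complex128) complex128 {
	var s complex128
	for i, v := range x {
		s += v * y[i]
	}
	return s
}

// dotc computes the conjugated dot product sum(conj(x[i])*y[i]),
// the operation BLAS calls zdotc.
func dotc(x, y []complex128) complex128 {
	var s complex128
	for i, v := range x {
		s += cmplx.Conj(v) * y[i]
	}
	return s
}

func main() {
	x := []complex128{1 + 2i}
	y := []complex128{3 + 4i}
	fmt.Println(dotu(x, y)) // (-5+10i)
	fmt.Println(dotc(x, y)) // (11-2i)
}
```

The same split runs through the higher levels (e.g. Zgemm's ConjTrans option and the Hermitian-matrix routines), which is why a plain float-to-complex translation is not quite mechanical.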

btracey commented 9 years ago

I believe the eventual goal is to support complex matrices (in what would be matrix/mat128 or something). When this happens, having a complex128 BLAS is desirable. However, I have no timeline in mind for this (not sure about others). I imagine we would accept PRs for Complex128 implementations in native (assuming they were well implemented and had tests). Any work on (object-oriented) complex matrices should be discussed on gonum-dev first.

Please note that the C-wrappers to the complex BLAS are functional.

kortschak commented 9 years ago

A sensible approach would be to leverage the effort @btracey has put into the native float64 implementation. To this end, the most desirable contribution would be a code-gen tool that works on native to generate the complex code from the float code.

btracey commented 9 years ago

Agreed. It would be especially nice in that the existing code is likely to be improved (for example, concurrency), and being able to modify complex128 with the new change automatically would be an important part of the development story.

mathDR commented 9 years ago

Okay, that sounds reasonable. Historically there have been separate parallel and sequential implementations (PBLAS for the BLAS and ScaLAPACK for LAPACK). So an incremental process would be to do a sequential implementation first, then a concurrent implementation behind a check: if the user's maximum number of goroutines is > 1, automatically switch to the concurrent implementation.
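The automatic switch described above could be sketched as a dispatch on runtime.GOMAXPROCS; the function names and size cutoff here are hypothetical, not gonum's API:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// serialDot is the plain sequential reference implementation.
func serialDot(x, y []float64) float64 {
	var s float64
	for i, v := range x {
		s += v * y[i]
	}
	return s
}

// dot takes the concurrent path only when more than one processor is
// available and the input is large enough to amortize goroutine
// overhead; otherwise it falls back to the serial code.
func dot(x, y []float64) float64 {
	const minPar = 1 << 12 // hypothetical cutover size
	p := runtime.GOMAXPROCS(0)
	if p < 2 || len(x) < minPar {
		return serialDot(x, y)
	}
	partial := make([]float64, p)
	chunk := (len(x) + p - 1) / p
	var wg sync.WaitGroup
	for w := 0; w < p; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > len(x) {
			hi = len(x)
		}
		if lo >= hi {
			break
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			partial[w] = serialDot(x[lo:hi], y[lo:hi])
		}(w, lo, hi)
	}
	wg.Wait()
	var s float64
	for _, v := range partial {
		s += v
	}
	return s
}

func main() {
	x := make([]float64, 10000)
	y := make([]float64, 10000)
	for i := range x {
		x[i], y[i] = 1, 2
	}
	fmt.Println(dot(x, y)) // 20000
}
```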


btracey commented 9 years ago

ScaLAPACK is designed for use with MPI. To use ScaLAPACK, a special compiler is needed, and the program needs to be designed with the single program, multiple data (SPMD) model in mind. Go already has built-in communication primitives that are very cheap. native.Dgemm is already implemented concurrently. Of course, the Go version only works on shared-memory machines, whereas ScaLAPACK can work on large clusters. Distributed-memory computation is outside the scope of gonum (though of course it is possible, and our current implementations provide a basis for it).

mathDR commented 9 years ago

Agreed. ScaLAPACK was just an example. Basically my concern is that if goroutines are being spawned and the user has a single-core, single-threaded processor, they will pay the overhead of the goroutines without any benefit.

Of course, with today's computers being what they are, this may not be a concern.


btracey commented 9 years ago

Nope. If you look at the implementation of Dgemm, the number of workers is equal to GOMAXPROCS.

mathDR commented 9 years ago

Okay GREAT! This is what I was hoping for. Sorry, I haven't got to Dgemm yet...


kortschak commented 9 years ago

@btracey, we might want to have a blas global that is used to define parallelism in dgemm. Something that is ignored when it's zero. What do you think?

btracey commented 9 years ago

Right now we allocate one goroutine per processor (GOMAXPROCS workers) and perform a block-based matrix multiplication. From what I've read, this block-based approach is faster than the naive algorithm even in serial. If Go issue 9129 is fixed, then cache-oblivious algorithms will be supported, and at least on Dmitry's blog, that implementation is also faster in serial than the naive algorithm. At that point, I'm not sure what such a global would gain us, although I suppose we could implement both a goroutine-based recursive version and a serial recursive version.
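The shape of the scheme described above, with one worker per processor each owning a contiguous block of the result, can be sketched as follows; this is an illustrative row-blocked multiply, not gonum's actual Dgemm kernel:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// matMul computes C = A*B for n×n row-major matrices, handing a
// contiguous block of rows of C to each of GOMAXPROCS workers.
// Workers never write overlapping rows, so no locking is needed.
func matMul(c, a, b []float64, n int) {
	p := runtime.GOMAXPROCS(0)
	rows := (n + p - 1) / p // rows per worker, rounded up
	var wg sync.WaitGroup
	for lo := 0; lo < n; lo += rows {
		hi := lo + rows
		if hi > n {
			hi = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			// ikj loop order keeps the inner loop streaming
			// through contiguous rows of B and C.
			for i := lo; i < hi; i++ {
				for k := 0; k < n; k++ {
					aik := a[i*n+k]
					for j := 0; j < n; j++ {
						c[i*n+j] += aik * b[k*n+j]
					}
				}
			}
		}(lo, hi)
	}
	wg.Wait()
}

func main() {
	n := 2
	a := []float64{1, 2, 3, 4}
	b := []float64{5, 6, 7, 8}
	c := make([]float64, n*n)
	matMul(c, a, b, n)
	fmt.Println(c) // [19 22 43 50]
}
```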

kortschak commented 9 years ago

Clients may not want to allocate all GOMAXPROCS threads to Dgemm. The global would allow that.

btracey commented 9 years ago

That I understand. The Accelerate framework on OS X provides that ability, though it limits the number of threads, which are more expensive than goroutines.

If the global flag is a binary switch on concurrency (allow/don't allow), then this is something we can implement. With the current implementation of Dgemm, we can also respect such a limit by capping the number of workers. However, it seems like the future is for gc to support efficient cache-oblivious parallel code. If we move to such an implementation (you can see a sketch of one at https://groups.google.com/forum/#!searchin/golang-nuts/matrix$20multiplication/golang-nuts/RVX-BhHcugM/Pi6Jlj2OPNgJ), then we either cannot enforce such a limit, or we hurt scalability by introducing a central mechanism. Additionally, as programs grow in size, such a limit seems meaningless in practice (there may be all sorts of concurrent calls below you). As the user, all you can really trust is that the Go scheduler will try its best.
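The binary switch discussed here could look something like the following; the variable `Concurrent` and helper `workers` are hypothetical, not part of gonum/blas:

```go
package main

import (
	"fmt"
	"runtime"
)

// Concurrent is a hypothetical package-level switch of the kind
// proposed above: when false, routines take the sequential path
// regardless of GOMAXPROCS.
var Concurrent = true

// workers reports how many goroutines a routine should spawn.
func workers() int {
	if !Concurrent {
		return 1
	}
	return runtime.GOMAXPROCS(0)
}

func main() {
	fmt.Println(workers() >= 1) // true: at least one worker either way
	Concurrent = false
	fmt.Println(workers()) // 1: forced sequential
}
```

As the comment notes, a simple global like this works for a worker-pool Dgemm, but it becomes a central bottleneck (or unenforceable) under a recursive cache-oblivious design.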

kortschak commented 9 years ago

That sketch has exactly a threshold on the number of workers.

btracey commented 9 years ago

No, Threshold specifies the smallest matrix to divide. The base case for the recursion is when all of the dimensions are small (< Threshold), so you just do the matrix multiplication directly.

kortschak commented 9 years ago

Yes, but you can use a threshold on recursion depth for concurrent operation.

btracey commented 9 years ago

That's true.

btracey commented 9 years ago

It would be interesting to try that for Dgemm. I tried the cache-oblivious version, but it was way slower than the implementation we have now. I did not try it with a recursion depth limit, which would reduce the cost significantly.
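The depth-limited recursion discussed above bounds goroutine fan-out at 2^maxDepth while the size threshold still controls the base case. A Dgemm would split matrix dimensions the same way; recursive summation stands in here to keep the sketch short, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// sum recursively halves the slice. A new goroutine is spawned for
// one half only while depth > 0, so concurrency stops after a fixed
// number of levels and the rest of the recursion runs sequentially.
func sum(x []float64, depth int) float64 {
	const threshold = 4 // base-case size, per the Threshold discussion
	if len(x) <= threshold {
		var s float64
		for _, v := range x {
			s += v
		}
		return s
	}
	mid := len(x) / 2
	if depth <= 0 {
		// Past the depth limit: plain sequential recursion.
		return sum(x[:mid], 0) + sum(x[mid:], 0)
	}
	var left float64
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = sum(x[:mid], depth-1)
	}()
	right := sum(x[mid:], depth-1)
	wg.Wait()
	return left + right
}

func main() {
	x := make([]float64, 100)
	for i := range x {
		x[i] = 1
	}
	fmt.Println(sum(x, 2)) // 100
}
```

With depth 2 this spawns at most three extra goroutines, which is how a depth limit would cut the goroutine-creation cost mentioned above without abandoning the recursive structure.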

vladimir-ch commented 5 years ago

Closing because this repository is no longer maintained. Development has moved to https://github.com/gonum/gonum.

Successor issues in the new repository: https://github.com/gonum/gonum/issues/184 https://github.com/gonum/gonum/issues/185