gonum / gonum

Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more.
https://www.gonum.org/
BSD 3-Clause "New" or "Revised" License

blas,mat: consider changes necessary to allow CUDA/OpenCL backend #138

Open kortschak opened 7 years ago

kortschak commented 7 years ago

This is something that has been discussed on and off over the past few years, but we should probably devise an approach before we get too far along.

There are CUDA/OpenCL BLAS implementations that we could use, but we would likely lose much of the benefit of the GPU if, for each matrix operation, we held the matrix in system memory, transferred it to the card, performed the operation, and copied the result back. Ideally, we would move the data across once, perform multiple operations, and only return the results to the host when complete.

The upshot of that is that we want a structure similar to the blas handle-like mat fields in the mat package, but with knowledge of where the data is actually held and whether it is dirty. User access to the data would wait on GPU completion if the data were being operated on by the GPU, and non-GPU writes (including Set* method calls) would mark the data as dirty (after waiting on any GPU operations and copying the data back from the card).
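To make the wait/dirty semantics above concrete, here is a minimal sketch in Go. Everything here is hypothetical (gpuDense, syncLocked, and the elided transfer calls are illustrative names, not gonum API); the point is only to show host reads and writes blocking on device work and flagging the host copy as dirty.

```go
package main

import (
	"fmt"
	"sync"
)

// gpuDense is a hypothetical dense matrix that knows where its data
// lives. A real backend would also hold a device buffer handle.
type gpuDense struct {
	mu         sync.Mutex
	rows, cols int
	data       []float64       // host copy of the data
	onGPU      bool            // authoritative copy currently on the device
	dirty      bool            // host copy modified since last upload
	pending    *sync.WaitGroup // outstanding device operations, if any
}

// At waits on any in-flight GPU work, pulls the data back to the host
// if necessary, and then reads the element.
func (m *gpuDense) At(i, j int) float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.syncLocked()
	return m.data[i*m.cols+j]
}

// Set performs the same wait/download, then marks the host copy dirty
// so the next device operation knows to re-upload.
func (m *gpuDense) Set(i, j int, v float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.syncLocked()
	m.data[i*m.cols+j] = v
	m.dirty = true
}

// syncLocked blocks on pending kernels and downloads the device copy.
// The actual transfer is elided in this sketch.
func (m *gpuDense) syncLocked() {
	if m.pending != nil {
		m.pending.Wait()
		m.pending = nil
	}
	if m.onGPU {
		// clEnqueueReadBuffer / cudaMemcpy would happen here.
		m.onGPU = false
	}
}

func main() {
	m := &gpuDense{rows: 2, cols: 3, data: make([]float64, 6)}
	m.Set(1, 2, 7)
	fmt.Println(m.At(1, 2), m.dirty) // 7 true
}
```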

This set of requirements is inconsistent with the current dumb types in blas64 that make up the mat fields in the mat package. What seems to be needed is a change from mat blas64.T to mat blas64.I, plus the addition of the necessary methods. With that, a GPU-aware blas package could be built. I'm concerned, though, about the depth of interface layering that we are building up.

btracey commented 7 years ago

What seems to be needed is a change from mat blas64.T to mat blas64.I and addition of necessary methods.

I do not understand.

kortschak commented 7 years ago

We make the field an interface value rather than a concrete value. Off the top of my head, the simplest way to do it is to make each of the mat fields the appropriate Raw* interface and have the data types implement those interfaces. That way, a CUDA/OpenCL implementation can wrap the blas64 data type and hold the relevant handles and dirty flags.
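A sketch of that wrapping idea, with a simplified local stand-in for blas64.General so the example is self-contained (General, RawMatrixer, and clDense here are illustrative, not the actual gonum declarations):

```go
package main

import "fmt"

// General is a simplified local stand-in for blas64.General.
type General struct {
	Rows, Cols, Stride int
	Data               []float64
}

// RawMatrixer plays the role of the proposed interface-valued field:
// any type that can expose its underlying blas64-style matrix.
type RawMatrixer interface {
	RawMatrix() General
}

// The plain host type trivially satisfies the interface.
func (g General) RawMatrix() General { return g }

// clDense is a hypothetical OpenCL-backed wrapper. It embeds the host
// data and would additionally hold device handles and a dirty flag.
type clDense struct {
	mat   General
	dirty bool
	// buf cl.Mem // device buffer in a real implementation
}

func (m *clDense) RawMatrix() General {
	// A real implementation would first wait on outstanding kernels
	// and download the device copy if it were authoritative.
	return m.mat
}

func main() {
	g := General{Rows: 1, Cols: 2, Stride: 2, Data: []float64{1, 2}}
	// Both the plain type and the GPU wrapper satisfy the interface,
	// so mat-level code would not need to know which it holds.
	for _, m := range []RawMatrixer{g, &clDense{mat: g}} {
		fmt.Println(m.RawMatrix().Cols) // 2, twice
	}
}
```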

kortschak commented 7 years ago

Having slept on this, I don't think this should happen in the blas64 interface layer, but rather at the level at which the mat package operates, though in a separate package. The logic required to keep the implementation and the CUDA/OpenCL blas handle consistent would add too much weight, both in maintenance burden and in performance cost for non-GPU users.

So the new suggestion is to add a cuda and/or opencl package that implements (in the first instance) a mat.Matrix based on blas64.General. This implementation would hold the data location and dirtiness status, and be a mat.RawMatrixer to facilitate performant interop with mat. Initially, as a PoC, it would implement Add, Mul, MulElem and Sub.
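A rough sketch of what such a PoC type might look like, with the blas64.General shape reproduced locally and a host loop standing in for an enqueued kernel (Dense, sync, and onDevice are illustrative names under those assumptions, not a proposed API):

```go
package main

import "fmt"

// RawMatrix mirrors the shape of blas64.General for this sketch.
type RawMatrix struct {
	Rows, Cols, Stride int
	Data               []float64
}

// Dense is a hypothetical GPU-backed matrix: results stay on the
// device between operations and are only copied back on host access.
type Dense struct {
	mat      RawMatrix
	onDevice bool // result currently lives on the GPU
}

func (d *Dense) Dims() (r, c int) { return d.mat.Rows, d.mat.Cols }

// At syncs from the device before any host read.
func (d *Dense) At(i, j int) float64 {
	d.sync()
	return d.mat.Data[i*d.mat.Stride+j]
}

// RawMatrix exposes the host data, RawMatrixer-style, for cheap
// interop with host code.
func (d *Dense) RawMatrix() RawMatrix {
	d.sync()
	return d.mat
}

func (d *Dense) sync() {
	if d.onDevice {
		// A device-to-host copy would happen here.
		d.onDevice = false
	}
}

// Add would enqueue an element-wise kernel and leave the result
// on-device, so chained operations avoid host round trips.
func (d *Dense) Add(a, b *Dense) {
	r, c := a.Dims()
	d.mat = RawMatrix{Rows: r, Cols: c, Stride: c, Data: make([]float64, r*c)}
	for i := range d.mat.Data { // host fallback in place of the kernel
		d.mat.Data[i] = a.mat.Data[i] + b.mat.Data[i]
	}
	d.onDevice = true
}

func main() {
	a := &Dense{mat: RawMatrix{Rows: 2, Cols: 2, Stride: 2, Data: []float64{1, 2, 3, 4}}}
	b := &Dense{mat: RawMatrix{Rows: 2, Cols: 2, Stride: 2, Data: []float64{4, 3, 2, 1}}}
	var c Dense
	c.Add(a, b)
	fmt.Println(c.At(0, 0), c.At(1, 1)) // 5 5
}
```

The design choice worth noting is that Add marks the result as on-device rather than copying back eagerly, which is what lets a chain like Add-then-Mul stay on the card.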

One issue that needs addressing for this work is that CUDA (at least) takes column-major matrices, so the work in https://github.com/gonum/lapack/pull/69, or something like it, would need to be merged first.
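For reference, the layout mismatch amounts to a transpose on transfer. Gonum's blas64 types are row-major, while cuBLAS and clBLAS expect column-major; a naive bridge would remap indices like this (rowToColMajor is an illustrative helper, not gonum API):

```go
package main

import "fmt"

// rowToColMajor copies a rows x cols row-major matrix into
// column-major order, as a column-major BLAS would expect.
func rowToColMajor(data []float64, rows, cols int) []float64 {
	out := make([]float64, len(data))
	for i := 0; i < rows; i++ {
		for j := 0; j < cols; j++ {
			// element (i,j): row-major index i*cols+j,
			// column-major index j*rows+i.
			out[j*rows+i] = data[i*cols+j]
		}
	}
	return out
}

func main() {
	// 2x3 row-major matrix [[1 2 3] [4 5 6]].
	fmt.Println(rowToColMajor([]float64{1, 2, 3, 4, 5, 6}, 2, 3))
	// [1 4 2 5 3 6]
}
```

In practice the copy can often be avoided entirely by exploiting the identity (AB)ᵀ = BᵀAᵀ, i.e. asking the column-major BLAS to compute the transposed product, which is one reason layout-aware plumbing like the linked PR matters.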

btracey commented 7 years ago

Could you write up some kind of proposal document? I understand that you've given a sketch here, but I don't understand how it works, and what the costs are (both real and in terms of complexity).

In particular, I don't understand which parts of this live in gonum and where, and how we can get away with the data living on the GPU. For instance, in something like a neural network, there are typically matrix multiplications to compute the prediction, and again to compute the gradient. However, the elements of these matrices are then updated by some kind of stochastic gradient descent. Would we also have to modify gonum/optimize (or whatever future package we have) to perform these updates on the GPU? It seems we would need to move away from []float64 as the core type toward some kind of gonum.Vector.

The broader point is to see some examples of what would be required to do effective GPU computation, and how realistic that goal is within mat, as opposed to, say, providing some cgo to work with cublas, and let users optimize as normal.

kortschak commented 7 years ago

Could you write up some kind of proposal document? I understand that you've given a sketch here, but I don't understand how it works, and what the costs are (both real and in terms of complexity).

At this point, probably not. I need the code in #141 merged at a minimum, but that is useful more generally, so I don't see a major cost there. After that, I expect this to be a process of significant exploration. I have a good idea of the broad implementation outcomes I need, but the finer details are not completely clear to me, except that it is unlikely we will get a free drop-in for our current types everywhere in the suite. I think the description in your last paragraph is the most likely outcome, though even more constrained than the cgo case: I expect a complete parallel mat.Matrix implementation will be necessary, one that maintains the data, the implementation (per value, not per package, if I add more than one platform; initially I think I will only target OpenCL), and the accounting for data location.

btracey commented 7 years ago

My concern is that the abstractions may be too different to allow effortless GPU use from mat. That has at least been my understanding in other languages: one typically has to code differently for the two cases, and they aren't really interchangeable. If we can be different, that would be great, but I don't yet see it.

I think this divide is real with (general) sparse matrices. They have to (more or less) exist in their own abstractions, since in general dense/sparse operations just become dense. For sparse matrices, rather than figuring out how to shoehorn them into mat, we should identify the operations to support (linear programming, large-scale linear solves, etc.) and find the right interfaces to support both dense and sparse operations.

If we can make it a drop-in without making the code too complicated to reason about, then great. If not, I think we can provide GPU support without changing mat. I don't know which side of the line GPUs fall on.

kortschak commented 7 years ago

With the additions to blas I have partially completed, and the change to the vector API, I think I can go ahead with a PoC repo (I'll work in github.com/gonum/opencl as gonum.org/v1/opencl).

This work will not provide a drop-in replacement for use in gonum's client packages, but it will allow users to do GPU computation and, with minimal effort, to change our code to allow GPU computation in some cases.

btracey commented 7 years ago

Could you post this proposal in the mailing list? It's good to discuss in the more public space before new repos are created.

kortschak commented 7 years ago

This could potentially be one of two proposals.

  1. Proposal for an opencl package implementing an OpenCL-backed mat.Matrix with initial minimal arithmetic operations;
  2. Proposal for an exp repo where things like this can be done in an explicitly unstable way.

Or both.