rainwoodman / vast

vala and scientific numerical computation

OpenCL support #41

Open arteymix opened 7 years ago

arteymix commented 7 years ago

Here's what I expect to be a typical usage of OpenCL within Vast:

  1. allocate a cl_mem buffer on the GPU
  2. map the memory and wrap a Vast.Array around it with GLib.Bytes
  3. fill in the data
  4. create an OpenCL context and program
  5. compile a kernel against cl/vast.h containing basic definitions and manipulations (all inline)
  6. pass the buffer to the kernel and execute it
  7. retrieve the computed data (or keep it there for another computation)

We can even write the kernel in Vala using --nostdpkg, although C is probably more appropriate. In either case, libvala is capable of generating code for our routines.

All we need to do is settle on an ABI for the memory layout of the array, not necessarily for its metadata, since we will have to copy that separately either way. I think we have something nice right now, so I wouldn't change anything.

The other issue is types: we have to use types that are compatible with OpenCL kernel types, which should be addressed in Numeric-GLib by providing GType definitions.

From an OpenCL perspective, we would only need inline routines in a header file to operate on the array memory. No need to manipulate views, just operate on the data consistently.

I'm working on that, and also on gocl, which we should frankly use: it's a nice piece of software.

arteymix commented 7 years ago

Just to come back to #6: we will have separate CPU-vectorized and OpenCL operations. While the former are proper introspectable operations, the latter need a full context and compiler at runtime, so they will more likely live in a specialized GraphExecutor.

I also think this should wait, but we still have to focus on a sufficiently general API to make it possible.

arteymix commented 7 years ago

@rainwoodman

Experiments are progressing quite fast in the wip/arteymix/gocl branch, and I've pushed a couple of commits upstream to get the feature we need. I almost have a working example of passing a Vast array to the GPU and computing on it.

I'll open a PR so that you can give me some review :)

arteymix commented 7 years ago

I think it would be appropriate to subclass Operation with ClOperation and support asynchronous invocation. The graph executor would just have to map tensors into memory and schedule the calls with the GLib main loop. It's already all integrated in gocl.

More generally, given a Gocl.Context and a Gocl.Device, it will be possible to call an arbitrary operation on the GPU just like we do for introspected code right now.

Gocl.Context context;
Gocl.Device device;

/* initialize context and device */

var operation = new Vast.ClOperation (context, device, File.new_for_path ("path/to/kernel/source.cl"));

operation.set_buffer_retriever ((arr) => { return cached_buffers[arr]; });

operation.invokev ({x, z});

Ideally we would map the required tensors in advance so that chaining operations would not move data.

rainwoodman commented 7 years ago

What does the buffer retriever do?

arteymix commented 7 years ago

Only the graph executor knows how Vast.Tensor instances are mapped to Gocl.Buffer, so it must have some way to fetch back the buffer for any array passed to invokev; otherwise it would have to copy data systematically.

Maybe passing only a weak reference to the GoclGraphExecutor into the GoclOperation would be better, but I'll see about that once we have a fully working graph.

rainwoodman commented 7 years ago

OK. An unrelated issue -- what about operations with a variable number of arguments? It is useful in some cases: for example, my model can be 2D or 3D, and the velocity along each dimension has to be calculated separately. I have run into this problem in the Python stuff I was doing at work.

arteymix commented 7 years ago

Couldn't you map a single buffer and perform multiple operations on different views?

Ideally the graph executor would just send the work to the device, which could then perform the operations simultaneously. I'll update the Operation API to deal with async stuff and error conditions, and push the rest as soon as it works.

arteymix commented 7 years ago

Operations with a variable number of arguments should work, but you would have to define them yourself. I think one can easily subclass an existing operation and perform the computation conditionally on the arguments given.


public class CustomOperation : ClOperation {

    public override void invokev (Tensor[] tensors)
    {
        if (tensors.length == 3) {        // 2 in and 1 out
            base.invokev (tensors);
        } else if (tensors.length == 4) { // 3 in and 1 out
            // dispatch a kernel variant that takes three inputs
        } else {
            assert_not_reached ();
        }
    }
}