arteymix opened 7 years ago
Just to come back on #6, we will have separate CPU-vectorized and OpenCL operations. While the former have proper introspectable operations, the latter need a full context and compiler at runtime, so it will more likely be a specialized `GraphExecutor`.
I also think this should wait, but we still have to focus on a sufficiently general API to make it possible.
@rainwoodman
Experiments are progressing quite fast in the wip/arteymix/gocl branch, and I've pushed a couple of commits upstream to get the features we need. I almost have a working example of passing a Vast array to the GPU and computing on it.
I'll open a PR so that you can give me some review :)
I think it would be appropriate to subclass `Operation` with `ClOperation` and support asynchronous invocation. The graph executor would just have to map tensors into memory and schedule the calls with the GLib main loop. It's already all integrated in gocl.
More generally, given a `Gocl.Context` and a `Gocl.Device`, it will be possible to call an arbitrary operation on the GPU just like we do for introspected code right now.
```vala
Gocl.Context context;
Gocl.Device device;

/* initialize context and device */

var operation = new Vast.ClOperation (context, device,
                                      File.new_for_path ("path/to/kernel/source.cl"));
operation.set_buffer_retriever ((arr) => { return cached_buffers[arr]; });
operation.invokev ({x, z});
```
Ideally we would map the required tensors in advance so that chaining operations would not move data.
What does the buffer retriever do?
Only the graph executor knows how `Vast.Tensor` instances are mapped to `Gocl.Buffer` objects, so it must have a way to fetch back the buffer for any array passed to `invokev`; otherwise it would have to copy systematically.
Maybe only passing a weak reference of the `GoclGraphExecutor` to the `GoclOperation` would be better, but I'll see about that once we have a fully working graph.
OK. An unrelated issue -- what about operations with a variable number of arguments? It is useful in some cases; for example, my model can be 2D or 3D, and the velocity of each dimension has to be calculated separately. I have run into this problem in the Python stuff I was doing at work.
Couldn't you map a single buffer and perform multiple operations on different views?
Ideally the graph executor would just send the work to the device, which could then perform the operations simultaneously. I'll update the `Operation` API to deal with async stuff and error conditions and push the rest as soon as it works.
Operations with a variable number of arguments should work, but you would have to define them yourself. I think one can easily subclass an existing operation and perform computation conditionally on the arguments given.
```vala
public class CustomOperation : Vast.ClOperation {
    public override void invokev (Tensor[] tensors) {
        if (tensors.length == 3) {        // 2 in and 1 out
            base.invokev (tensors);
        } else if (tensors.length == 4) { // 3 in and 1 out
            // handle the three-input case separately
        }
    }
}
```
Here's what I expect to be a typical usage of OpenCL within Vast:
- a `cl_mem` buffer on the GPU
- a `Vast.Array` around the memory with `GLib.Bytes`
- a `cl/vast.h` header containing basic definitions and manipulations (all inline)

We can even write the kernel using `--nostdpkg` from Vala, although C is probably more appropriate. In either case, libvala is capable of generating code for our routines.

All we need to do is settle on an ABI for the memory layout of the array, not necessarily its metadata, since we will have to copy that separately either way. I think we have something nice right now, so I wouldn't change anything.
The other issue is types: we have to use types that are compatible with OpenCL kernel types, which should be addressed in Numeric-GLib by providing GType definitions.
From an OpenCL perspective, we would only need inline routines in a header file to operate on the array memory. No need to manipulate views, just operate on the data consistently.
I'm working on that, and also on gocl, which frankly we should use; it's a nice piece of software.