nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

Surface OpenCL Bindings, CUDA Bindings, or V8 GL to Node. GPU Accelerated Node. #18423

Closed TheLarkInn closed 6 years ago

TheLarkInn commented 6 years ago

I had slacked on submitting this request a few weeks ago, but I'd love to see bindings for GPU-accelerated computing surfaced to Node.js. Right now the story for GPU processing is pretty much non-existent, and I'd love to find a way to bring the power of SIMT, CUDA thread processing, and other GPU-specific optimizations to Node.js. Things like hashing, graph traversals, and complex searches would all benefit if exposed through such bindings.

@jasnell asked I bring this up in an issue again so we could see if partners could come together to help collaborate on a set of bindings.

6-8-axnw1bom81v5xa3nh48c commented 6 years ago

I filed this feature request three months ago asking for GPU acceleration in Node.js and it was rejected. Comment from @TimothyGu:

A quick Google search turns up headless-gl which uses ANGLE, so obviously not GPU-accelerated, but probably still faster than WebSocket etc.

My point was that once Node.js starts supporting the GPU natively, there would be tons of other useful stuff that could be implemented, like canvas 2D and 3D rendering contexts, but it seems that "these sorts of things are best left to userland", as @mscdex said.

Has anything changed in the meantime? Is GPU support now a good idea, since there are really no userland modules which provide that functionality?

devsnek commented 6 years ago

this pr is asking (i think?) for gpu acceleration to be used in node core, not for node to expose a gpu interface. i would agree with the general sentiment that things like canvas and cuda are best left to userland.

apapirovski commented 6 years ago

I feel like the request is pretty clear...

I'd love to find a way to bring the power of SIMT, CUDA thread processing, and other GPU-specific optimizations to Node.js. Things like hashing, graph traversals, and complex searches would all benefit if exposed through such bindings.

@6-8-axnw1bom81v5xa3nh48c your issue is about facilitating some canvas capabilities (image-related), which are wholly different from the request in this issue (to put it simplistically, leveraging the GPU's processing power).


I don't think it's unreasonable for Node to provide a native story for OpenCL. It does however seem like a decent undertaking, and I don't know if anyone here is wholly familiar with that area?

addaleax commented 6 years ago

This would be an awesome feature to have, yes :) I think the main questions (still) are:

ChALkeR commented 6 years ago

There were some OpenCL bindings and WebCL implementations in userland (and CUDA bindings, too), but they are not very popular (or actively developed) afaik. Still, this probably belongs in userland.

TheLarkInn commented 6 years ago

I'd be surprised if GPU vendors didn't have an interest in helping develop this. It would essentially enable a bajillion new developer hands messing with hardware acceleration. (I guess this is pretty speculative, though.)

I'm happy to provide a list of existing repos that once attempted to work on this, in addition to their browser-only counterparts (like GPU.js), if that is remotely useful. Ultimately it would be a bit of work to stand up a fully featured OpenCL binding, but maybe a simple set of primitives (parallelization, etc.) could be a good start.
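To make the "simple set of primitives" idea concrete, here is a minimal sketch of what such an API could look like from the JavaScript side; the gpu module name, the parallelMap() function, and its options are all hypothetical, purely for illustration:

// Hypothetical API sketch only: the "gpu" module and parallelMap() do not exist today.
const gpu = require('gpu');

// Map a pure function over a large typed array and let the binding decide
// whether to dispatch the work to the GPU or fall back to the CPU.
const input = new Float32Array(1e6).map((_, i) => i);

gpu.parallelMap(input, x => x * x, { fallback: 'cpu' })
  .then(result => {
    console.log(result[10]); // 100
  });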

Either way, ++10000 to all these responses.

this pr is asking (i think?) for gpu acceleration to be used in node core, not for node to expose a gpu interface. i would agree with the general sentiment that things like canvas and cuda are best left to userland.

Honestly why not both?

devsnek commented 6 years ago

if node makes an opencl core module (probably modeling webcl) i'm sure people would use it, i just worry about how we could make a "nodesque" interface around something that is traditionally very low level without making it too general to be useful.

kgryte commented 6 years ago

I fail to see why GPU bindings should be part of core. Core provides what is already needed to create bindings: native add-ons. If people want GPU bindings, those individuals can write and maintain those bindings themselves.

For core to support this, it would require a substantial investment for less than predominant use cases, particularly in terms of providing support for multiple GPU languages (e.g., CUDA, OpenCL, etc) and testing across multiple device types which will have varying quirks, levels of support, etc. Couple this with the expected churn introduced by Vulkan and independent work at the web standards level wrt WebGL 1, WebGL 2, and compute shaders and you have a veritable maintenance mess.

In the end, for this to be even remotely viable IMO, you would have to settle on particular language support (i.e., CUDA) and particular device support (Nvidia X, Y, and Z). Doing so will of course limit general applicability and will be specialized enough to question why it is part of core in the first place.

gireeshpunathil commented 6 years ago

One of the perceived consumability issues for userland native add-ons is the lack of (or incompatibility with) C++ build tools on the production box in the field: a compiled and bundled object is more reliable than one that is built on the fly. Example: an NVIDIA dev library may not be present on the client system, or may depend on other third-party libs. So I guess the strength of the use case should be the main criterion for deciding userland vs. core, as opposed to the usual suspect arguments.

jasnell commented 6 years ago

I would love to see experimentation on this move forward. The question of whether it should be in core or not can be determined later. For example, I started the work on the http2 impl long before it was decided whether it would be a core API and initially wrote it in a way that would allow it to be separated out if necessary. A similar approach could be taken here.

I can definitely think of a few amazing cases for offloading to GPU... Better crypto and rendering off the main thread are excellent examples. But before we get ahead of ourselves, let's get some experimentation going to see what's reasonably possible.

joyeecheung commented 6 years ago

I think utilizing GPU whenever possible in core would be nice, but providing and supporting a user-facing API would be much more difficult, basically because of what https://github.com/nodejs/node/issues/18423#issuecomment-361136309 said. cc @nodejs/build

kgryte commented 6 years ago

Rather than hand-waving arguments, I would prefer that when we speak of amazing cases we actually have some concrete examples. @jasnell If you can point to commonly used, highly portable GPU libraries for cryptography, I would be interested. I believe the reality, however, is that these are fewer and farther between than commonly believed, and the use of GPUs for cryptographic applications is still an area of active academic research. In fact, apart from brute-forcing cryptographic algorithms, a number of cryptographic algorithms are not easily parallelizable, making the GPU a less than ideal target.

I would also be interested in knowing what is meant by rendering off the main thread, the actual use cases for such rendering, and how this cannot already be achieved with workers/child processes coupled with a native add-on. Also note that rendering is more than likely to require OpenGL, not OpenCL or CUDA. So, if rendering is a target use case along with general-purpose compute, add another language, which will mean a significantly increased API surface area.
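(For reference, the workers/child-processes pattern mentioned above can be sketched as follows; gpu-addon is a hypothetical compiled add-on, not a real package.)

// worker.js - runs in a separate process so the heavy, blocking call never
// stalls the main event loop; "gpu-addon" is hypothetical.
const addon = require('gpu-addon');
process.on('message', ({ data }) => {
  const result = addon.compute(Float64Array.from(data)); // blocking is fine here
  process.send({ result: Array.from(result) });
});

// main.js - hands the data to the worker and receives the result asynchronously.
const { fork } = require('child_process');

const worker = fork('./worker.js');
worker.send({ data: [1, 2, 3, 4] });
worker.once('message', ({ result }) => {
  console.log('computed off the main thread:', result);
  worker.kill();
});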

In general, as many a front-end dev who has written applications targeting WebGL can attest, writing portable GPU code is hard. Any anointed bindings will have to contend with the very real and very frustrating process of creating stable, robust APIs. Obviously, if this was achieved, it would be awesome, and my work in scientific computing would benefit greatly. However, I recommend we be rather clear-eyed about the nature of GPGPU and the known, matter-of-fact use cases for which experimentation should be pursued.

Personally, if you want to make userland experimentation more viable in this regard, the biggest step IMO that core can take is to remove GYP and replace it with something "better". The reality is GYP is primarily oriented toward C/C++ toolchains, deprecated, and has bad/incomplete documentation. From my experience writing Node.js bindings to high-performance numerical computing libraries written in Fortran, GYP has to be shoehorned into doing anything not C/C++ (such as targeting non-C/C++ compilers like gfortran). Couple this with GYP being tied to Visual Studio on Windows and anything non-C/C++ on Windows is not possible (without considerable hacking), due to compiler support.

So, before Node decides to go down the long road of GPU bindings, I would suggest its priorities be focused on creating a solid enough and general enough toolchain (yes, N-API is a good start) that userland can experiment to its heart's content and core never has to concern itself with (the mess that is) GPU.

TheLarkInn commented 6 years ago

Personally, if you want to make userland experimentation more viable in this regard, the biggest step IMO that core can take is to remove GYP and replace it with something "better". The reality is GYP is primarily oriented toward C/C++ toolchains, deprecated, and has bad/incomplete documentation. From my experience writing Node.js bindings to high-performance numerical computing libraries written in Fortran, GYP has to be shoehorned into doing anything not C/C++ (such as targeting non-C/C++ compilers like gfortran). Couple this with GYP being tied to Visual Studio on Windows and anything non-C/C++ on Windows is not possible (without considerable hacking), due to compiler support.

For trying to get a stdlib, perhaps. But this thread is about the GPU.

Now, there are some hardware acceleration techniques that have piqued my interest. Here is one (I can cite the original document):

(Screenshot from the cited paper.) Essentially doing a JIT where code would be appropriately optimized for the GPU, then falling back to normal JS execution.

In terms of concerns over surfacing bindings, most developers are looking for parallelism solutions. I'd say something scoped to just this would be beneficial.

Disclaimer: there are probably more traditional examples of this (this one does seem rather esoterically scholarly), but it was the first one I stumbled upon that provided some inspiration.

bnoordhuis commented 6 years ago

@TheLarkInn @jasnell Can you either provide some actionables or move the discussion to e.g. the node-eps repo?

Essentially doing a JIT

The JIT compiler is V8 territory, not Node.js.

jasnell commented 6 years ago

The actionable is simple enough: just like we did with other things like http2, n-api, node-chakracore, etc, we can easily create a fork off master to experiment and work with this stuff. It might go nowhere, it might be the best thing ever... Who knows? It's still way too early to say, and any talk of landing it in core or not is premature until it can be demonstrated to work and be useful.

We can keep this issue open as a feature request and move forward in the other repo with specific code. Those who are interested in helping, ping me and I'll get a call together.

I would suggest its priorities be focused on creating a solid enough and general enough toolchain

Just keep in mind that this is not a zero-sum game. We can do multiple things at once.

Tiriel commented 6 years ago

@jasnell Just a quick note: for those of us who may not feel up to the task but still want to follow what's going on, would it be possible to get the test repo's link when it's ready?

cjihrig commented 6 years ago

Working on this in a fork of core implies that this will go into core. It should be done in userland unless it cannot be done in userland. I foresee this being added as an experimental feature because "it's better than the existing GPU support in core" which is literally nothing at all.

bnoordhuis commented 6 years ago

The actionable is simple enough: just like we did with other things like http2, n-api, node-chakracore, etc, we can easily create a fork off master to experiment and work with this stuff.

Those examples you name all have something in common that this feature request does not: a clear end goal. The one concrete suggestion that's been put forth - JIT compilation - is not within our remit. Please come up with something else.

jasnell commented 6 years ago

@cjihrig ...

Working on this in a fork of core implies that this will go into core

Not necessarily. I worked with the http2 implementation for quite some time in a fork of core before it was determined for certain that it would land there.

No one is currently arguing that this needs to or even should land in core, but surely there's absolutely zero harm in discussing the possibilities, right? And the unsubscribe button over in the margin of this issue gives those who may not be interested in following along the ability to ignore it. It's perfectly fine for folks to be skeptical about the benefits, but let's not needlessly discourage experimentation, especially when it costs you very little (e.g. no one is demanding that you pay it any attention at all).

cjihrig commented 6 years ago

And the unsubscribe button over in the margin of this issue gives those who may not be interested in following along the ability to ignore it.

I'm not sure if you're telling me to unsubscribe, or responding to @Tiriel, but I think @Tiriel was saying that they want to follow along.

but let's not needlessly discourage experimentation

No one is discouraging experimentation. However, a general trend in this thread is that the experimentation should happen first in userland. It does seem like creating a compiled addon would be faster and simpler than involving all of core. Once something exists, we could discuss whether it belongs in core or not.

Tiriel commented 6 years ago

Sorry for my bad English!

I'm not sure if you're telling me to unsubscribe, or responding to @Tiriel, but I think @Tiriel was saying that they want to follow along.

That's exactly what I was saying, sorry for the possible misunderstanding!

gibfahn commented 6 years ago

surely there's absolutely zero harm in discussing the possibilities right? we can easily create a fork off master to experiment and work with this stuff

No harm in discussing the possibilities, but a nodejs/node-gpu fork has an air of "officialness" to it that seems unwarranted, and I'm not sure what benefit it brings.

Why not discuss here, and then if someone wants to start working on it they can fork node themselves and try it (or just write a native addon which seems easier). Then others can chime in and be added as collaborators to that repo etc, and if it ends up being viable then we can discuss bringing it into core.

tl;dr +1 on having this issue to gather like-minded people, -1 on a nodejs/node-gpu fork.

jasnell commented 6 years ago

I don't believe anyone suggested creating a nodejs/node-anything fork. It doesn't have to live in this org.

In any case, the discussion here has run its course. For the folks who are interested in pursuing this, I recommend following an approach similar to what I did with http2: work on it in a fork for a bit until you have something more concrete, then bring it to the table as a PR once it's been proven out (assuming it does).

gibfahn commented 6 years ago

I don't believe anyone suggested creating a nodejs/node-anything fork. It doesn't have to live in this org.

Oh, then yeah zero harm in discussing it.

work on it in a fork for a bit until you have something more concrete, then bring it to the table as a PR once it's been proven out (assuming it does).

👍, but feel free to comment here with a link to a repo before it's been proven; it sounds like there are people who would like to help out.

YurySolovyov commented 6 years ago

@jasnell for http2 it was fine because node has raw sockets and http2 does not require new runtime (V8) features.

If the request is to make it possible to run JS functions on the GPU, @TheLarkInn needs to talk to the V8 or WASM (i.e. runtime owners) people.

TheLarkInn commented 6 years ago

I will provide a more actionable list of ideas here when I get a chance ^^.


hashseed commented 6 years ago

Regarding native support in V8... I don't think there is any high-level general purpose language that includes GPU support transparently, much less a single-threaded JavaScript. There have been attempts to add SIMD features to JavaScript, but they were dropped because, iirc, the speedup in prototype implementations was a lot less than expected, the use cases were very narrow, and the features were not particularly platform-independent.

My personal impression is that the same problems that plague Workers (long startup time, memory overhead, slow message passing, etc.) would affect GPU acceleration, likely worse.

I could imagine a future where wasm supports this transparently, but I would not hold my breath for it in the near future.

gireeshpunathil commented 6 years ago

I don't think there is any high-level general purpose language that includes GPU support transparently

IBM Java does this, and also openj9

The performance benefit at huge data volumes would outweigh any drawbacks, I guess.

kgryte commented 6 years ago

@gireeshpunathil Except that the GPU support you cite only seems to apply to Nvidia graphics cards. Thus, once again, raising concerns about the portability, generality, and feasibility of such an approach for Node.js and any potential role in core or as an officially sanctioned project.

I am also not sure what you mean by "huge data volumes". The issue with large datasets is the bottleneck encountered when transferring data from CPU to GPU. The point alluded to regarding V8 is that you will get a significant perf hit waiting for data to be passed from a Node.js process to a GPU, including blocking the event loop, and, in fact, that perf hit can be significant enough to nullify any potential performance benefit. I have personally experienced this hit going from GPU to CPU when writing simulations. And you can experience this yourself today on the web when passing data off from the browser runtime to the GPU for graphics processing. Thus the point regarding the problems affecting workers is rather salient.
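A rough back-of-envelope illustration of that transfer bottleneck (the bandwidth figure is an assumption for illustration, not a measurement):

// For a simple element-wise operation, the PCIe copy alone can dominate.
const elements = 1e8;                    // 100 million float32 values
const bytes = elements * 4;              // ~400 MB
const pcieBytesPerSec = 12e9;            // assumed ~12 GB/s effective PCIe bandwidth
const transferSeconds = (2 * bytes) / pcieBytesPerSec; // copy in + copy result out
console.log(transferSeconds.toFixed(3) + ' s spent just moving data'); // ~0.067 s

Unless the per-element work is expensive, that copy time can easily exceed the time a CPU would need to do the whole operation in place.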

So while, yes, IBM may offer Java flavors with support for particular GPUs (Nvidia; same applies for MATLAB), this does not equate to "high-level general purpose language", but rather a vendor specific implementation with asterisks. The reality is that any person seriously considering whether it is a good idea for the Node community to devote time and resources to creating "official" bindings needs to recognize the rather messy reality of GPU bindings and the fact that their upkeep and maintenance will be a significant drag. But maybe people need to learn this themselves.

gireeshpunathil commented 6 years ago

@kgryte - thanks. While languages (and virtual machines) with blocking I/O find it complex enough to implement the CPU-GPU binding (prepare, transport, and synchronize), the biggest advantage Node.js has is its language semantics (inherently asynchronous) and the event loop, which can cater to the CPU-GPU handshake well within its stated design: apart from polling the OS, we would also need to talk to the chip to check I/O readiness in one embodiment.

If you have experienced large memory copies throttling the single thread as a major concern, helper threads can be considered as an alternative for the in-memory copy.

gireeshpunathil commented 6 years ago

re: portability and generality: I guess that has to come from the card vendors, in terms of standard APIs and a well-defined specification - I don't know whether one exists. Its absence does not necessarily block the progress of building a proof of concept.

hashseed commented 6 years ago

IBM Java does this, and also openj9

Having to import com.ibm.gpu.* and call specially defined methods is hardly "transparent". What I understand by "transparent" is that the compiler/VM will recognize when it can compile certain code patterns to run on the GPU.

For huge data volumes, I don't see why GPU support cannot be implemented in a userland module - like previously mentioned, a module that compiles a subset of JavaScript to ship off to the GPU.

Looking at the IBM example, I don't see why you can't implement a native module that offers a GPU-backed sort, e.g.

var gpusort = require("gpusort");
gpusort([1, 2, 3, 4])
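A slightly fuller sketch of how such a userland wrapper could be shaped; gpusort-native is a hypothetical compiled binding, assumed only to expose a sort() over typed arrays:

// Hypothetical userland module: use a GPU-backed native binding when present,
// otherwise fall back to a plain JS sort so the package still works everywhere.
let native = null;
try {
  native = require('gpusort-native'); // assumed compiled add-on, not a real package
} catch (e) {
  // no GPU binding available on this machine
}

function gpusort(values) {
  const data = Float64Array.from(values);
  if (native) {
    return Array.from(native.sort(data)); // off-load to the GPU
  }
  return Array.from(data.sort()); // CPU fallback
}

module.exports = gpusort;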
YurySolovyov commented 6 years ago

Slightly different angle (and possibly off-topic): instead of using WASM as a compilation target, can we transpile to WebGL/OpenGL (compute) shaders? How much of an impedance mismatch (i.e. not having the right primitives) would there be?

hashseed commented 6 years ago

Slightly different angle (and possibly off-topic): instead of using WASM as a compilation target, can we transpile to WebGL/OpenGL (compute) shaders? How much of an impedance mismatch (i.e. not having the right primitives) would there be?

Sounds feasible, for a subset of JavaScript. It can be done in a userland module, and the user will have to choose the right parts of their code to run on the GPU.
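That is roughly the model userland projects such as GPU.js (mentioned earlier in this thread) follow: the user wraps the shader-able subset of JS in an explicit kernel and the library compiles it for the GPU. A rough sketch in that style; the exact method names and import shape may differ between versions:

// Only the kernel function (a restricted subset of JS) is compiled to a shader;
// everything else remains ordinary JavaScript on the CPU.
const { GPU } = require('gpu.js'); // import shape varies by gpu.js version

const gpu = new GPU();

const multiply = gpu.createKernel(function (a, b) {
  return a[this.thread.x] * b[this.thread.x];
}).setOutput([4]);

console.log(multiply([1, 2, 3, 4], [10, 20, 30, 40])); // roughly [10, 40, 90, 160]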

gireeshpunathil commented 6 years ago

var gpusort = require("gpusort"); gpusort([1, 2, 3, 4])

While this is a perfect consumption model from the user's angle, the binding may require a high degree of coordination with V8, Node core, and libuv, so core may be a better place for it, IMO.

hashseed commented 6 years ago

While this is a perfect consumption model from the user's angle, the binding may require a high degree of coordination with V8, Node core, and libuv

Can you elaborate? Considering that JS is single-threaded and this API is blocking, I don't see what coordination is necessary.

ofrobots commented 6 years ago

Tangential: I have no comment on this particular request about exposing GPU bindings one way or the other, but the comment 'why can't X be done as a native module' shows up frequently.

The user experience of native modules is terrible, not just for the developers of native modules but also for the end users (people who want to use X from JavaScript), as they have to ensure that the native module can be used correctly in all environments. Not to mention the complexities around being able to audit a binary blob fetched over the internet. If http were a native module, Node.js would not have gained the same popularity.

jasnell commented 6 years ago

While true, a quick native module proof-of-concept that does not need to focus so much on user/dev experience for the initial iteration would be a good first step in validating the concepts here. That can be implemented in a way that would make it easy/easier to integrate into core itself should it prove feasible.

The question about whether or not this can/should be a userland module in the end is premature. Let's first find out (a) if there's actual benefit to doing this and (b) what's involved and how difficult it is to do. The choice of whether it should be in core really depends on the answers to those two points.

ofrobots commented 6 years ago

The question about whether or not this can/should be a userland module in the end is premature

+1. That was my point too. Let's focus on the value.

hashseed commented 6 years ago

Right. I'd just like to steer the discussion away from having this baked into the VM. It doesn't seem feasible, and even if it were, it will not happen without TC39 approval. And locking into a vendor-specific way to perform GPU computation is very unlikely.

gireeshpunathil commented 6 years ago

@hashseed - a rough GPU sketch for gpusort([1,2,3,4,5]) (CUDA spec):

function gpusort(ar, sorter) {
  // 1. extract the argument
  // 2. create an off-heap memory
  // 3. copy the argument into the off-heap area
  // 4. copy the off-heap data into CUDA memory
  // 5. generate native code for sorter, assuming single thread and for the GPU target arch.
  // 6. flash the code into the card (I guess there exists driver APIs for this)
  // 7. invoke the generated code
  // 8. the GPU parallelizes and runs the code 
  // 9. reverse the copying of data (result), and memory cleanup
  // 10. return result to the caller.
}
hashseed commented 6 years ago

None of these steps can be skipped by V8 either. And since we are talking about a subset of JS that has a GPU target, it would not share much code with V8.

For gpusort you could actually ship pre-compiled GPU code and skip compilation.

I also don't get why you assume a single thread for the sorter. I don't think it ever makes sense to run things on the GPU on a single thread.

gireeshpunathil commented 6 years ago

@hashseed - sorry, but I could not follow your first part, can you please explain?

For gpusort you could actually ship pre-compiled GPU code and skip compilation.

Right, but then the scope becomes narrow. When we claim GPU bindings for JavaScript, wouldn't one expect the ability to run arbitrary user code on the chip?

I also don't get why you assume a single thread for the sorter. I don't think it ever makes sense to run things on the GPU on a single thread.

OK, if you are referring to step 5 in the above comment, to clarify: the programmer and the compiler assume the code will be run single-threaded, but when it is launched on the processor, the processor parallelizes the code. Meaning, we don't explicitly provide any logic in the code to make it parallel.

YurySolovyov commented 6 years ago

when it is launched on the processor, the processor parallelizes the code. Meaning, we don't explicitly provide any logic in the code to make it parallel.

This alone is a separate research topic; we don't even know how to do that in low-level languages that run on the CPU, so it is not realistic to expect it to happen for JS code on the GPU.

hashseed commented 6 years ago

@hashseed - sorry, but I could not follow your first part, can you please explain?

The steps you listed are how things work with CUDA. Even if V8 can automatically find sections of JS that it could off-load to the GPU, it would still have to perform these steps.

Besides, I don't think it's reasonable to expect V8 to automatically off-load to the GPU.

gireeshpunathil commented 6 years ago

@hashseed - thanks, now I follow you, and I agree that V8 is not expected to interact with the GPU. But isn't V8 best placed to generate code for a GPU target from JS code?

hashseed commented 6 years ago

V8 doesn't have any backend for GPU targets. And I don't think GPUs are well served by a dynamic language such as JS, so you'd limit yourself to a subset of JS.

So you really don't win a lot by sharing code with what V8 already has.

YurySolovyov commented 6 years ago

FWIW, compute shaders seem to be in the plans for after WebGL 2.0 (which has shipped in Chrome 56+, I think). Anyone have contacts with people in the Khronos Group? 😄

But even then, Node would have to incorporate a WebGL runtime to be able to use it.

TheLarkInn commented 6 years ago

@YurySolovyov I actually may have one.

bnoordhuis commented 6 years ago

This issue seems to have stalled and since it wasn't very actionable to start with, I'll go ahead and close it out.