unitaryfund / qrack

Comprehensive, GPU accelerated framework for developing universal virtual quantum processors
https://qrack.readthedocs.io/en/latest/
GNU Lesser General Public License v3.0

Feature: Optional CUDA Support #397

Closed: WrathfulSpatula closed this issue 1 year ago

WrathfulSpatula commented 4 years ago

I'm a big fan of the OpenCL standard, and I've said in the past that basically none of the proprietary GPU acceleration standards appeal to me, with the specific possible exception of CUDA. Would our user base benefit from an option to use CUDA as an alternative to OpenCL?

With WSL2's GPU support limited to a hacked-around CUDA, for now, I'd still like to support WSL2. (Native C++11 support is already there, but without GPU acceleration.) Believe it or not, this would probably not be a Herculean restructuring of the Qrack library, as basically everything outside of QEngineOCL is "layered" without any knowledge of an OpenCL dependency. Basically, all that's necessary is to write something like a "QEngineCUDA," which would immediately be interchangeable with QEngineOCL, anyway.
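
To make the "layered" point concrete, here is a rough, hypothetical sketch; the interface and method names are invented for illustration and are not Qrack's actual class definitions. Everything above the engine sees only a generic interface, so an OpenCL and a CUDA engine are drop-in replacements for each other:

```cpp
// Hypothetical sketch: the class and method names here are invented for
// illustration; only the general "layered" idea matches the discussion above.
class QEngineBase {
public:
    virtual ~QEngineBase() {}
    virtual void ApplyX(unsigned qubit) = 0; // illustrative gate method
};

class QEngineOCLLike : public QEngineBase {
public:
    void ApplyX(unsigned /*qubit*/) override { /* enqueue an OpenCL kernel */ }
};

class QEngineCUDALike : public QEngineBase {
public:
    void ApplyX(unsigned /*qubit*/) override { /* launch a CUDA kernel */ }
};

// "Layered" code above the engine never needs to know which backend it holds.
void FlipQubit(QEngineBase& engine, unsigned qubit) { engine.ApplyX(qubit); }
```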

Thoughts? Comments? Objections?

twobombs commented 4 years ago

Sorry for the late reply, got an exam coming up :)

CUDA support in Qrack would make implementing Qrack for Docker a whole lot easier, because of the existing support for CUDA in Docker. With OpenCL, I need to install drivers for the different OpenCL vendors inside the container, which might or might not work well alongside other containers and/or drivers installed on the bare-metal host. So from that point of view, CUDA support would be great.

On the WSL2 side, everything is awesome in that, after years of waiting, Linux kernel support has emerged on the Windows platform (again, good news for Docker usage scenarios). WSL2, however, does come with a penalty in the form of a performance tax. A fully monitored k8s stack with Rancher, Istio, and CUDA support clocks in at ~10% load at idle on a 12-core bare-metal machine; on WSL2 it sometimes even hits 80% at 'idle', partly because of noisy Windows background and foreground processes.

The performance penalty is mostly due to the emulation layer in WSL2. However, when using OpenCL and/or CUDA on GPU devices, this penalty doesn't really matter: the running thread might max out one CPU core, but it streams all data and calculations to and from the GPU, which at that point does all the additional work. So the idea IMHO is a good one; just don't expect stellar, top-notch performance on WSL2 compared to native bare-metal Linux with CUDA.

In either scenario, I'm very interested in following progress and comparing benchmarks between OpenCL and CUDA as the code matures, since that would be an apples-to-apples comparison. And that is a very rare and precious thing in OpenCL/CUDA land. I'm willing to put in my time and test equipment to run the code on either platform. (Tesla K80 on AMD 12/24-core machines.)

WrathfulSpatula commented 4 years ago

(Good luck with your exam!)

One of the major reasons Qrack chose OpenCL over other multiprocessing APIs, CUDA for example, was specifically its open standard, ostensibly improving or ensuring general public access to the resources and utility of Qrack. With solid support for OpenCL in place (including native Windows support, by the way), I find myself asking: are we missing opportunities to enable further general access? Is anyone left out in the cold? WSL Ubuntu is definitely a "market" segment I'm personally interested in, but supporting CUDA where OpenCL support might be limited seems like an even broader segment served by the same development priorities.

Of all the proprietary standards of which I am aware (DirectX, Metal...), CUDA seems to be the single best candidate for increasing access for a broader audience. As far as I'm concerned, it's fine to integrate with commercial and closed products, within the terms of the LGPL, so long as it benefits general access.

So, it might take a while for the CUDA support to mature, but I might even have a "QEngineCUDA" drafted by Monday. I'm glad to hear you'll likely use this, @twobombs. (So will I!)

WrathfulSpatula commented 4 years ago

I'm going to adapt the OpenCL QEngine into a CUDA version, and I think the task is mostly a syntactical simplification thereof, but basically we need a C++11 STL implementation of a serial background queue with optional in-queue callbacks (which is not particularly difficult). I suspect that CUDA is "thread-safe" in the same sense as OpenCL, so we'll probably benefit from adhering to the async model of the low-level implementation, which has already been highly scrutinized.
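
For illustration only, a minimal C++11 STL sketch of such a serial background queue with optional per-item callbacks might look like the following; this is not Qrack code, just the general shape of the component described:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal serial dispatch queue: items run one at a time, in order, on a
// single background thread; each item may carry an optional callback.
class SerialQueue {
public:
    SerialQueue() : stop(false), worker([this] { Run(); }) {}
    ~SerialQueue() {
        {
            std::lock_guard<std::mutex> lk(m);
            stop = true;
        }
        cv.notify_one();
        worker.join();
    }
    void Enqueue(std::function<void()> work, std::function<void()> callback = nullptr) {
        {
            std::lock_guard<std::mutex> lk(m);
            items.push({std::move(work), std::move(callback)});
        }
        cv.notify_one();
    }

private:
    struct Item { std::function<void()> work, callback; };
    void Run() {
        for (;;) {
            Item item;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return stop || !items.empty(); });
                if (stop && items.empty()) return; // drain, then exit
                item = std::move(items.front());
                items.pop();
            }
            item.work();                        // run the queued work serially
            if (item.callback) item.callback(); // then its optional callback
        }
    }
    std::mutex m;
    std::condition_variable cv;
    std::queue<Item> items;
    bool stop;
    std::thread worker;
};
```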

WrathfulSpatula commented 4 years ago

As I dig in with this, I'm updating my assessment of the requirements. I doubt we need the queue I mentioned above; I doubt we need anything we don't already have. (Almost) everything the OpenCL API standard offers, CUDA does as well. It's nearly possible to implement the exact same behavior with library-respective API calls exchanged line-for-line. However, I am not yet a CUDA programmer (though knowing OpenCL goes a long way). Rather than implement the CUDA engine practically from scratch and build it up to the performance of the OpenCL engine, the target should be to migrate basically one-to-one between OpenCL and CUDA features for the QEngine. To do this, I have to reach deep into "advanced" CUDA programming and APIs. It looks like CUDA makes a high-level implementation much easier than OpenCL, but our concern is maximum performance, so we need to "reach deep." The OpenCL implementation is probably close to the best of what CUDA can do, with some nuance.
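
As a rough, hedged illustration of how closely the host-side calls can map, the following sketch pairs CUDA runtime calls with their approximate OpenCL counterparts in comments; the buffer, stream, and kernel names are invented for the example:

```cpp
#include <cuda_runtime.h>
#include <complex>

// Rough one-to-one correspondence between OpenCL and CUDA host calls,
// for a hypothetical state-vector buffer of `n` amplitudes.
void ExampleTransfer(const std::complex<float>* hostState, size_t n)
{
    const size_t bytes = n * sizeof(std::complex<float>);

    // clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, ...)  ->
    void* devState = nullptr;
    cudaMalloc(&devState, bytes);

    // clCreateCommandQueue(...)  ->
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // clEnqueueWriteBuffer(queue, buffer, CL_FALSE, 0, bytes, hostState, ...)  ->
    cudaMemcpyAsync(devState, hostState, bytes, cudaMemcpyHostToDevice, stream);

    // clEnqueueNDRangeKernel(queue, kernel, ...)  ->
    // myKernel<<<gridDim, blockDim, 0, stream>>>(devState, n);   (kernel launch)

    // clFinish(queue)  ->
    cudaStreamSynchronize(stream);

    // clReleaseMemObject(buffer) / clReleaseCommandQueue(queue)  ->
    cudaFree(devState);
    cudaStreamDestroy(stream);
}
```

(For a genuinely asynchronous host-to-device copy, the host buffer would also need to be pinned, e.g. with cudaMallocHost or cudaHostRegister, much as OpenCL benefits from mapped or pinned buffers.)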

The draft might not be finished tonight, but this is top priority; honestly, I'm only diverting into other features and improvements in progress to avoid boredom.

WrathfulSpatula commented 4 years ago

(The CUDA kernels already compile, by the way, but I can't say that they aren't bugged, without host code to test them.)

WrathfulSpatula commented 4 years ago

I'm reading about CUDA multi-GPU programming: I think we actually can and should develop this in two stages. For the first stage, QEngineCUDA will support different CUDA GPUs per QEngine instance, but it will rely on the unified memory space for inter-device communication. That is, for the first development stage, we will not support inter-op between devices that cannot share the same device memory space (but we will support those that can). The second stage will be nice to have, but I think the first stage already gives us access to features like NVLink, and it starts out significantly less complicated.
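
A hedged sketch of what that first stage could look like on the CUDA side, assuming devices that can enable peer access within the unified virtual address space; device indices and buffer sizes are illustrative only:

```cpp
#include <cuda_runtime.h>

// Stage-one sketch: two engines on two devices, communicating through the
// unified virtual address space when peer access is available (e.g. NVLink).
bool TryEnablePeerCopy(int devA, int devB, size_t bytes)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, devA, devB);
    if (!canAccess) {
        return false; // stage two (explicit staging through the host) would handle this
    }

    cudaSetDevice(devA);
    cudaDeviceEnablePeerAccess(devB, 0);
    void* bufA = nullptr;
    cudaMalloc(&bufA, bytes);

    cudaSetDevice(devB);
    cudaDeviceEnablePeerAccess(devA, 0);
    void* bufB = nullptr;
    cudaMalloc(&bufB, bytes);

    // With peer access enabled, a device-to-device copy goes directly over
    // the interconnect, without bouncing through host memory.
    cudaMemcpyPeer(bufB, devB, bufA, devA, bytes);

    cudaFree(bufB);
    cudaSetDevice(devA);
    cudaFree(bufA);
    return true;
}
```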

twobombs commented 4 years ago

Multi-GPU is kinda EOL, except for the Quadro and V100/A100 series GPUs. SLI on the 2070/2080 does not have unified memory; textures are synced between cards in order to render the scene. This might or might not be useful for Qrack. With memory and PCIe 4 speeds ramping up nowadays, it might be interesting to do batches from memory to the GPU instead, but that's just me rambling. I liked SLI back in the day. It would be great to see it back in this project, but onboard memory and disk speeds are catching up fast.

Kinda hyped for a CUDA implementation of Qrack. I feel it's a good direction.

( I passed the exam :) )

WrathfulSpatula commented 4 years ago

(Congratulations! Good work!)

In case you can't tell, I'm not familiar with the CUDA paradigm yet, except for any resemblance to OpenCL. It seems to me like CUDA has a lot of nice features that make it easy to work at a higher level than OpenCL for simple hardware scenarios, but our domain motivates the performance advantages of the low-level approach, and it can actually be harder to use the high-level CUDA API to accomplish the nitty-gritty of what's fairly standard in OpenCL. It's really not hard to similarly limit program and kernel complexity in OpenCL by limiting your features and support to what CUDA affords you most readily, but then the high-level API becomes a little bit of a barrier. Paradoxically, it feels at first like my code is less coupled to my hardware with OpenCL than with CUDA, even leaving aside proprietary vs. open standard. That's me griping like an old man, though.

I'll take what you're saying into consideration. Maybe, at least, the CUDA QEngine can be stripped way down compared to the OpenCL QEngine, with no great loss relative to expectations for the paradigm/API. My biggest hesitation is the developer maintenance cost of two very closely related engine variants. I'm committed to producing something for CUDA, and the best opportunity for the CUDA engine to benefit from our experience with OpenCL is to get as close to a one-to-one port of features at the get-go, but the two become more difficult to maintain in parallel the further they diverge. I'll have to do more research before I fully commit to a design.

twobombs commented 4 years ago

I've been following CUDA projects for some time. NVIDIA likes to offer their customers and users an abstraction oriented towards viewers and visual editors, for enhancement of in-game effects. CUDA cores are, in my view, multifunction, versioned ASICs with a relatively fat core and a whole lot of high-level abstraction in them. (e.g. Ageia)

IMHO it's very much geared towards whatever topics of interest 'du jour' land in the microcode of a newer version and need to be brought to market in a timely fashion.

Interesting that you mention that maintaining two codebases would be unwise, because I was thinking about a wrapper along the lines of the code here on GitHub: https://github.com/vtsynergy/CU2CL The functionality we'd need would be that, in reverse. I feel a lot of the inner workings of CUDA code can be gleaned from that project.

I totally agree that maintaining two codebases would create unwanted deltas and quite possibly even more cross-platform regressions that would be no fun to debug across a myriad of devices and a different ecosystem, so a wrapper might be a fitting solution until a code choice has been made one way or the other.

no rush, indeed.

WrathfulSpatula commented 3 years ago

I want to open this up for consideration for the Unitary Fund hackathon. If you are a CUDA programmer, this could be one of the best features anyone could add to Qrack that is somewhat out of my ken.

See the branch at https://github.com/vm6502q/qrack/tree/qenginecuda. Also read @twobombs' last comment about the potential for optionally wrapping the OpenCL implementation as a CUDA implementation.

I'm happy to offer and maintain a CUDA build option, particularly if there is demand. I'm impressed by Qiskit Aer's performance on GPU benchmarks, which appears to be CUDA-based. By intention, we use a combination of C++ std::future and OpenCL, for maximum portability and minimum dependency. Provisionally, it seems like Qrack suffers less penalty on low-width and low-depth circuits, but CUDA might be another significant performance inroad on systems with NVIDIA cards. I've come to the point in benchmarks where I have to admit that CUDA might offer more speed for less complexity, when compatible (even if my knee-jerk reaction is to grumble about whether that could be reversed with reversed vendor prioritization of the respective standards; I can admit that I like what I've seen of the CUDA standard).

With appropriate build options, QEngineOCL should be replaced by QEngineCUDA. OCLEngine likely simply isn't necessary in that build. Watch out for ENABLE_OPENCL macro conditionals throughout the library, particularly in constructors and maybe in the PInvoke API. Please reuse our kernels, and please reach out to me if I can help, with more information or with implementation.
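
As a purely hypothetical illustration of the conditional pattern to audit: ENABLE_OPENCL is the existing macro, while ENABLE_CUDA and the stub classes below are assumptions made only to show the shape of the conditional that tends to appear in constructors and factory code.

```cpp
// Hypothetical sketch only: ENABLE_OPENCL is the existing macro; ENABLE_CUDA
// and these stub classes stand in for the real engines.
#include <memory>

struct QEngineStub { virtual ~QEngineStub() {} };
struct QEngineOCLStub : QEngineStub {};  // stands in for QEngineOCL
struct QEngineCUDAStub : QEngineStub {}; // stands in for QEngineCUDA
struct QEngineCPUStub : QEngineStub {};  // stands in for the CPU engine

std::shared_ptr<QEngineStub> MakeDefaultEngine()
{
#if ENABLE_OPENCL
    return std::make_shared<QEngineOCLStub>();  // existing GPU path
#elif ENABLE_CUDA
    return std::make_shared<QEngineCUDAStub>(); // proposed CUDA path
#else
    return std::make_shared<QEngineCPUStub>();  // CPU-only fallback
#endif
}
```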

WrathfulSpatula commented 3 years ago

Now that half precision amplitudes have been added as an option, I want to note that we can accept a QEngineCUDA implementation without half support, for now. half is simply not a primitive type, but I could personally add half support to an accepted CUDA engine pull request. (See #664 for details about the half implementation, including a list of PRs.)
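
For reference, CUDA does expose a device-side 16-bit half type through cuda_fp16.h; a minimal, hypothetical sketch of what a half-precision kernel could look like follows (not tied to Qrack's actual half implementation, and requiring a GPU architecture with half arithmetic, sm_53 or newer):

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch of device-side half precision; the kernel and its
// scaling operation are illustrative, not Qrack's actual amplitude code.
__global__ void scaleHalf(__half* amps, __half factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        amps[i] = __hmul(amps[i], factor); // half * half multiply
    }
}

// Host side, e.g.: scaleHalf<<<blocks, threads>>>(devAmps, __float2half(0.5f), n);
```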

WrathfulSpatula commented 1 year ago

Closed by #984!