unitaryfund / qrack

Comprehensive, GPU accelerated framework for developing universal virtual quantum processors
https://qrack.readthedocs.io/en/latest/
GNU Lesser General Public License v3.0

OpenCL cross-context interop (NVIDIA/Intel HD) #936

Closed WrathfulSpatula closed 2 years ago

WrathfulSpatula commented 2 years ago

I'm realizing that it should be theoretically possible to attain an additional cross-device global qubit on my systems that pair an NVIDIA card with an Intel HD that has a larger max allocation segment and more total global RAM. However, the most common blocker to achieving this seems to be error code -38, CL_INVALID_MEM_OBJECT, upon setting cross-context kernel arguments. Buffer migration can happen implicitly or explicitly, according to the OpenCL standard, but it seems like we might need an explicit buffer migration in this case.

I'm working on this, today. I hope that systems with both an NVIDIA GPU and an Intel HD, for example, should be able to achieve an additional "ket" method qubit, as a result.
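As a rough illustration of why pooling two devices could buy exactly one more "ket" qubit, here is a minimal sketch of the memory arithmetic. It assumes each amplitude is one complex number stored as two floats; the device sizes are hypothetical placeholders, not the sizes of the hardware discussed in this issue, and the function names are made up for the example. It also ignores per-device max allocation segment limits, which constrain the real simulator further.

```python
# Rough memory footprint of a full state vector ("ket") simulation.
# Assumption: one amplitude = one complex number = two floats.

GIB = 1 << 30


def ket_bytes(qubits: int, float_bytes: int = 4) -> int:
    """Bytes needed to hold 2**qubits complex amplitudes."""
    return (1 << qubits) * 2 * float_bytes


def max_ket_qubits(available_bytes: int, float_bytes: int = 4) -> int:
    """Largest qubit count whose state vector fits in available_bytes.

    Returns -1 if not even a single amplitude fits.
    """
    amplitudes = available_bytes // (2 * float_bytes)
    return amplitudes.bit_length() - 1 if amplitudes > 0 else -1


# Hypothetical device memory sizes, for illustration only.
gpu_bytes = 8 * GIB
igpu_bytes = 8 * GIB

print(ket_bytes(30) // GIB)                    # 8
print(ket_bytes(31) // GIB)                    # 16
print(max_ket_qubits(gpu_bytes))               # 30
print(max_ket_qubits(gpu_bytes + igpu_bytes))  # 31
```

Each added qubit doubles the state vector, so doubling the usable memory (or halving the float width via `float_bytes=2`) yields one more qubit in this simplified model.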

WrathfulSpatula commented 2 years ago

So, the problem might not be cross-context, but specifically cross-platform buffer migration interoperability--and VirtualCL might already handle this for us! I'm experimenting with VCL for this purpose, now.

WrathfulSpatula commented 2 years ago

VirtualCL hangs quite quickly, but now I'm experimenting with the --use-host-dma option, which allocates OpenCL state vector buffers on host RAM. (This can be controlled by a simulation factory/constructor argument, as the benchmark suite implements this option.) If this works, it's comparable to how cross-platform buffer migration would happen without the option. (It's also not drastically more execution time than pure VRAM simulation, so far.)

WrathfulSpatula commented 2 years ago

--use-host-dma works well, but not perfectly. On the first run, I have a 58% success rate on 31 qubits for test_qft_cosmology, which almost always requires 31 qubits of "ket" simulation by the end of any sample. For now, this is the suggested method for cross-platform interoperability, like in the case of my NVIDIA/Intel HD hybrid system!

WrathfulSpatula commented 2 years ago

Note: setting the environment variable with export QRACK_SEGMENT_GLOBAL_QB=1 seems to help in my case, so long as I remember that my NVIDIA card will segment into 8 pages as a result. The right value of this variable depends on the difference in max allocation segment size between the heterogeneous devices, and might even be 0. Success rate might be 100%, now.

WrathfulSpatula commented 2 years ago

This seems to have stopped working, weirdly. I don't have a satisfactory explanation for why. Hypothetically, an additional qubit might have been attainable due to me having the build in FP16 mode, but that doesn't make much sense, given that I don't trust FP16 above about 16 qubits, and I knew I was attempting 31 qubits, so I rarely build for FP16 at all. The execution time numbers also didn't seem consistent with FP16, with or without the --use-host-dma option, after the fact.

Occasionally, we get a performance "hiccup." The last qubit at 31 width for the QFT also seemed to take less than twice the time for the 30 qubit QFT, which might have been too-good-to-be-true. Sorry for the false alarm, but I'll keep hacking at it.
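To make the "less than twice the time" suspicion concrete, here is a naive cost model, offered only as a sanity check. It assumes a textbook QFT of n Hadamards plus n(n-1)/2 controlled phase rotations (no final swaps), and that every gate sweeps the full 2^n state vector. Qrack's optimization layers can avoid full sweeps for part of a run, so this is a rough expectation for pure "ket" simulation, not a measured or guaranteed figure; the function names are made up for the example.

```python
# Naive cost model: (number of QFT gates) x (amplitudes touched per gate).
# Assumptions: textbook QFT without final swaps; every gate sweeps all
# 2**n amplitudes of the state vector.


def qft_gate_count(n: int) -> int:
    """n Hadamards plus n*(n-1)/2 controlled phase rotations."""
    return n + n * (n - 1) // 2


def relative_cost(n: int) -> int:
    """Unitless work estimate for an n-qubit QFT under the model above."""
    return qft_gate_count(n) * (1 << n)


ratio = relative_cost(31) / relative_cost(30)
print(round(ratio, 3))  # 2.133
```

Under these assumptions, going from 30 to 31 qubits should slightly more than double the run time, so a ratio below 2 is a reasonable cue to double-check the result.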