Closed WrathfulSpatula closed 2 years ago
So, the problem might not be cross-context as such, but specifically cross-platform buffer migration interoperability, and VirtualCL might already handle this for us! I'm now experimenting with VCL for this purpose.
VirtualCL hangs quite quickly, but now I'm experimenting with the `--use-host-dma` option, which allocates OpenCL state vector buffers on host RAM. (This can be controlled by a simulation factory/constructor argument, which is how the benchmark suite implements this option.) If this works, it's comparable to how cross-platform buffer migration would happen without the option. (It also doesn't take drastically more execution time than pure VRAM simulation, so far.)
`--use-host-dma` works well, but not perfectly. On the first run, I have a 58% success rate on 31 qubits for `test_qft_cosmology`, which almost always requires 31 qubits of "ket" simulation by the end of any sample. For now, this is the suggested method for cross-platform interoperability, like in the case of my NVIDIA/Intel HD hybrid system!
Note: setting the environment variable with `export QRACK_SEGMENT_GLOBAL_QB=1` seems to help in my case, so long as I remember that my NVIDIA will segment into 8 pages as a result. The right value of this variable depends upon heterogeneous max allocation segment size differences, and might even be `0`. Success rate might be 100%, now.
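To make the memory scale behind these numbers concrete, here is a rough back-of-the-envelope sketch. It assumes single-precision complex amplitudes at 8 bytes each (an assumption about the build, not something stated above), and it treats the 8 pages as an even split of the whole state vector purely for illustration. The function names are hypothetical and are not part of Qrack.

```python
GIB = 1 << 30  # bytes per gibibyte


def state_vector_bytes(qubits, bytes_per_amplitude=8):
    """Bytes needed for a full "ket" of 2**qubits complex amplitudes.

    bytes_per_amplitude=8 assumes two 32-bit floats per amplitude.
    """
    return (1 << qubits) * bytes_per_amplitude


def page_bytes(qubits, pages, bytes_per_amplitude=8):
    """Bytes per page if the state vector is split evenly into `pages`."""
    total = state_vector_bytes(qubits, bytes_per_amplitude)
    if total % pages != 0:
        raise ValueError("state vector does not split evenly into pages")
    return total // pages


if __name__ == "__main__":
    print(state_vector_bytes(31) / GIB)  # 16.0 GiB for 31 qubits
    print(page_bytes(31, 8) / GIB)       # 2.0 GiB per page with 8 even pages
```

Each added qubit doubles the total, which is why the per-device max allocation segment size matters so much at this width.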
This seems to have stopped working, weirdly. I don't have a satisfactory explanation for why. Hypothetically, an additional qubit might have been attainable because I had the build in FP16 mode, but that doesn't make much sense: I don't trust FP16 above about 16 qubits, and I knew I was attempting 31 qubits, so I rarely build for FP16 at all. The execution time numbers also didn't seem consistent with FP16 with the `--use-host-dma` option, after the fact.
Occasionally, we get a performance "hiccup." The last qubit at 31 width for the QFT also seemed to take less than twice the time for the 30 qubit QFT, which might have been too-good-to-be-true. Sorry for the false alarm, but I'll keep hacking at it.
I'm realizing that it should be theoretically possible to attain an additional cross-device global qubit on my systems with an NVIDIA card and a larger Intel HD max allocation segment and total global RAM. However, the most common blocker to achieving this seems to be the `-38`, `CL_INVALID_MEM_OBJECT` error code upon setting cross-context kernel arguments. Buffer migration can happen implicitly or explicitly, according to the OpenCL standard, but it seems like we might need an explicit buffer migration in this case. I'm working on this, today. I hope that systems with both an NVIDIA GPU and an Intel HD, for example, should be able to achieve an additional "ket" method qubit, as a result.
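As a way to picture the failure mode, here is a toy model in plain Python. It is not OpenCL or Qrack code: `ToyContext`, `ToyBuffer`, `ToyCLError`, and `migrate_via_host` are all hypothetical names, and only the value `-38` for `CL_INVALID_MEM_OBJECT` comes from the OpenCL headers. The model relies on the fact that an OpenCL buffer belongs to the context that created it, and it assumes that an explicit migration could take the form of a copy staged through host memory, which may or may not be what ends up working here.

```python
CL_INVALID_MEM_OBJECT = -38  # value defined in the OpenCL headers


class ToyCLError(Exception):
    """Stand-in for an OpenCL error return code (hypothetical)."""

    def __init__(self, code):
        super().__init__(f"OpenCL-style error {code}")
        self.code = code


class ToyBuffer:
    """Stand-in for a device buffer, tied to the context that created it."""

    def __init__(self, context, data):
        self.context = context
        self.data = bytearray(data)


class ToyContext:
    """Stand-in for one context, e.g. one per platform (hypothetical)."""

    def __init__(self, name):
        self.name = name

    def create_buffer(self, data=b""):
        return ToyBuffer(self, data)

    def set_kernel_arg(self, buffer):
        # A buffer from a foreign context is not a valid memory object here.
        if buffer.context is not self:
            raise ToyCLError(CL_INVALID_MEM_OBJECT)
        return buffer


def migrate_via_host(buffer, target):
    """Explicit migration sketch: read to host bytes, then write to target."""
    staged = bytes(buffer.data)          # "read" the page into host memory
    return target.create_buffer(staged)  # "write" it into a new buffer


if __name__ == "__main__":
    nvidia, intel = ToyContext("nvidia"), ToyContext("intel")
    page = nvidia.create_buffer(b"\x01\x02\x03")
    try:
        intel.set_kernel_arg(page)
    except ToyCLError as err:
        print(err.code)  # -38
    moved = migrate_via_host(page, intel)
    intel.set_kernel_arg(moved)  # accepted: buffer now belongs to `intel`
```

The design point of the sketch is only that the receiving context gets its own buffer, rather than being handed a handle created elsewhere.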