unitaryfund / pyqrack

Pure Python bindings for the pure C++11/OpenCL Qrack quantum computer simulator library
MIT License

OpenCL (multiple) platform interop instability #6

Closed: WrathfulSpatula closed this issue 2 years ago

WrathfulSpatula commented 2 years ago

By default, the underlying Qrack library tries to use every OpenCL device available on the system at once, as best it can. We give users the option to control the primary device via an environment variable, but in just the past couple of days I've run into a situation where my Intel HD doesn't interoperate correctly with my NVIDIA RTX under significant load.
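For context, the primary-device control mentioned above could be exercised from Python roughly as follows. This is a minimal sketch, assuming the variable is named `QRACK_OCL_DEFAULT_DEVICE` and that it has to be set before the Qrack library is loaded; `with_primary_device` is a hypothetical helper written for this example, not part of pyqrack.

```python
import os

# Assumption: Qrack reads its primary OpenCL device index from this variable.
PRIMARY_DEVICE_VAR = "QRACK_OCL_DEFAULT_DEVICE"


def with_primary_device(index, env=None):
    """Return a copy of `env` with the primary device index set.

    Hypothetical helper for illustration. If `env` is None, the current
    process environment is copied instead.
    """
    new_env = dict(os.environ if env is None else env)
    new_env[PRIMARY_DEVICE_VAR] = str(int(index))
    return new_env
```

One way to use it would be `os.environ.update(with_primary_device(1))` before importing pyqrack, or passing the returned dictionary as `env=` to `subprocess.run` when launching a separate script. Note that this only picks the primary device; it does not stop Qrack from touching the other devices.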

We should give users another environment variable to soft-ban OpenCL devices from automatic load balancing. Alternatively, we might state this option as a list of devices to include exclusively in Qrack's automatic multiple-device operation. Either way, I'm currently using VirtualCL to hide OpenCL devices from the Qrack environment context, but this overhead shouldn't be necessary when the option would be trivial to implement in Qrack directly.
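To make the two formulations concrete, here is a rough sketch of how a ban list and an exclusive include list could be parsed and applied to a list of device indices. Everything in it is hypothetical: the variable names `QRACK_EXAMPLE_DEVICE_BAN` and `QRACK_EXAMPLE_DEVICE_ONLY` are invented for illustration and are not Qrack settings, and devices are represented as plain integers rather than real OpenCL handles.

```python
def parse_device_list(value):
    """Parse a comma-separated string such as "0,2" into a set of indices."""
    return {int(tok) for tok in value.split(",") if tok.strip()}


def select_devices(available, env):
    """Filter `available` device indices using two hypothetical variables.

    QRACK_EXAMPLE_DEVICE_ONLY: if set, keep only the listed devices.
    QRACK_EXAMPLE_DEVICE_BAN:  if set, drop the listed devices.
    Both names are made up for this sketch.
    """
    only = env.get("QRACK_EXAMPLE_DEVICE_ONLY")
    ban = env.get("QRACK_EXAMPLE_DEVICE_BAN")
    selected = list(available)
    if only is not None:
        allowed = parse_device_list(only)
        selected = [d for d in selected if d in allowed]
    if ban is not None:
        banned = parse_device_list(ban)
        selected = [d for d in selected if d not in banned]
    return selected
```

With three devices, `select_devices([0, 1, 2], {"QRACK_EXAMPLE_DEVICE_BAN": "1"})` leaves devices 0 and 2 for load balancing. Applying the include list first and the ban list second means a device named in both ends up excluded, which seems like the safer default for a stability workaround.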

WrathfulSpatula commented 2 years ago

After 15 minutes of testing, this is temporarily on hold. Banning my Intel HD at the OpenCL virtualization level restores the stability of my 2-device system, as does completely removing the Intel ICD (sudo apt remove intel-opencl-icd), but simply not using the device in Qrack still results in instability. The NVIDIA and Intel ICDs just don't play nicely together on my system right now. (I can't and wouldn't say whether this is the Intel or the NVIDIA ICD's "fault," either way.)

We've been discussing this on the Qrack Discord server/channel for about a week now, and we're leaning toward behavior like this being due to driver bugs that aren't within our control at the Qrack level. Unfortunately, if you're having problems like mine, you need to use an OpenCL virtualization layer like VirtualCL to ban whichever combination of devices is causing system instability, or you can completely uninstall certain ICDs (for example, with sudo apt remove intel-opencl-icd for the Intel NEO runtime on my Ubuntu machine, which runs the Intel ICD alongside an NVIDIA ICD).

Qrack-level device selectivity simply doesn't fix the original issue right now, so we'll experiment and plan alternatives. For now, OpenCL virtualization layers or package management are the best options we have.

WrathfulSpatula commented 2 years ago

By the way, of course, this could be a Qrack internal bug. We've been trying for over a week to diagnose whether this can be fixed entirely within Qrack, and I'm still trying to find a way. However, when I get CL_INVALID_MEM_OBJECT when copying between buffers, even if the two buffers involved reside on different OpenCL devices, my understanding of the OpenCL standard is that buffer migration is expressly supposed to happen automatically. Maybe that doesn't mean that our interop is perfect, though; for instance, kernel argument hooks might still need a manual migration back to the correct context. Still, I don't think this should be necessary, strictly according to the OpenCL standard.

WrathfulSpatula commented 2 years ago

This is actually fixed by now, in effect, so I apologize for inattentiveness to my own issue report. As of v0.11.1, cross-platform interoperability is more stable than ever, including QPager-based NVIDIA/Intel HD interop up to at least 32 qubits on many consumer platforms. Also, isSchmidtDecomposeMulti no longer suffers instability from NVIDIA/Intel HD interop.