Global thread pool (for async QInterface method dispatch)

unitaryfund / qrack

Comprehensive, GPU accelerated framework for developing universal virtual quantum processors

GNU Lesser General Public License v3.0

QUnit attempts to dispatch operations on many small qubit subsystems as asynchronous calls, using std::future. Thanks to gradual, recent improvements on the CPU side of our "CPU/GPU hybridization" techniques, QUnit and other simulator layers now run many parallelizable asynchronous tasks. That is great, but we run too many for a processor with 16 or fewer hyperthreads. If it helps, QUnit asynchronous parallelization can be disabled by building with -DENABLE_QUNIT_CPU_PARALLEL=OFF (which actually works by turning off QEngineCPU asynchronous parallelism). However, the asynchronous gains are rather significant, so ideally we would maximize utilization of CPU parallelism while avoiding the (dreaded) "resource temporarily unavailable" exception, which comes from dispatching more threads than the OS will allow.
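For context on the exception itself: in standard C++, std::async with std::launch::async throws std::system_error carrying std::errc::resource_unavailable_try_again when a new thread cannot be started. The sketch below shows one generic way to tolerate that, by falling back to the calling thread. The helper name AsyncOrInline is hypothetical and this is not what Qrack does; it only illustrates where the error surfaces.

```cpp
#include <future>
#include <system_error>
#include <utility>

// Hypothetical helper (not part of Qrack): try to run fn on its own thread,
// and fall back to the calling thread if the OS refuses to create one.
// fn must be copyable and callable with no arguments.
template <typename Fn> auto AsyncOrInline(Fn fn) -> std::future<decltype(fn())>
{
    using Result = decltype(fn());
    try {
        return std::async(std::launch::async, fn);
    } catch (const std::system_error& e) {
        // Only swallow "resource temporarily unavailable"; rethrow anything else.
        if (e.code() != std::errc::resource_unavailable_try_again) {
            throw;
        }
    }
    // Fallback: run synchronously, but still hand back a (ready) future.
    std::packaged_task<Result()> task(std::move(fn));
    std::future<Result> result = task.get_future();
    task();
    return result;
}
```

A caller would use it like std::async, for example `AsyncOrInline([] { return 6 * 7; }).get()`. Recovering after the fact is cruder than never over-dispatching in the first place, which is the approach the rest of this issue pursues.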

Native Windows, or other operating systems, might already be mostly free of this problem, since threading models differ across platforms. On Ubuntu (Linux), user-code POSIX threads are probably limited to about, or exactly, the CPU hyperthread count.

An obvious solution is centralized CPU thread dispatch (for example, a singleton DispatchQueue, which QEngineCPU already uses in a slightly different way). Trading some contention for available threads in the dispatch against maximal "async" parallelization of QUnit subsystems likely pays off very handsomely on net.
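As a rough illustration of what centralized dispatch means here, the following is a minimal fixed-size thread pool in standard C++. The class name CentralDispatch and its Submit method are hypothetical, and this is not Qrack's DispatchQueue. The point is only that the number of OS threads is fixed at construction, however many tasks are queued.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch, not Qrack's DispatchQueue: a fixed-size pool that caps
// the number of OS threads no matter how many tasks are submitted.
class CentralDispatch {
public:
    explicit CentralDispatch(std::size_t threadCount)
    {
        assert(threadCount > 0);
        for (std::size_t i = 0; i < threadCount; ++i) {
            workers.emplace_back([this] { WorkerLoop(); });
        }
    }

    ~CentralDispatch()
    {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stopping = true;
        }
        cv.notify_all();
        for (std::thread& w : workers) {
            w.join(); // workers drain the queue before exiting
        }
    }

    CentralDispatch(const CentralDispatch&) = delete;
    CentralDispatch& operator=(const CentralDispatch&) = delete;

    // Queue a task; the returned future is used like the one from std::async.
    template <typename Fn> auto Submit(Fn fn) -> std::future<decltype(fn())>
    {
        using Result = decltype(fn());
        auto task = std::make_shared<std::packaged_task<Result()>>(std::move(fn));
        std::future<Result> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.push([task] { (*task)(); });
        }
        cv.notify_one();
        return result;
    }

private:
    void WorkerLoop()
    {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mtx);
                cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                if (tasks.empty()) {
                    return; // stopping, and nothing left to run
                }
                job = std::move(tasks.front());
                tasks.pop();
            }
            job(); // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::function<void()>> tasks;
    std::vector<std::thread> workers;
    bool stopping = false;
};
```

A singleton variant would wrap one instance in a function-local static, sized from std::thread::hardware_concurrency() (which may return 0 and so needs a fallback value). Tasks beyond the thread count simply wait in the queue, which is the "contention" cost mentioned above.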

While this could systematically prevent "resource temporarily unavailable" from ever happening, it turns out not to be practically necessary (and so we also avoid a singleton pattern).

QEngineCPU previously dispatched an "async" task for every QUnit subsystem method call below the threshold of efficient parallelism, down to 1- or 2-qubit subsystems. However, typical consumer CPU single-thread speed is fast enough that we actually get slightly better performance by handling very small subsystem method calls on the "main" or "UI" thread, and we avoid many thread dispatches in the process. On my system, the "sweet spot" for switching between main-thread and "async" dispatch seems to be very roughly 2 qubits below the PSTRIDEPOW parameter, which controls how many work items are dispatched to a single thread at a time. That parameter can be tuned at build time and by environment variable, and doing so now raises or lowers this "async" threshold correspondingly. By default, this puts the minimum "async" subsystem size at about 12 qubits, so a single QUnit would need a simulation of 36+ qubits in the first place to dispatch even 3 asynchronous method calls at a time, which makes that exceedingly rare.
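The size-based switch described above can be sketched as follows. All names here (kMinAsyncQubits, ShouldDispatchAsync, MaxConcurrentAsync, DispatchSubsystemCall) are hypothetical, and the threshold of 12 is taken from the "about 12 qubits" figure in the text rather than from Qrack's source. How the real threshold is derived from PSTRIDEPOW is not modeled.

```cpp
#include <cassert>
#include <exception>
#include <future>
#include <utility>

// Hypothetical stand-in for the tuned threshold: the text puts the default
// minimum "async" subsystem size at about 12 qubits.
constexpr unsigned kMinAsyncQubits = 12U;

// True if a subsystem of this size is large enough to merit its own thread.
constexpr bool ShouldDispatchAsync(unsigned qubitCount, unsigned minAsyncQubits = kMinAsyncQubits)
{
    return qubitCount >= minAsyncQubits;
}

// Upper bound on simultaneous "async" calls: each one needs its own
// separable subsystem of at least minAsyncQubits qubits.
inline unsigned MaxConcurrentAsync(unsigned totalQubits, unsigned minAsyncQubits = kMinAsyncQubits)
{
    assert(minAsyncQubits > 0U);
    return totalQubits / minAsyncQubits;
}

// Run fn (a callable returning void) on a new thread for large subsystems,
// or on the calling thread for small ones. Either way the caller gets a future.
template <typename Fn> std::future<void> DispatchSubsystemCall(unsigned qubitCount, Fn fn)
{
    if (ShouldDispatchAsync(qubitCount)) {
        return std::async(std::launch::async, std::move(fn));
    }
    // Small subsystem: no thread dispatch at all, just an already-ready future.
    std::promise<void> done;
    try {
        fn();
        done.set_value();
    } catch (...) {
        done.set_exception(std::current_exception());
    }
    return done.get_future();
}
```

With these assumed defaults, MaxConcurrentAsync(36U) evaluates to 3, matching the 36+ qubit figure above: three simultaneous "async" calls require at least three separable 12-qubit subsystems.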

If hybrid simulation user code only ever dispatches, at the extreme, about 2-4 threads at once for a simulator, this all simply becomes moot. For anything that cannot meet the resource requirements of even this case, we have long had the -DENABLE_QUNIT_CPU_PARALLEL=OFF CMake build option to completely disable this asynchronous behavior, and that would likely be preferable or required for a processor that limited. Hence, we can table the central dispatch singleton for now.

Global thread pool (for async QInterface method dispatch) #926