unitaryfund / qrack

Comprehensive, GPU accelerated framework for developing universal virtual quantum processors
https://qrack.readthedocs.io/en/latest/
GNU Lesser General Public License v3.0

Mixed FP precision for QUnit subsystems #939

Open WrathfulSpatula opened 2 years ago

WrathfulSpatula commented 2 years ago

Basically, Qrack takes as a working standard that fp16 is naturally accurate up to 16 qubits, fp32 is valid up to 32 qubits, fp64 is valid up to 64 qubits, etc.

If we allowed precision to be mixed in the same build, we could size floating point precision according to QUnit separable subsystem size. This would require up-casting and down-casting, but it might make sense at the fp16/fp32 boundary for (hybrid) CPU simulation.

WrathfulSpatula commented 2 years ago

Seeing as (according to the above) all we want this for is QEngineCPU, it could be furnished by a QEngineCPU generically typed on state-vector floating-point precision. It would require cross-precision Compose()/Decompose()/Dispose() overloads for QUnit.

WrathfulSpatula commented 2 years ago

Since Qrack contains a portable IEEE fp16 definition in an OSS header, we can safely assume that fp16 is always present. Then, no fp types wider than the build's -DFPPOW=[n] will be referenced in QEngineCPU or anywhere else in the library. (Practically, there are some systems that have float available, but not double.)

WrathfulSpatula commented 2 years ago

If Compose() defaults to the precision of the `this` pointer when up-casting, we can toggle Compose() order and position between front and back to get the appropriate resultant precision. Decompose() doesn't need to down-cast, because, for QUnit precision hybridization purposes, we'll include on-demand up-cast/down-cast methods.

(EDIT: Actually, we'll just let QUnit precision hybridization generally use on-demand up/down-cast.)

WrathfulSpatula commented 2 years ago

Then again, fusing the up-cast into Compose() itself might be critical for practical payoff. This is tricky at the CPU/GPU boundary, but the general optimization would only be practical if the CPU/GPU hybridization threshold were above 16 qubits. Since that condition is not typically satisfied, this is on the backlog for the moment.