Open Strilanc opened 3 years ago
Do you have an example of when the cost of key lookup would be significant compared to the cost of doing the simulation itself? It would be valuable to be able to provide before/after benchmarks vs the cost of maintaining the feature.
Also I wonder if this would be better done using a strategy pattern instead of inheritance. So, you set (or preferably inject) repetition_sampling_strategy
instead of requiring a method override. That would allow you to pick and choose which strategy you want rather than having it bound to the simulator class and dealing with inheritance. (There are a number of other things in simulators that I feel are orthogonal concerns and may be better to extract into strategy patterns as well, but that's a different topic).
(Also, granted I'm still new to Python so if there's a more Pythonic way of doing this then that's fine too.)
I like the idea of a packed representation because it's very natural to get from hardware. A different way to support something like this would be to make cirq.Result
an abstract type instead of a concrete type. Then samplers would be free to make their own implementations using whatever packed data representation they want, as long as they support "unpacking" to access data for particular measurements by key, per the Result
API.
Note, the v1 engine API defined exactly this sort of packed result format: https://github.com/quantumlib/Cirq/blob/master/cirq/google/api/v1/program.proto#L35 (there, the results are actually bit packed, giving the ~8x size reduction mentioned by @Strilanc). We ended up changing this in the v2 api to use a different format with results indexed by key, but TBH I think it'd be worth revisiting this decision for performance reasons, especially if the cirq.Result
type itself becomes abstract and we have implementations for packed data.
A somewhat related issue is #3233 which wants to generalize the Result
type along other axes such as allowing ints (for qudit measurements) or IQ points (for lower-level hardware readout info) instead of bits. In those cases, as here, I think decoupling the cirq.Result
interface from particular concrete implementations would be very helpful.
Also is this something that would have to be done in the simulator, or is it something that could be done in post processing? If the latter, we can make a function that takes in a measurement map and a circuit, and output a json-like object that has all the nested subcircuit and repetition measurements organized as nested objects and arrays. Then that should be easy analyze efficiently.
Basically, does the perf bottleneck primarily occur during the simulation itself, or in the post analysis? If the latter, then maybe the above could help resolve the issue.
+1 to dax's comment - I would like to understand a bit more the use case(s) where the performance issue shows itself, please comment on that @Strilanc - alternatively we can discuss on Cirq Cynque (added the label).
Regarding the implementation: I'm open towards this - I like @maffoo's abstract Result direction + revisiting the program proto + tying it in with the IQ points solution potentially.
Maybe the simplest solution is to change the classes where you measure and store results from Dict
to OrderedDict
. Then you can reduce that to raw ordered list of measurements and pack it as much as you want after the fact. This way you don't have to inject functionality into all the simulators or add extra layers of abstraction.
Alternatively, a function that takes a circuit and gives you the ordered list of measurements it produces. You can then create a packed ordered measurement array from that and the result dictionary.
Gut feeling is we probably want to avoid abstracting Result here. I don't see a case for these two representations to be used interchangeably. Trying to force them into an inheritance hierarchy may just cause confusion in the end. Better to keep them as two separate things for now.
I just saw there's a measurement_keys
protocol, but it returns a set instead of a list. If we change that to list, then it looks like it returns everything in order (it worked on all the test cases I tried anyway: subcircuits, repetition ids, key mappings). From which you can do the second option I mentioned above. Would that solve the problem?
Do you have an example of when the cost of key lookup would be significant compared to the cost of doing the simulation itself?
It's not so much that it's a significant runtime cost as that it is one of the cuts in a death by a thousand cuts.
...Using a bit packed format instead of a bool-per-byte format often gives noticeable speed benefits just due to touching less memory, but I wouldn't expect most data sources and consumers to want to deal with the hassle of packed bits.
I think it's notable that the flat format, and the bit packed format, are much closer to what is actually produced by the hardware. This suggests there may be implementation convenience benefits, and removal of conversion costs.
If we want this optimization during simulation, we could invert the above option. Have the record_measurement_result
function record it in a packed array (which will be only a single code change this once #3841 is merged, which creates an ActOnArgs
base class that contains this implementation), and then use the new measurement_keys_in_order
protocol mentioned above to unpack the array into a dictionary before returning the run
result. And of course provide a run_without_unpacking
that returns the raw array.
This approach would give the functionality to all the _act_on_
simulators (sparse, DM, clifford, maybe mps soon). @Strilanc, which simulators are most important here? Are you primarily concerned about these, or qsim, etc?
Cirq Cynque:
Currently Cirq has a bit of a performance problem. One of the contributing factors to this is that there are very few "escape hatches" that you can use to opt out of features that slow things down. This suggestion is to add one such escape hatch.
A "flat sample" is a 2d numpy uint8 array where the major index is "repetition index" and the minor axis lists off the output bits ordered by moment, then by within-moment order, then by qubit order within each measurement operation. If we want to go even more extreme, the array could be bit packed to reduce space usage by 8x although then it's much harder for people to use.
To be explicit:
Flat measurements can be more performant than non-flat measurements because they don't require scattering values across memory as they are collected. They also implicitly avoid key collision issues to do with measurements being repeated in a loop. This makes them much more amenable to e.g. error correction circuits.
I propose that we modify
Sampler
to have aflat_sample(circuit)
method that by default delegates toSampler.run
, but which we then write more efficient implementations for in cirq.Simulator.