Add support for "flat_sample": sampling without keys for performance

Strilanc commented 3 years ago

Currently Cirq has a bit of a performance problem. One of the contributing factors to this is that there are very few "escape hatches" that you can use to opt out of features that slow things down. This suggestion is to add one such escape hatch.

A "flat sample" is a 2d numpy uint8 array where the major index is "repetition index" and the minor axis lists off the output bits ordered by moment, then by within-moment order, then by qubit order within each measurement operation. If we want to go even more extreme, the array could be bit packed to reduce space usage by 8x although then it's much harder for people to use.

To be explicit:

k = 0
for t, moment in enumerate(circuit):
    for op in moment:
        if isinstance(op.gate, cirq.MeasurementGate):
            for q in op.qubits:
                print(f"flat index {k} refers to the measurement of qubit {q} in moment {t}")
                k += 1

Flat measurements can be more performant than non-flat measurements because they don't require scattering values across memory as they are collected. They also implicitly avoid key collision issues to do with measurements being repeated in a loop. This makes them much more amenable to e.g. error correction circuits.

I propose that we modify Sampler to have a flat_sample(circuit) method that by default delegates to Sampler.run, but which we then write more efficient implementations for in cirq.Simulator.

daxfohl commented 3 years ago

Do you have an example of when the cost of key lookup would be significant compared to the cost of doing the simulation itself? It would be valuable to be able to provide before/after benchmarks vs the cost of maintaining the feature.

daxfohl commented 3 years ago

Also I wonder if this would be better done using a strategy pattern instead of inheritance. So, you set (or preferably inject) repetition_sampling_strategy instead of requiring a method override. That would allow you to pick and choose which strategy you want rather than having it bound to the simulator class and dealing with inheritance. (There are a number of other things in simulators that I feel are orthogonal concerns and may be better to extract into strategy patterns as well, but that's a different topic).

(Also, granted I'm still new to Python so if there's a more Pythonic way of doing this then that's fine too.)

maffoo commented 3 years ago

I like the idea of a packed representation because it's very natural to get from hardware. A different way to support something like this would be to make cirq.Result an abstract type instead of a concrete type. Then samplers would be free to make their own implementations using whatever packed data representation they want, as long as they support "unpacking" to access data for particular measurements by key, per the Result API.

Note, the v1 engine API defined exactly this sort of packed result format: https://github.com/quantumlib/Cirq/blob/master/cirq/google/api/v1/program.proto#L35 (there, the results are actually bit packed, giving the ~8x size reduction mentioned by @Strilanc). We ended up changing this in the v2 api to use a different format with results indexed by key, but TBH I think it'd be worth revisiting this decision for performance reasons, especially if the cirq.Result type itself becomes abstract and we have implementations for packed data.

maffoo commented 3 years ago

A somewhat related issue is #3233 which wants to generalize the Result type along other axes such as allowing ints (for qudit measurements) or IQ points (for lower-level hardware readout info) instead of bits. In those cases, as here, I think decoupling the cirq.Result interface from particular concrete implementations would be very helpful.

daxfohl commented 3 years ago

Also is this something that would have to be done in the simulator, or is it something that could be done in post processing? If the latter, we can make a function that takes in a measurement map and a circuit, and output a json-like object that has all the nested subcircuit and repetition measurements organized as nested objects and arrays. Then that should be easy analyze efficiently.

Basically, does the perf bottleneck primarily occur during the simulation itself, or in the post analysis? If the latter, then maybe the above could help resolve the issue.

balopat commented 3 years ago

+1 to dax's comment - I would like to understand a bit more the use case(s) where the performance issue shows itself, please comment on that @Strilanc - alternatively we can discuss on Cirq Cynque (added the label).

Regarding the implementation: I'm open towards this - I like @maffoo's abstract Result direction + revisiting the program proto + tying it in with the IQ points solution potentially.

daxfohl commented 3 years ago

Maybe the simplest solution is to change the classes where you measure and store results from Dict to OrderedDict. Then you can reduce that to raw ordered list of measurements and pack it as much as you want after the fact. This way you don't have to inject functionality into all the simulators or add extra layers of abstraction.

Alternatively, a function that takes a circuit and gives you the ordered list of measurements it produces. You can then create a packed ordered measurement array from that and the result dictionary.

Gut feeling is we probably want to avoid abstracting Result here. I don't see a case for these two representations to be used interchangeably. Trying to force them into an inheritance hierarchy may just cause confusion in the end. Better to keep them as two separate things for now.

daxfohl commented 3 years ago

I just saw there's a measurement_keys protocol, but it returns a set instead of a list. If we change that to list, then it looks like it returns everything in order (it worked on all the test cases I tried anyway: subcircuits, repetition ids, key mappings). From which you can do the second option I mentioned above. Would that solve the problem?

Strilanc commented 3 years ago

Do you have an example of when the cost of key lookup would be significant compared to the cost of doing the simulation itself?

It's not so much that it's a significant runtime cost as that it is one of the cuts in a death by a thousand cuts.

...Using a bit packed format instead of a bool-per-byte format often gives noticeable speed benefits just due to touching less memory, but I wouldn't expect most data sources and consumers to want to deal with the hassle of packed bits.

I think it's notable that the flat format, and the bit packed format, are much closer to what is actually produced by the hardware. This suggests there may be implementation convenience benefits, and removal of conversion costs.

daxfohl commented 3 years ago

If we want this optimization during simulation, we could invert the above option. Have the record_measurement_result function record it in a packed array (which will be only a single code change this once #3841 is merged, which creates an ActOnArgs base class that contains this implementation), and then use the new measurement_keys_in_order protocol mentioned above to unpack the array into a dictionary before returning the run result. And of course provide a run_without_unpacking that returns the raw array.

This approach would give the functionality to all the _act_on_ simulators (sparse, DM, clifford, maybe mps soon). @Strilanc, which simulators are most important here? Are you primarily concerned about these, or qsim, etc?

balopat commented 3 years ago

Cirq Cynque:

unanimous agreement around getting this done: abstract Result class + having the ability for different runtime to use whatever native result format and then provide converter methods to it
needs detailed design (a little RFC would be nice!)

quantumlib / Cirq

Add support for "flat_sample": sampling without keys for performance #3808