Faster Circuit Primitives

coreyostrove commented 4 months ago

I have always been slightly perturbed by the amount of time that things such as COPA layout creation take when performing GST analysis. There are a number of reasons for this, but one family of reasons is that layout generation relies heavily on a number of Circuit primitives that are not as performant as they can be. This PR is a step toward rectifying this through re-implementation of a number of methods identified as bottlenecks through profiling in the context of actual end-to-end GST analysis.

In terms of the top-level numbers, for 1-qubit GST fitting using the 'full TP' parameterization with an experiment design containing 1120 circuits (max depth 128), the changes on this branch collectively reduce the time required for layout creation by ~50%, and reduce the overall runtime of the end-to-end protocol by ~15%. Your mileage will of course vary, however, and I would anticipate higher relative savings for larger/deeper experiment designs. I have not done profiling for the two-qubit case yet, so I won't posit any guesses on the impact there (I would guess it to be less relatively speaking, but can't easily ballpark this without actually testing).

Summary of Changes:

Here is a high-level summary of the changes. I will also be leaving some comments on particular bits of code to explain any more detailed reasoning for some changes (when in doubt the answer is likely that I tried a few different things and picked the fastest way to do something).

Note on synthetic benchmarks: For the quoted figures below these were mostly done using a randomly generated set of test circuits generated according to the following code snippet:

import pygsti
from pygsti.circuits.circuitconstruction import list_random_circuits_onelen
from pygsti.modelpacks import smq1Q_XYI
import numpy as np
from numpy.random import default_rng
from itertools import product
lens = [1,2,4,8,16,32,64]
op_labels = tuple(smq1Q_XYI.target_model().operations.keys())
random_circuits = []
for length in lens:
    random_circuits.extend(list_random_circuits_onelen(op_labels, length, 100, seed = 1234))
random_circuits_2 = []
for length in lens:
    random_circuits_2.extend(list_random_circuits_onelen(op_labels, length, 100, seed = 12345))
rng = default_rng(1234)
rng.shuffle(random_circuits)
rng = default_rng(12345)
rng.shuffle(random_circuits_2)
random_pairs = list(product(random_circuits,repeat=2))

For equality testing this was run against random_pairs, for inclusion testing against random_circuits and random_circuits_2, and for other tests against random_circuits unless stated otherwise.

Circuit equality testing: The primary change was adding a length check to allow quicker short circuiting. A somewhat faster codepath was also added for comparisons between static circuits, avoiding recomputation of a circuit's tuple representation. In synthetic testing this gave a ~3X speed up.
Inclusion testing: Nothing specific was added here, but the change to circuit equality testing also sped this up by a slightly larger factor of ~3.4.
Hashing: Hashes are now precomputed and cached at the point at which a circuit is made static (this is an accepted practice in python for immutable objects, e.g. strings cache their hashes). Related to this change, at staticization time we precompute the tuple representation of the circuit and store this as well. This is the thing which actually gets hashed. Previously the tup property was called whenever this was needed and this property call regenenerated the tuple representation from scratch each time, which is expensive. In synthetic tests this resulted in a ~4.5X speedup, but in the context of the GST profiling described above this was closer to 10X (the computational savings should scale with circuit depth, so this isn't too shocking of a difference). As part of this change I have added static assertions to a few methods previously missing them for safety (we don't want to inadvertently invalidate the cached hash).
Copying: Previously copying a circuit went through the standard __init__ pathway to create a copied circuit instance. With the new implementation we utilize a much faster specialized codepath inspired by the previously existing _fastinit one. But with a few more microptimizations to take advantage of some of the additional promises we can rely on in the context of copying (which we can't as easily in _fastinit). Speedups here range from 3-15X depending on the input and output circuit types. With the low end corresponding to switching static to editable (or vice versa), and the high end keeping the input type unchanged.

This is the end of the changes where I profiled these using synthetic tests. For the rest I did this in the context of GST optimization using cProfile/snakeviz, but didn't carefully record before/after relative changes.

Microoptimization galore: There are a bunch of microoptimizations of the sort that usually don't matter except for when code is particularly hot, and the circuit code is. I won't give an exhaustive list, but this includes things like:
- Attribute access is faster than method/property access/resolution (by 2X or more in some cases). So many trivial (and some non-trivial) property calls have had the corresponding attributes inlined.
- Inlining functions to reduce function call overhead.
- List comprehensions are faster than generator expressions. So, e.g. tuple([i for i in iterable]) is measurably faster than tuple(i for i in iterable) in practice.
- List comprehensions are faster than the 'build via appends in for loop' approach. Often by a notable amount.
- Using falsy/truthiness directly is faster than the if len(obj)>0 idiom if all you're doing is checking if an iterable is empty.
Rework SeparatePOVMCircuit internals: The full_effect_labels property was getting called a lot, and each time it was it was regenerating the result from scratch (with some non-trivial preprocessing). I've updated the internals to cache this, and also deal with updating the cached value when necessary. This gave a measurable speedup.
New sandwich method for circuits. This takes as input two tuples of Labels which are prepended and appended to a circuit. This is around twice as fast as doing this with a pair of calls to __add__ (since we only need 1 call to _fastinit). This is not an uncommon task (I added as part of some changes to the circuit completion code in OpModel), so if you spot any places where this is being done it'd be worth switching (maybe @rileyjmurray can write a fancy regex to find these instances).

Additional changes (either outside of the Circuit class, or otherwise not directly performance focused):

Bulk versions of complete_circuit and split_circuit: There are now versions of the complete_circuit and split_circuit methods for the OpModel class, which respectively add prep and povm labels to a circuit or split them off, which operate on a complete list of circuits all at once. This allows for avoiding some model specific lookups (for values which don't change circuit-to-circuit) many times and is faster overall.
expand_subcircuits_inplace: The implementation of expand_subcircuits_inplace has been reworked so that the behavior is now to recursively expand CircuitLabel objects. Previously the expansion routine only took a single pass through the circuit, meaning that if the circuit contained any nested CircuitLabels (which are perfectly happy with being nested, an easy way to get something like this is to call to_label on a circuit containing CircuitLabels) we would only end up expanding out the top level. I noticed there were a few places in the layout creation/forward simulation code that implicitly assumed only a single-level of CircuitLabels would be present, which would likely have eventually led to some really annoying bugs to track down. This change covers that edge case, and should only be slightly more expensive than the previous routine in the typical case where there is only a single-level of CircuitLabels.
I've added a deprecation warning to the simulate method. If there is strong opposition to removing this functionality we can discuss this, of course. My reasoning here is that given the rest of pygsti's structure this functionality really isn't something we should be asking a Circuit to do, as the preferred pattern nowadays is delegate this job to the Model/ForwardSimulator, rather than asking the circuit itself. I don't know how widespread the use of this is (my guess is that this is largely vestigial), but I'd recommend slating this for removal in 0.9.14. Related note, I don't know if we have a running list of things we've deprecated and their expiry date, but if not we should assemble one.

Finally, there are a number of miscellaneous documentation fixes, bugfixes, and updates to unit tests.

P.S. There are some more layout creation performance enhancements inbound on a separate branch, but I've opted to keep these changes together in a separate PR to keep things thematic and make it easier to review without context switching.

coreyostrove commented 4 months ago

Thanks for the feedback, @sserita! I have responded to a number of your questions/comments above. I've also just pushed a commit incorporating some of your suggested changes.

rileyjmurray commented 4 months ago

@coreyostrove I've gone through all my PR comments and (tried to) close the ones that were just about generators vs list comprehensions. There are still some meaningful comments left, so here's direct links to them:

https://github.com/sandialabs/pyGSTi/pull/445#discussion_r1627643676 (there are many copies of this one)
https://github.com/sandialabs/pyGSTi/pull/445#discussion_r1627673976
https://github.com/sandialabs/pyGSTi/pull/445#discussion_r1627678229

coreyostrove commented 4 months ago

Thanks for the feedback, @sserita and @rileyjmurray. I have addressed a number of your comments, so I have gone through and marked those as such to make it easier to see which conversation threads still need following up on.

Right now it looks like the main open discussions remaining are:

Code reuse options for the new _copy_init codepath. (Incidentally, I've actually wound up using _copy_init for something other than direct copying on the 'feature-faster-layout-creation' branch, so it is proving useful for accelerating some other circuit methods besides copying).
The right way to refactor the is_simple attribute/method for Label classes.
Whether or not we want to add a static flag/static checks to the SeparatePOVMCircuit class.

Let me know if there is anything I have missed, or if there is something I have marked as resolved which you think still needs working on.

sserita commented 4 months ago

@coreyostrove My current stance on your short list:

This would be something that would be great to sort out, but if you have already tried and it was unclear the best path for reuse, then your current solution of comments pointing people to keep the functions in sync is a reasonable one. It does make some sense that the restricted version of _copy_init has optimizations that makes it hard to use within _bare_init... It was just not obvious to me at first glance that those were truly limiting.
I think this is definitely something we should fix, and I left a suggestion on the relevant thread.
When I made my comment about this, I didn't realize we were in SeparatePOVMCircuit. I'm OK with shelving this for now - I just wanted to make sure that Circuit had consistent static logic, not necessarily force other classes to adopt the static mode.

coreyostrove commented 3 months ago

Awesome, I just pushed the changes addressing the Label attribute conventions, so with that I think we're in good shape. Thanks @sserita and @rileyjmurray for your careful reviews and valuable feedback!

enielse commented 3 months ago

These updates all look great to me.

sserita commented 2 months ago

Beta tests clear, merging!

sandialabs / pyGSTi

Faster Circuit Primitives #445

Summary of Changes: