Closed coreyostrove closed 2 months ago
Thanks for the feedback, @sserita! I have responded to a number of your questions/comments above. I've also just pushed a commit incorporating some of your suggested changes.
@coreyostrove I've gone through all my PR comments and (tried to) close the ones that were just about generators vs list comprehensions. There are still some meaningful comments left, so here's direct links to them:
Thanks for the feedback, @sserita and @rileyjmurray. I have addressed a number of your comments, so I have gone through and marked those as such to make it easier to see which conversation threads still need following up on.
Right now it looks like the main open discussions remaining are:
_copy_init
codepath. (Incidentally, I've actually wound up using _copy_init
for something other than direct copying on the 'feature-faster-layout-creation' branch, so it is proving useful for accelerating some other circuit methods besides copying).is_simple
attribute/method for Label
classes.SeparatePOVMCircuit
class.Let me know if there is anything I have missed, or if there is something I have marked as resolved which you think still needs working on.
@coreyostrove My current stance on your short list:
Awesome, I just pushed the changes addressing the Label
attribute conventions, so with that I think we're in good shape. Thanks @sserita and @rileyjmurray for your careful reviews and valuable feedback!
These updates all look great to me.
Beta tests clear, merging!
I have always been slightly perturbed by the amount of time that things such as COPA layout creation take when performing GST analysis. There are a number of reasons for this, but one family of reasons is that layout generation relies heavily on a number of Circuit primitives that are not as performant as they can be. This PR is a step toward rectifying this through re-implementation of a number of methods identified as bottlenecks through profiling in the context of actual end-to-end GST analysis.
In terms of the top-level numbers, for 1-qubit GST fitting using the 'full TP' parameterization with an experiment design containing 1120 circuits (max depth 128), the changes on this branch collectively reduce the time required for layout creation by ~50%, and reduce the overall runtime of the end-to-end protocol by ~15%. Your mileage will of course vary, however, and I would anticipate higher relative savings for larger/deeper experiment designs. I have not done profiling for the two-qubit case yet, so I won't posit any guesses on the impact there (I would guess it to be less relatively speaking, but can't easily ballpark this without actually testing).
Summary of Changes:
Here is a high-level summary of the changes. I will also be leaving some comments on particular bits of code to explain any more detailed reasoning for some changes (when in doubt the answer is likely that I tried a few different things and picked the fastest way to do something).
Note on synthetic benchmarks: For the quoted figures below these were mostly done using a randomly generated set of test circuits generated according to the following code snippet:
For equality testing this was run against
random_pairs
, for inclusion testing againstrandom_circuits
andrandom_circuits_2
, and for other tests againstrandom_circuits
unless stated otherwise.tup
property was called whenever this was needed and this property call regenenerated the tuple representation from scratch each time, which is expensive. In synthetic tests this resulted in a ~4.5X speedup, but in the context of the GST profiling described above this was closer to 10X (the computational savings should scale with circuit depth, so this isn't too shocking of a difference). As part of this change I have added static assertions to a few methods previously missing them for safety (we don't want to inadvertently invalidate the cached hash).__init__
pathway to create a copied circuit instance. With the new implementation we utilize a much faster specialized codepath inspired by the previously existing_fastinit
one. But with a few more microptimizations to take advantage of some of the additional promises we can rely on in the context of copying (which we can't as easily in_fastinit
). Speedups here range from 3-15X depending on the input and output circuit types. With the low end corresponding to switching static to editable (or vice versa), and the high end keeping the input type unchanged.This is the end of the changes where I profiled these using synthetic tests. For the rest I did this in the context of GST optimization using cProfile/snakeviz, but didn't carefully record before/after relative changes.
Microoptimization galore: There are a bunch of microoptimizations of the sort that usually don't matter except for when code is particularly hot, and the circuit code is. I won't give an exhaustive list, but this includes things like:
tuple([i for i in iterable])
is measurably faster thantuple(i for i in iterable)
in practice.if len(obj)>0
idiom if all you're doing is checking if an iterable is empty.Rework SeparatePOVMCircuit internals: The
full_effect_labels
property was getting called a lot, and each time it was it was regenerating the result from scratch (with some non-trivial preprocessing). I've updated the internals to cache this, and also deal with updating the cached value when necessary. This gave a measurable speedup.New
sandwich
method for circuits. This takes as input two tuples of Labels which are prepended and appended to a circuit. This is around twice as fast as doing this with a pair of calls to__add__
(since we only need 1 call to_fastinit
). This is not an uncommon task (I added as part of some changes to the circuit completion code in OpModel), so if you spot any places where this is being done it'd be worth switching (maybe @rileyjmurray can write a fancy regex to find these instances).Additional changes (either outside of the Circuit class, or otherwise not directly performance focused):
complete_circuit
andsplit_circuit
: There are now versions of thecomplete_circuit
andsplit_circuit
methods for the OpModel class, which respectively add prep and povm labels to a circuit or split them off, which operate on a complete list of circuits all at once. This allows for avoiding some model specific lookups (for values which don't change circuit-to-circuit) many times and is faster overall.expand_subcircuits_inplace
: The implementation ofexpand_subcircuits_inplace
has been reworked so that the behavior is now to recursively expand CircuitLabel objects. Previously the expansion routine only took a single pass through the circuit, meaning that if the circuit contained any nested CircuitLabels (which are perfectly happy with being nested, an easy way to get something like this is to callto_label
on a circuit containing CircuitLabels) we would only end up expanding out the top level. I noticed there were a few places in the layout creation/forward simulation code that implicitly assumed only a single-level of CircuitLabels would be present, which would likely have eventually led to some really annoying bugs to track down. This change covers that edge case, and should only be slightly more expensive than the previous routine in the typical case where there is only a single-level of CircuitLabels.simulate
method. If there is strong opposition to removing this functionality we can discuss this, of course. My reasoning here is that given the rest of pygsti's structure this functionality really isn't something we should be asking a Circuit to do, as the preferred pattern nowadays is delegate this job to the Model/ForwardSimulator, rather than asking the circuit itself. I don't know how widespread the use of this is (my guess is that this is largely vestigial), but I'd recommend slating this for removal in 0.9.14. Related note, I don't know if we have a running list of things we've deprecated and their expiry date, but if not we should assemble one.Finally, there are a number of miscellaneous documentation fixes, bugfixes, and updates to unit tests.
P.S. There are some more layout creation performance enhancements inbound on a separate branch, but I've opted to keep these changes together in a separate PR to keep things thematic and make it easier to review without context switching.