Closed: @eendebakpt closed this issue 1 month ago.
Hi, @eendebakpt! Thanks for doing this digging and for your suggestion. We're swamped right now and expect to be swamped until March Meeting is over, so we can't promise immediate progress on this. That said, we discussed this issue in today's pyGSTi developer meeting and we came to the following conclusions.
We're generally supportive of simple changes that buy us 10% -- 15% savings in expensive routines. In this context we'd want to check a few things before implementing this fix:
Of course, the last point is vague and tricky. It really just means "some subset of {Stefan, Corey, Erik} want to look at code profiling results to see if anything else jumps out to them."
With that in mind, can you send us the profiling data you generated? Whatever mechanism works for you is fine with us.
Apologies for how long it has taken to look into your suggestion, @eendebakpt. I've finally done so, but I have unfortunately concluded that simple caching won't work in this instance. At least not in the context of GST optimization.
The method in question here is MatrixForwardSimulator._compute_product_cache. It is a short enough function that it is worth stepping through here to explain more, but it is also tricky enough that the fact that caching won't immediately help is non-obvious and worth documenting.
def _compute_product_cache(self, layout_atom_tree, resource_alloc):
    """
    Computes an array of operation sequence products (process matrices).

    Note: will *not* parallelize computation: parallelization should be
    done at a higher level.
    """
    dim = self.model.evotype.minimal_dim(self.model.state_space)

    #Note: resource_alloc gives procs that could work together to perform
    # computation, e.g. parallel dot products, but NOT to just partition
    # further (e.g. among the wrt_slices) as this is done in the layout.
    # This function doesn't make use of resource_alloc - all procs compute the same thing.
    eval_tree = layout_atom_tree
    cacheSize = len(eval_tree)
    prodCache = _np.zeros((cacheSize, dim, dim), 'd')
    scaleCache = _np.zeros(cacheSize, 'd')
The purpose of the product cache calculation is to construct all of the requisite intermediate results, given a precomputed tree structure that identifies which intermediate calculations (i.e. process matrices for partial circuit evaluations) to cache for efficient circuit probability calculations across some large set of circuits. This tree structure is what is renamed to eval_tree here, and the remaining arrays store the calculated intermediate results.
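One detail worth flagging before the main loop: the separate scaleCache array. Products of many process matrices can underflow double precision, so the code rescales factors and accumulates the logs of the scale factors on the side. A plain-NumPy sketch of the idea (this is not pyGSTi code, and the strongly damping 1e-4 matrix is made up for illustration):

```python
import numpy as np

gate = 1e-4 * np.eye(2)  # hypothetical strongly damping "process matrix"

# Naive product of 100 copies underflows double precision (1e-400 << 1e-308):
naive = np.eye(2)
for _ in range(100):
    naive = naive @ gate
print(naive[0, 0])  # 0.0 -- underflowed

# Scaled product, in the spirit of prodCache/scaleCache: rescale each factor
# to roughly unit norm and accumulate the log of the scale factors separately.
prod, log_scale = np.eye(2), 0.0
for _ in range(100):
    n = np.linalg.norm(gate)        # Frobenius norm
    prod = prod @ (gate / n)        # stays comfortably representable
    log_scale += np.log(n)

# The true product is prod * exp(log_scale); its (0, 0) entry in log10:
print(np.log10(prod[0, 0]) + log_scale / np.log(10))  # approximately -400.0
```

The actual routine only rescales lazily (when a product's entries drop below _PSMALL), but the bookkeeping is the same: the matrix part stays well-scaled and the magnitude lives in the log-scale array.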
    for iDest, iRight, iLeft in eval_tree:
        # Special case of an "initial operation" that can be filled directly
        if iRight is None:  # then iLeft gives operation:
            opLabel = iLeft
            if opLabel is None:
                prodCache[iDest] = _np.identity(dim)
                # Note: scaleCache[i] = 0.0 from initialization
            else:
                gate = self.model.circuit_layer_operator(opLabel, 'op').to_dense(on_space='minimal')
                #gate = initial_op_cache[opLabel]
                nG = max(_nla.norm(gate), 1.0)
                prodCache[iDest] = gate / nG
                scaleCache[iDest] = _np.log(nG)
            continue

        # combine iLeft + iRight => iDest
        # LEXICOGRAPHICAL VS MATRIX ORDER Note: we reverse iLeft <=> iRight from eval_tree because
        # (iRight,iLeft,iFinal) = tup implies circuit[i] = circuit[iLeft] + circuit[iRight], but we want:
        # since then matrixOf(circuit[i]) = matrixOf(circuit[iLeft]) * matrixOf(circuit[iRight])
        L, R = prodCache[iLeft], prodCache[iRight]
        prodCache[iDest] = _np.dot(L, R)
        scaleCache[iDest] = scaleCache[iLeft] + scaleCache[iRight]

        if prodCache[iDest].max() < _PSMALL and prodCache[iDest].min() > -_PSMALL:
            nL = max(_nla.norm(L), _np.exp(-scaleCache[iLeft]), 1e-300)
            nR = max(_nla.norm(R), _np.exp(-scaleCache[iRight]), 1e-300)
            sL, sR = L / nL, R / nR
            prodCache[iDest] = _np.dot(sL, sR)
            scaleCache[iDest] += _np.log(nL) + _np.log(nR)

    nanOrInfCacheIndices = (~_np.isfinite(prodCache)).nonzero()[0]  # may be duplicates (a list, not a set)
    # since all scaled gates start with norm <= 1, products should all have norm <= 1
    assert(len(nanOrInfCacheIndices) == 0)
    return prodCache, scaleCache
This is the main loop for evaluating the prescription in eval_tree. To understand what this is doing, it is helpful to have an actual evaluation tree to point at, so here is an example of one for 1-qubit GST with an XYI gate set (for a very small GST experiment with maximum depth of 1).
eval_tree=[(0, None, None), (1, None, Label(('Gxpi2', 0))), (2, None, Label(('Gypi2', 0))), (56, None, Label(())), (3, 1, 1), (6, 1, 2), (9, 2, 1), (10, 2, 2), (57, 56, 1), (58, 56, 2), (62, 1, 56), (68, 2, 56), (4, 3, 1), (5, 10, 2), (11, 9, 1), (14, 3, 2), (24, 9, 2), (33, 6, 1), (34, 6, 2), (38, 10, 1), (59, 57, 1), (63, 62, 1), (64, 62, 2), (69, 68, 1), (70, 68, 2), (74, 3, 56), (7, 4, 1), (8, 34, 2), (12, 11, 1), (13, 5, 2), (17, 4, 2), (20, 5, 1), (35, 33, 1), (39, 38, 1), (42, 14, 1), (43, 14, 2), (60, 59, 1), (61, 58, 10), (65, 63, 1), (71, 69, 1), (75, 74, 1), (76, 74, 2), (80, 4, 56), (86, 5, 56), (15, 7, 1), (16, 43, 2), (21, 20, 1), (25, 12, 1), (26, 24, 10), (27, 7, 2), (30, 20, 2), (36, 35, 1), (37, 8, 2), (40, 39, 1), (41, 13, 2), (44, 42, 1), (47, 17, 1), (48, 17, 2), (52, 13, 1), (66, 65, 1), (67, 64, 10), (72, 71, 1), (73, 70, 10), (77, 75, 1), (81, 80, 1), (82, 80, 2), (87, 86, 1), (88, 86, 2), (18, 15, 1), (19, 48, 2), (22, 21, 1), (23, 41, 2), (45, 44, 1), (46, 16, 2), (49, 47, 1), (53, 52, 1), (78, 77, 1), (79, 76, 10), (83, 81, 1), (89, 87, 1), (28, 18, 1), (29, 27, 10), (31, 22, 1), (32, 30, 10), (50, 49, 1), (51, 19, 2), (54, 53, 1), (55, 23, 2), (84, 83, 1), (85, 82, 10), (90, 89, 1), (91, 88, 10)]
The way to read each of these tuples is as a triple of indices (iDest, iRight, iLeft). iDest tells you the index into the prodCache array at which to store the result. The iLeft and iRight indices indicate which circuits/sub-circuits the resulting process matrix should be constructed from. This is explained in this comment:
# combine iLeft + iRight => iDest
# LEXICOGRAPHICAL VS MATRIX ORDER Note: we reverse iLeft <=> iRight from eval_tree because
# (iRight,iLeft,iFinal) = tup implies circuit[i] = circuit[iLeft] + circuit[iRight], but we want:
# since then matrixOf(circuit[i]) = matrixOf(circuit[iLeft]) * matrixOf(circuit[iRight])
But since this is a bit confusing I'll step through the first few elements of the evaluation tree explicitly.
- (0, None, None): this is a special case that most often corresponds to the empty circuit. When both iLeft and iRight are None, we write the identity matrix into the specified iDest index of prodCache.
- (1, None, Label(('Gxpi2', 0))): when iRight is None but iLeft is not, this is another special case where we write the process matrix directly from the model into the specified location in prodCache.
- (3, 1, 1): this says that we should take the process matrices currently stored at iRight and iLeft in prodCache, multiply those together, and store the resulting process matrix at index 3 in prodCache. Since we know that index 1 of prodCache previously had the process matrix for Circuit([Label('Gxpi2',0)]) stored in it, this means we're calculating and storing the process matrix for Circuit([Label('Gxpi2',0), Label('Gxpi2',0)]).
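The bookkeeping above can be checked with a few lines of plain Python. This is a toy re-implementation (not pyGSTi code) over the first few triples of the example tree, with string labels and arbitrary 2x2 matrices standing in for the model's Labels and process matrices:

```python
import numpy as np

# Stand-ins for the model's process matrices; any matrices illustrate the bookkeeping.
Gx = np.array([[0., -1.], [1., 0.]])
Gy = np.array([[0., 1.], [-1., 0.]])
ops = {'Gx': Gx, 'Gy': Gy}

# First few triples of the example tree, in (iDest, iRight, iLeft) order.
eval_tree = [(0, None, None),   # empty circuit -> identity
             (1, None, 'Gx'),   # initial op -> copied from the "model"
             (2, None, 'Gy'),
             (3, 1, 1),         # product of entry 1 with itself (Gxpi2, Gxpi2 analogue)
             (6, 1, 2)]         # product of the entries at indices 2 and 1

prodCache = np.zeros((7, 2, 2))
for iDest, iRight, iLeft in eval_tree:
    if iRight is None:
        prodCache[iDest] = np.eye(2) if iLeft is None else ops[iLeft]
    else:
        # same combination rule as the source: dot(prodCache[iLeft], prodCache[iRight])
        prodCache[iDest] = prodCache[iLeft] @ prodCache[iRight]

assert np.allclose(prodCache[3], Gx @ Gx)                      # the (3, 1, 1) case above
assert np.allclose(prodCache[6], prodCache[2] @ prodCache[1])  # combined from cached entries
```

(The scaling machinery is dropped here since it is orthogonal to how the triples are consumed.)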
Ok, now that I have explained how the evaluation works: the reason caching the values of self.model.circuit_layer_operator(opLabel, 'op').to_dense(on_space='minimal') doesn't work is that in any given call to _compute_product_cache this only gets called exactly once for each operation in the Model. We also can't (easily) cache this across different calls of _compute_product_cache, because in the context of optimization we are nearly constantly modifying the parameters of the model during the inner loop, and almost never evaluate it at the same point in parameter space more than once. So the only unconditionally safe place to cache these is local to a single _compute_product_cache call.
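The "exactly once per call" claim is easy to check mechanically: each operation label appears in exactly one (iDest, None, opLabel) entry of the tree, so a memo local to one call can never get a hit. A quick sketch on a tree of the same shape as the example above (string labels stand in for pyGSTi Label objects):

```python
# Initial-op entries are those with iRight None and a non-None label.
eval_tree = [(0, None, None), (1, None, 'Gxpi2'), (2, None, 'Gypi2'),
             (56, None, '()'), (3, 1, 1), (6, 1, 2), (9, 2, 1), (10, 2, 2)]

initial_labels = [iLeft for _, iRight, iLeft in eval_tree
                  if iRight is None and iLeft is not None]

# Every label occurs exactly once, so a per-call cache would never save a lookup:
assert len(initial_labels) == len(set(initial_labels))
print(initial_labels)  # ['Gxpi2', 'Gypi2', '()']
```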
To be clear, I'm not saying that useful caching between different calls to _compute_product_cache isn't possible at all. We discussed this possibility at a recent development meeting, however, and concluded that doing so safely would rely on building out additional infrastructure to detect when a change in a model's parameter vector touches the parameters of any given child operation, so that we could perform partial cache invalidation of just those operations we know have changed.
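As a sketch of what such infrastructure might look like, here is a deliberately coarse version: tag the model with a version counter bumped on every parameter update, and have the simulator drop its dense-matrix cache whenever the counter moves. All names below are hypothetical; nothing here exists in pyGSTi, and the partial-invalidation scheme described above would additionally track which operations each parameter touches rather than clearing everything.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in for an ExplicitOpModel; not a pyGSTi class."""
    def __init__(self, mats):
        self._mats = dict(mats)
        self.version = 0                       # bumped on every parameter change
    def update_op(self, label, mat):           # crude stand-in for Model.from_vector
        self._mats[label] = mat
        self.version += 1
    def to_dense(self, label):
        return self._mats[label]

class ToySimulator:
    """Caches dense matrices across calls, invalidating on version changes."""
    def __init__(self, model):
        self.model = model
        self._cache = {}
        self._cached_version = -1
        self.model_calls = 0                   # instrumentation for the demo
    def dense_op(self, label):
        if self._cached_version != self.model.version:
            self._cache.clear()                # parameters moved: drop everything
            self._cached_version = self.model.version
        if label not in self._cache:
            self._cache[label] = self.model.to_dense(label)
            self.model_calls += 1
        return self._cache[label]

m = ToyModel({'Gx': np.eye(2)})
s = ToySimulator(m)
s.dense_op('Gx'); s.dense_op('Gx')
print(s.model_calls)                           # 1 -- second lookup was a cache hit
m.update_op('Gx', 2 * np.eye(2))
print(s.dense_op('Gx')[0, 0], s.model_calls)   # 2.0 2 -- cache was invalidated
```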
P.S. Just to be explicit, a corollary to the above analysis is that the LRU cache implementation, at least as written, won't work in this context. The issue is that the LRU cache only checks the function arguments and cannot detect when the internal state of self.model changes. That said, we could imagine passing the model in as an argument to this function (or rather a proxy value, since Model objects themselves are not hashable), but per the analysis above this would primarily be of use in non-GST optimization contexts.
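A minimal reproduction of that failure mode (toy class, not pyGSTi code): functools.lru_cache keys only on the call arguments, so mutation of state the cached method reads goes unnoticed.

```python
from functools import lru_cache

class Sim:
    def __init__(self):
        self.gain = 1.0                  # stands in for the model's parameters

    @lru_cache(maxsize=None)             # keys on (self, label) only
    def dense(self, label):
        return self.gain                 # secretly depends on mutable state

s = Sim()
first = s.dense('Gx')
s.gain = 2.0                             # "model parameters" change...
second = s.dense('Gx')
print(first, second)                     # 1.0 1.0 -- stale cached value returned
```

(Decorating an instance method this way also keeps self alive in the cache; it is shown here only to illustrate the staleness bug.)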
@coreyostrove Thanks for the extensive explanation, it improved my understanding of this part of the code a lot! I will close the issue; if we have other ideas for performance improvements we'll open a new one.
The 2-qubit GST is computationally quite expensive. It is possible to improve performance by caching the operation-label-to-matrix computations. In particular, in MatrixForwardSimulator._compute_product_cache the line (https://github.com/sandialabs/pyGSTi/blob/master/pygsti/forwardsims/matrixforwardsim.py#L742-L747) is expensive.
Using a local test with
and replacing the line above with
gate = self.label_to_gate_matrix(opLabel)
performance of the 2-qubit GST can be improved by 15-20%. I am not sure where to place the cache, though (or when to invalidate it). Options are the MatrixForwardSimulator or the self.model (a pygsti.models.explicitmodel.ExplicitOpModel object in my test). Feedback on this idea would be welcome.