sandialabs / pyGSTi

A python implementation of Gate Set Tomography
http://www.pygsti.info
Apache License 2.0

Performance improvement for 2-qubit GST #402

Closed. eendebakpt closed this issue 1 month ago.

eendebakpt commented 5 months ago

The 2-qubit GST is computationally quite expensive. It is possible to improve performance by caching the operation-label-to-matrix computations. In particular, in MatrixForwardSimulator._compute_product_cache the line (https://github.com/sandialabs/pyGSTi/blob/master/pygsti/forwardsims/matrixforwardsim.py#L742-L747)

  gate = self.model.circuit_layer_operator(opLabel, 'op').to_dense(on_space='minimal')

is expensive.

Using a local test with

    @functools.lru_cache  # requires "import functools" at module level
    def label_to_gate_matrix(self, opLabel):
        return self.model.circuit_layer_operator(opLabel, 'op').to_dense(on_space='minimal')

and replacing the line above with gate = self.label_to_gate_matrix(opLabel), performance of the 2-qubit GST can be improved by 15-20%.

I am not sure where to place the cache though (or when to invalidate it). Options are on the MatrixForwardSimulator or on the model (a pygsti.models.explicitmodel.ExplicitOpModel object in my test).

Feedback on this idea would be welcome.

rileyjmurray commented 5 months ago

Hi, @eendebakpt! Thanks for doing this digging and for your suggestion. We're swamped right now and expect to be swamped until March Meeting is over, so we can't promise immediate progress on this. That said, we discussed this issue in today's pyGSTi developer meeting and we came to the following conclusions.

We're generally supportive of simple changes that buy us 10-15% savings in expensive routines. In this context we'd want to check a few things before implementing this fix.

One of those checks is admittedly vague and tricky. It really just means "some subset of {Stefan, Corey, Erik} want to look at code profiling results to see if anything else jumps out to them."

With that in mind, can you send us the profiling data you generated? Whatever mechanism works for you is fine with us.
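For reference, one minimal way to generate such data is Python's standard-library profiler (run_two_qubit_gst below is a placeholder for whatever script produced the timings above):

    import cProfile
    import pstats

    # profile the run and dump raw stats to a file
    cProfile.run('run_two_qubit_gst()', 'gst.prof')

    # summarize the 20 most expensive calls by cumulative time
    stats = pstats.Stats('gst.prof')
    stats.sort_stats('cumulative').print_stats(20)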

coreyostrove commented 2 months ago

Apologies for how long it has taken to look into your suggestion, @eendebakpt. I've finally done so, but have unfortunately concluded that simple caching won't work in this instance, at least not in the context of GST optimization.

The method in question is MatrixForwardSimulator._compute_product_cache. It is short enough that it is worth stepping through to explain, but also tricky enough that the fact that caching won't immediately help is non-obvious and worth documenting.

def _compute_product_cache(self, layout_atom_tree, resource_alloc):
    """
    Computes an array of operation sequence products (process matrices).

    Note: will *not* parallelize computation: parallelization should be
    done at a higher level.
    """
    dim = self.model.evotype.minimal_dim(self.model.state_space)

    # Note: resource_alloc gives procs that could work together to perform
    # computation, e.g. parallel dot products, but NOT to just partition
    # further (e.g. among the wrt_slices) as this is done in the layout.
    # This function doesn't make use of resource_alloc - all procs compute the same thing.

    eval_tree = layout_atom_tree
    cacheSize = len(eval_tree)
    prodCache = _np.zeros((cacheSize, dim, dim), 'd')
    scaleCache = _np.zeros(cacheSize, 'd')
The purpose of the product cache calculation is as follows: given a precomputed tree structure identifying which intermediate results (i.e. process matrices for partial circuit evaluations) to cache for efficient circuit-probability calculations across some large set of circuits, construct all of those intermediate results. This tree structure is what gets renamed to eval_tree here, and the remaining arrays store the calculated intermediate results.

    for iDest, iRight, iLeft in eval_tree:
        # Special case of an "initial operation" that can be filled directly
        if iRight is None:  # then iLeft gives operation:
            opLabel = iLeft
            if opLabel is None:
                prodCache[iDest] = _np.identity(dim)
                # Note: scaleCache[i] = 0.0 from initialization
            else:
                gate = self.model.circuit_layer_operator(opLabel, 'op').to_dense(on_space='minimal')
                #gate = initial_op_cache[opLabel]
                nG = max(_nla.norm(gate), 1.0)
                prodCache[iDest] = gate / nG
                scaleCache[iDest] = _np.log(nG)
            continue

        # combine iLeft + iRight => iDest
        # LEXICOGRAPHICAL VS MATRIX ORDER Note: we reverse iLeft <=> iRight from eval_tree because
        # (iRight,iLeft,iFinal) = tup implies circuit[i] = circuit[iLeft] + circuit[iRight], but we want:
        # since then matrixOf(circuit[i]) = matrixOf(circuit[iLeft]) * matrixOf(circuit[iRight])
        L, R = prodCache[iLeft], prodCache[iRight]
        prodCache[iDest] = _np.dot(L, R)
        scaleCache[iDest] = scaleCache[iLeft] + scaleCache[iRight]

        if prodCache[iDest].max() < _PSMALL and prodCache[iDest].min() > -_PSMALL:
            nL = max(_nla.norm(L), _np.exp(-scaleCache[iLeft]), 1e-300)
            nR = max(_nla.norm(R), _np.exp(-scaleCache[iRight]), 1e-300)
            sL, sR = L / nL, R / nR
            prodCache[iDest] = _np.dot(sL, sR)
            scaleCache[iDest] += _np.log(nL) + _np.log(nR)

    nanOrInfCacheIndices = (~_np.isfinite(prodCache)).nonzero()[0]  # may contain duplicates (a list, not a set)
    # since all scaled gates start with norm <= 1, products should all have norm <= 1
    assert(len(nanOrInfCacheIndices) == 0)

    return prodCache, scaleCache
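As an aside on the scaling logic near the end of the loop: products of many matrices with norm below one can underflow double precision, so the code keeps each cached product near unit norm and accumulates the logarithms of the extracted norms in scaleCache; the true product is then prodCache[i] * exp(scaleCache[i]). A small self-contained illustration of the trick (toy matrices, not pyGSTi code):

    import numpy as np

    g = 1e-4 * np.eye(2)              # a toy gate with tiny norm
    prod, log_scale = np.eye(2), 0.0
    for _ in range(100):              # the raw product would be ~1e-400: underflow
        prod = prod @ g
        n = max(np.linalg.norm(prod), 1e-300)
        prod, log_scale = prod / n, log_scale + np.log(n)

    # prod stays representable; the true product is prod * np.exp(log_scale)
    print(log_scale)                  # roughly 100 * log(1e-4), about -921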

Returning to the function: the for loop above is the main loop for evaluating the prescription in eval_tree. To understand what it is doing, it is helpful to have an actual evaluation tree to point at, so here is an example for 1-qubit GST with an XYI gate set (from a very small GST experiment with maximum depth of 1).

eval_tree=[(0, None, None), (1, None, Label(('Gxpi2', 0))), (2, None, Label(('Gypi2', 0))), (56, None, Label(())), (3, 1, 1), (6, 1, 2), (9, 2, 1), (10, 2, 2), (57, 56, 1), (58, 56, 2), (62, 1, 56), (68, 2, 56), (4, 3, 1), (5, 10, 2), (11, 9, 1), (14, 3, 2), (24, 9, 2), (33, 6, 1), (34, 6, 2), (38, 10, 1), (59, 57, 1), (63, 62, 1), (64, 62, 2), (69, 68, 1), (70, 68, 2), (74, 3, 56), (7, 4, 1), (8, 34, 2), (12, 11, 1), (13, 5, 2), (17, 4, 2), (20, 5, 1), (35, 33, 1), (39, 38, 1), (42, 14, 1), (43, 14, 2), (60, 59, 1), (61, 58, 10), (65, 63, 1), (71, 69, 1), (75, 74, 1), (76, 74, 2), (80, 4, 56), (86, 5, 56), (15, 7, 1), (16, 43, 2), (21, 20, 1), (25, 12, 1), (26, 24, 10), (27, 7, 2), (30, 20, 2), (36, 35, 1), (37, 8, 2), (40, 39, 1), (41, 13, 2), (44, 42, 1), (47, 17, 1), (48, 17, 2), (52, 13, 1), (66, 65, 1), (67, 64, 10), (72, 71, 1), (73, 70, 10), (77, 75, 1), (81, 80, 1), (82, 80, 2), (87, 86, 1), (88, 86, 2), (18, 15, 1), (19, 48, 2), (22, 21, 1), (23, 41, 2), (45, 44, 1), (46, 16, 2), (49, 47, 1), (53, 52, 1), (78, 77, 1), (79, 76, 10), (83, 81, 1), (89, 87, 1), (28, 18, 1), (29, 27, 10), (31, 22, 1), (32, 30, 10), (50, 49, 1), (51, 19, 2), (54, 53, 1), (55, 23, 2), (84, 83, 1), (85, 82, 10), (90, 89, 1), (91, 88, 10)]

The way to read each of these tuples is as a triple of indices (iDest, iRight, iLeft). iDest tells you the index into the prodCache array to store the result into. The iLeft and iRight indices indicate which circuits/sub-circuits the resulting process matrix should be constructed from. This is explained in this comment:

# combine iLeft + iRight => iDest
# LEXICOGRAPHICAL VS MATRIX ORDER Note: we reverse iLeft <=> iRight from eval_tree because
# (iRight,iLeft,iFinal) = tup implies circuit[i] = circuit[iLeft] + circuit[iRight], but we want:
# since then matrixOf(circuit[i]) = matrixOf(circuit[iLeft]) * matrixOf(circuit[iRight])

But since this is a bit confusing I'll step through the first few elements of the evaluation tree explicitly.

  1. (0, None, None): this is a special case that most often corresponds to the empty circuit. When both iLeft and iRight are None, we write the identity matrix into the specified iDest index of prodCache.
  2. (1, None, Label(('Gxpi2', 0))): when iRight is None but iLeft is not, this is another special case where we write the process matrix directly from the model into the specified location in prodCache.
  3. Jumping ahead to (3, 1, 1): this says we should take the process matrices currently stored at indices iRight and iLeft of prodCache, multiply them together, and store the resulting process matrix at index 3 of prodCache. Since we know that index 1 of prodCache previously had the process matrix for Circuit([Label('Gxpi2',0)]) stored in it, this means we're calculating and storing the process matrix for Circuit([Label('Gxpi2',0), Label('Gxpi2',0)]).
  4. Rinse and repeat, iteratively building up the process matrices for larger and larger circuits, all the while re-using and caching intermediate results (a small standalone sketch of this loop follows below).
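To make the indexing concrete, here is the small standalone sketch of the same loop promised above (hypothetical 2x2 matrices and plain string labels stand in for the model's real process matrices and Label objects; this is illustrative, not pyGSTi code):

    import numpy as np

    # hypothetical stand-ins for the model's dense process matrices
    gates = {'Gxpi2': np.array([[0., 1.], [1., 0.]]),
             'Gypi2': np.array([[0., -1.], [1., 0.]])}

    # (iDest, iRight, iLeft) triples, mirroring the first few entries above
    eval_tree = [(0, None, None),      # empty circuit -> identity
                 (1, None, 'Gxpi2'),   # filled directly from the "model"
                 (2, None, 'Gypi2'),
                 (3, 1, 1),            # prod[3] = prod[1] @ prod[1]
                 (4, 3, 2)]            # prod[4] = prod[2] @ prod[3]

    prod = [None] * len(eval_tree)
    for iDest, iRight, iLeft in eval_tree:
        if iRight is None:             # the "initial operation" special cases
            prod[iDest] = np.eye(2) if iLeft is None else gates[iLeft]
        else:                          # note the iLeft/iRight ordering, per the comment above
            prod[iDest] = prod[iLeft] @ prod[iRight]

    print(prod[3])                     # the two-Gxpi2 circuit: identity for these toy gates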

Ok, now that I have explained how the evaluation works: the reason caching the values of self.model.circuit_layer_operator(opLabel, 'op').to_dense(on_space='minimal') doesn't help is that, in any given call to _compute_product_cache, this gets called exactly once for each operation in the Model. We also can't (easily) cache this across different calls to _compute_product_cache, because in the context of optimization we are nearly constantly modifying the parameters of the model during the inner loop, and almost never evaluate at the same point in parameter space more than once. So the only unconditionally safe place to cache these values is local to a single _compute_product_cache call.

To be clear, I'm not saying that useful caching between different calls to _compute_product_cache is impossible. We discussed this possibility at a recent development meeting and concluded, however, that doing so safely would require building out additional infrastructure letting us detect when a change in a model's parameter vector touches the parameters of any given child operation, so that we could partially invalidate the cache for just those operations we know have changed.
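For illustration only, here is a rough sketch of what such infrastructure might look like (not pyGSTi API; OpMatrixCache, param_slice, and compute_fn are all hypothetical names): cache each operation's dense matrix together with a fingerprint of just the parameter-vector entries it depends on, and recompute only when that fingerprint changes.

    import numpy as np

    class OpMatrixCache:
        """Hypothetical per-operation cache with partial invalidation."""

        def __init__(self):
            self._cache = {}  # op_label -> (fingerprint, dense_matrix)

        def get(self, op_label, param_slice, param_vec, compute_fn):
            # param_slice marks which entries of the model's parameter vector
            # this operation depends on; their bytes act as a cheap fingerprint
            fingerprint = param_vec[param_slice].tobytes()
            hit = self._cache.get(op_label)
            if hit is not None and hit[0] == fingerprint:
                return hit[1]            # unchanged parameters: reuse cached matrix
            mat = compute_fn(op_label)   # e.g. the to_dense(...) call quoted above
            self._cache[op_label] = (fingerprint, mat)
            return mat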

P.S. Just to be explicit, a corollary of the above analysis is that the LRU cache implementation, at least as written, won't work in this context. The issue is that the LRU cache only checks the function arguments and cannot detect when the internal state of self.model changes. Having now written that, we could imagine passing the model in as an argument to this function (or rather a proxy value, since Model objects themselves are not hashable), but per the analysis above this would primarily be useful in non-GST optimization contexts.
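A hedged sketch of that last idea (make_cached_lookup is a hypothetical helper, and it assumes the model exposes its parameter vector via a to_vector()-style accessor): bytes of the parameter vector serve as the hashable proxy for the model's state.

    import functools

    def make_cached_lookup(model):
        # the fingerprint stands in for the model's (unhashable) internal state;
        # when any parameter changes, the key changes and lru_cache recomputes
        @functools.lru_cache(maxsize=None)
        def _lookup(fingerprint, op_label):
            return model.circuit_layer_operator(op_label, 'op').to_dense(on_space='minimal')

        def label_to_gate_matrix(op_label):
            fingerprint = model.to_vector().tobytes()  # assumed parameter-vector accessor
            return _lookup(fingerprint, op_label)

        return label_to_gate_matrix

Consistent with the analysis above, during GST optimization the fingerprint would change on nearly every call, so this would mostly miss; it only pays off when the model is held fixed across many calls.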

eendebakpt commented 1 month ago

@coreyostrove Thanks for the extensive explanation; it improved my understanding of this part of the code a lot! I will close the issue; if we have other ideas for performance improvements we'll open a new one.