pmgbergen / porepy

Python Simulation Tool for Fractured and Deformable Porous Media

Reimplementation of Ad projection operators #1182

Open keileg opened 5 months ago

keileg commented 5 months ago

Background

In the Ad operator framework, the classes SubdomainProjection, MortarProjection, Trace and InvTrace all represent mappings between grid entities of various kinds. These are all implemented as projection matrices that, during parsing of an operator tree, are left multiplied with arrays and matrices. This is unsatisfactory for two reasons:

  1. Constructing the projection matrices is slow and adds significantly to the computational cost of working with the Ad operators. As a partial remedy, some if not all of the projection classes construct the projection matrices during initialization of the projection object, but this to a large degree just moves the computational burden. The ahead-of-time construction will also be hard to make compatible with the caching outlined in #1181.
  2. Projection operations are in reality slicing (for a global-to-local restriction) or filling in of selected rows in large arrays and matrices (for local-to-global prolongation). Representing this as matrix-matrix / matrix-vector products likely adds significantly to the computational cost.
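
To make point 2 concrete, here is a minimal, standalone sketch (made-up sizes and index arrays, not PorePy code) of the same restriction expressed as a matrix product and as direct indexing:

import numpy as np
import scipy.sparse as sps

n_global, n_local = 10, 4
local_rows = np.array([2, 3, 7, 8])  # hypothetical subset of the global dofs
v = np.random.rand(n_global)

# Current approach: build an explicit restriction matrix and multiply.
R = sps.csr_matrix(
    (np.ones(n_local), (np.arange(n_local), local_rows)), shape=(n_local, n_global)
)
restricted_by_matrix = R @ v

# Equivalent slicing: no matrix is ever formed.
restricted_by_slicing = v[local_rows]
assert np.allclose(restricted_by_matrix, restricted_by_slicing)

# Prolongation is the reverse: fill selected rows of a larger array.
prolonged = np.zeros(n_global)
prolonged[local_rows] = restricted_by_slicing
assert np.allclose(prolonged, R.T @ restricted_by_slicing)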

Suggested change

The projection methods, say SubdomainProjection.face_restriction(), currently return pp.ad.SparseArrays. Instead, they should return a new slicer object that roughly looks like this:

class RestrictionSlicer:

    _target_indices: np.ndarray
    # EK note to self: In the future, this may also be a PETSc IS object

    _weights: np.ndarray
    # Weights to be assigned to the different rows

    def __init__(self, dim: int, target_indices: np.ndarray, weights: np.ndarray) -> None:
        self._target_indices = target_indices
        self._weights = weights

    # We may need to use typing overload here to make sure mypy is happy: The return type
    # should always be the same as the input other.
    def __matmul__(self, other: pp.ad.AdArray | np.ndarray) -> pp.ad.AdArray | np.ndarray:
        # In its simplest form (if all of self._weights is one), this can be implemented as
        return other[self._target_indices]
        # However, for non-unitary weights, we need special treatment. For np.ndarrays, this
        # is something like
        return other[self._target_indices] * self._weights
        # AdArrays are a bit more complicated: In principle we can make a diagonal matrix of
        # the weights and multiply from the left, so
        scaling = sps.dia_array((self._weights, 0), shape=(self._weights.size, self._weights.size))
        return scaling @ other[self._target_indices]
        # This risks being slow, though (profiling is needed). The alternative is to work
        # directly with the sparse data structure in the AdArray, using
        # pp.matrix_operations.rldecode to expand the weight array to the right size
        # (can be different from one row to the next). A rough outline of the code is:
        tmp_mat = other[self._target_indices]
        data_ind = pp.matrix_operations.rldecode(
            np.arange(tmp_mat.jac.shape[0]), np.diff(tmp_mat.jac.indptr)
        )
        tmp_mat.jac.data *= self._weights[data_ind]
        return tmp_mat
        # A final question is whether slicing (all calls above to other[self._target_indices])
        # is sufficiently efficient, or if we should rather construct a new array by exploiting
        # the CSR format in scipy.sparse. Again, profiling will be needed.

    # NB: The other dunder methods (__add__ etc.) should possibly raise an error.

class ProjectionSlicer:

    _target_indices: np.ndarray
    _target_num_rows: int

    _weights: np.ndarray

    def __init__(self, dim: int, mdg: MixedDimensionalGrid, domain: list[pp.Grid]) -> None:
        # self._target_indices and self._target_num_rows are computed here
        ...

    def __matmul__(self, other: pp.ad.AdArray | np.ndarray) -> pp.ad.AdArray | np.ndarray:
        if isinstance(other, np.ndarray):
            result = np.zeros(self._target_num_rows, dtype=other.dtype)
            result[self._target_indices] = other
            return result
        else:  # pp.ad.AdArray
            res_val = np.zeros(self._target_num_rows, dtype=other.val.dtype)
            res_jac = sps.csr_matrix((self._target_num_rows, other.jac.shape[1]))
            res_val[self._target_indices] = other.val
            res_jac[self._target_indices] = other.jac
            return pp.ad.AdArray(res_val, res_jac)
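
As a side note to the row assignment res_jac[self._target_indices] = other.jac above (which scipy flags as inefficient for CSR matrices), prolongation of a sparse Jacobian could alternatively be done by constructing the CSR arrays directly. A standalone sketch (prolong_rows is a hypothetical helper, not PorePy API, and the target rows are assumed sorted):

import numpy as np
import scipy.sparse as sps

def prolong_rows(
    local_jac: sps.csr_matrix, target_rows: np.ndarray, num_global_rows: int
) -> sps.csr_matrix:
    # Place row i of local_jac at row target_rows[i] of a larger matrix; all other
    # rows are empty. Assumes target_rows is sorted in increasing order.
    local_jac = local_jac.tocsr()
    row_nnz_global = np.zeros(num_global_rows, dtype=local_jac.indptr.dtype)
    row_nnz_global[target_rows] = np.diff(local_jac.indptr)
    indptr = np.zeros(num_global_rows + 1, dtype=local_jac.indptr.dtype)
    indptr[1:] = np.cumsum(row_nnz_global)
    return sps.csr_matrix(
        (local_jac.data, local_jac.indices, indptr),
        shape=(num_global_rows, local_jac.shape[1]),
    )

# Quick check against the naive dense construction.
local = sps.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
target = np.array([1, 4])
dense = np.zeros((6, 3))
dense[target] = local.toarray()
assert np.allclose(prolong_rows(local, target, 6).toarray(), dense)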

Comments:

  1. It is not clear whether we need one slicer each for SubdomainProjection, MortarProjection etc.
  2. Similar thoughts may apply to the geometry basis functions in pp.models.geometry.basis; further investigation is needed.
  3. The implementation should change the Ad backend only; no changes to the computational grid.
  4. Testing will be needed.

Regarding the divergence class

The class pp.ad.Divergence represents a differential operator rather than a mapping. This cannot readily be implemented as slicing, and although improvements should be possible there as well, it will not be part of this issue.

Dependencies

This can be done independently of #1179 and #1181, but should not be done in parallel with them.

keileg commented 5 months ago

Some of the functionality in pp.models.geometry may also be suited for this approach.

IvarStefansson commented 5 months ago

Some of the functionality in pp.models.geometry may also be suited for this approach.

That's an interesting comment. Is there any chance this approach can help us with representation of tensors and their products?

keileg commented 5 months ago

Perhaps, it depends on what you mean. I suggest we discuss in person.

keileg commented 1 week ago

Implementing this will require three main steps, each split into separate tasks:

  1. Preparatory steps
  2. Implementation of the new projection operators
  3. Inclusion of the new slicers in the Ad operator system

keileg commented 1 week ago

Benchmarking (to be updated)

Observations from benchmarking using line_profiler (which, by the way, seems to have a substantial overhead with my setup; comparing relative performance rather than absolute numbers should still give some indication).
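
For reference, such a line-by-line profiling run can be set up roughly as in the sketch below (the driver function and the choice of profiled functions are placeholders):

from line_profiler import LineProfiler

def run_benchmark():
    # Placeholder: set up and run flow_benchmark_2d_case3 here.
    ...

profiler = LineProfiler()
# Register the functions whose lines should be timed, e.g. the projection constructors:
# profiler.add_function(SomeProjectionClass.__init__)
profiler(run_benchmark)()
profiler.print_stats()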

Running flow_benchmark_2d_case3, with a grid of 24K cells in the matrix and ~800 mortar cells:

  1. Substantial time is spent in the mortar projections. Less so in the subdomain projections (which is reasonable, since the physics is homogeneous between the subdomains).
  2. The time spent on defining the equations (which includes the projections) is roughly equal to that of the mpfa discretization. There are indications that the mortar projections account for the main part of this construction time, though this is not certain, given the noise in the profiling.
  3. The time for assembly is (for this case) considerably less than that for defining equations and discretizing.
  4. The time for the linear solve is also modest.

Conclusion so far: It seems the proposed projection update can make an impact.

IvarStefansson commented 1 week ago

This may be a stupid question: How will the slicing work for non-matching mortar grids?

keileg commented 1 week ago

This may be a stupid question: How will the slicing work for non-matching mortar grids?

It will be more complicated, involving not only slicing but also weighted sums over rows. I haven't worked out the details, but if it gets too complicated, we may end up keeping the old solution for non-matching grids, at least until we start using these in earnest.
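
Purely as an illustration of what such weighted sums over rows could look like for a plain vector (not worked out for the actual mortar data structures; the index and weight arrays below are made up):

import numpy as np

# Hypothetical non-matching mapping: target cell i receives sum_j w_ij * v[j],
# represented here as flat (target_row, source_col, weight) triplets.
target_rows = np.array([0, 0, 1, 1, 1])
source_cols = np.array([0, 1, 1, 2, 3])
weights = np.array([0.5, 0.5, 0.25, 0.5, 0.25])

v = np.array([1.0, 2.0, 3.0, 4.0])
result = np.zeros(2)
np.add.at(result, target_rows, weights * v[source_cols])
# result = [1.5, 3.0]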

keileg commented 1 week ago

Benchmarking of the actual slicing and reconstruction

There turned out to be two natural ways to implement these operations:

  1. Operating directly on the sparse data structures, as outlined in the initial posting of this issue. Referred to as 'Option 1' below.
  2. Explicitly constructing a restriction/prolongation matrix. Referred to as 'Option 2' below.
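
For concreteness, a standalone sketch of the two options applied to restriction of a sparse matrix (helper names are made up for this comparison; weights are taken as unitary):

import numpy as np
import scipy.sparse as sps

def restrict_option_1(mat: sps.csr_matrix, rows: np.ndarray) -> sps.csr_matrix:
    # Option 1: operate directly on the sparse data structure via row slicing.
    return mat[rows]

def restrict_option_2(mat: sps.csr_matrix, rows: np.ndarray) -> sps.csr_matrix:
    # Option 2: build an explicit restriction matrix and multiply.
    R = sps.csr_matrix(
        (np.ones(rows.size), (np.arange(rows.size), rows)),
        shape=(rows.size, mat.shape[0]),
    )
    return R @ mat

mat = sps.random(10, 10, density=0.3, format="csr")
rows = np.array([2, 5, 7])
assert np.allclose(
    restrict_option_1(mat, rows).toarray(), restrict_option_2(mat, rows).toarray()
)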


The benchmarks:

Two md geometries were considered:

  1. The regular geometry (case 1) from the 2d flow benchmark, with 400, 24K and 2400K cells in total.
  2. Case 3 from the 3d flow benchmarks, with 30K, 150K and 350K cells in total.

We test both restriction to and prolongation from each subdomain. This is representative of both subdomain and mortar projections, under the assumption that identifying the indices to project to and from is treated as a separate task from the mapping itself. By similar reasoning, mapping to sets of domains is also covered.

We comment below on the time for initialization and for the mapping of vectors, sparse matrices and AdArrays. In Option 2, initialization entails construction of a projection matrix, while Option 1 only stores the passed information during initialization. Thus, the relative importance of the initialization cost depends on how many times a restriction/prolongation object is used; currently this is often only once, but future caching (#1181) may change this.
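
A sketch of the kind of timing loop that separates initialization cost from application cost (sizes, repeat counts and the unit-weight restriction are illustrative only):

import time
import numpy as np
import scipy.sparse as sps

def best_time(fn, repeats: int = 10) -> float:
    # Smallest wall-clock time over a few repetitions.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

n_global, n_local = 24_000, 800  # of the order of the coarser 2d benchmark
rows = np.sort(np.random.choice(n_global, size=n_local, replace=False))
vec = np.random.rand(n_global)

# Option 2: initialization builds the restriction matrix, application is a matvec.
def build_matrix():
    return sps.csr_matrix(
        (np.ones(n_local), (np.arange(n_local), rows)), shape=(n_local, n_global)
    )

R = build_matrix()
print("Option 2 init:", best_time(build_matrix), "apply:", best_time(lambda: R @ vec))

# Option 1: initialization just stores the index array, application is a slice.
print("Option 1 init:", best_time(lambda: rows.copy()), "apply:", best_time(lambda: vec[rows]))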

Results

Summarized as follows:

  1. Generally, prolongation is more costly than restriction, reflecting the need to construct and operate on larger vectors and matrices. The difference increases with cell count.
  2. Initialization is fastest with Option 1, with a cost of roughly 2-20% of that for Option 2.
  3. Restricting and prolonging a vector with Option 1 takes 20-100% of the corresponding time for Option 2. For Option 1, the times for vector operations and initialization are of the same order of magnitude (though the operations scale more poorly than the initialization), while for Option 2, initialization can be 1-2 orders of magnitude more costly.
  4. For matrix operations, the scaling is roughly linear for Option 1, while for Option 2 it is in general slightly worse than linear. In total, the cost for Option 1 is about 20-100% of that for Option 2, with the difference increasing with matrix size. Testing on matrices with different numbers of nonzeros per row indicated that Option 1 can be considerably better than Option 2 in some respects (improved prolongation; no difference in behavior for restriction). The cost of initialization is negligible compared to the matrix operations.
  5. For Ad arrays, Option 1 takes 50-100% of the time for Option 2, with better scaling with cell count.

Conclusions

  1. Option 1 is significantly faster than Option 2, in particular for larger grids where the difference matters the most. Option 1 is therefore the preferred method.
  2. The need to stay compatible with non-conforming grids means we will also need Option 2. The projection objects will need to decide which restrictor / prolongator to use based on the provided weights.

Next steps

  1. Implement the new classes for restriction and prolongation as helpers in grid_operators.py.
  2. Refactor the SubdomainProjection to use the new classes. This will entail deciding which indices to use based on input to the methods cell_prolongation() etc. Note that for the SubdomainProjection, the weights will always be unitary. A rough sketch is given after this list.
  3. Introduce tests of the new classes.
  4. Do benchmarking to compare performance with the existing implementation.
  5. Start thinking about the MortarProjection.
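
A rough sketch of how step 2 could look. Class and attribute names beyond cell_restriction/cell_prolongation are made up, and the slicer is a simplified stand-in (unit weights only) for the classes outlined above:

import numpy as np

class _RestrictionSlicer:
    # Minimal stand-in for the RestrictionSlicer outlined earlier (unit weights only).
    def __init__(self, target_indices: np.ndarray) -> None:
        self._target_indices = target_indices

    def __matmul__(self, other: np.ndarray) -> np.ndarray:
        return other[self._target_indices]

class SubdomainProjectionSketch:
    # Hypothetical refactoring: cell_restriction() collects the index ranges of the
    # requested subdomains and hands them to a slicer instead of building a matrix.
    def __init__(self, cell_offsets: dict) -> None:
        # cell_offsets maps a subdomain label to its (start, end) range in the global
        # cell-wise ordering, computed once from the mixed-dimensional grid.
        self._cell_offsets = cell_offsets

    def cell_restriction(self, subdomains: list) -> _RestrictionSlicer:
        indices = np.concatenate(
            [np.arange(*self._cell_offsets[sd]) for sd in subdomains]
        )
        return _RestrictionSlicer(indices)

# Usage with made-up subdomain labels and offsets:
proj = SubdomainProjectionSketch({"matrix": (0, 6), "fracture": (6, 8)})
global_vec = np.arange(8.0)
print(proj.cell_restriction(["fracture"]) @ global_vec)  # [6. 7.]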