Tpetra: How do I optimally fill a `FECrsGraph` if I've precomputed the graph?

kubagalecki commented 1 year ago

Filling an a priori known `FECrsGraph`

Tagging @csiefer2 as the package owner

Hi, I have a performance question regarding the fill of a precomputed FECrsGraph. I have precomputed all the data for the fill of a sparse graph, including the column map and all the (locally indexed) column entries for each row. I would now like to pass that information on to FECrsGraph. If I were creating a CrsGraph, I'd just use the constructor taking the row pointers and column indices: https://github.com/trilinos/Trilinos/blob/d1f042da58a3e1de8bb198a5bfc3821608db625b/packages/tpetra/core/src/Tpetra_CrsGraph_decl.hpp#L486-L490

Instead, since such a constructor is not available, I have to fill the graph row by row by calling insertLocalIndices, along the lines of:

for (local_index_t local_row = 0; local_row != n_local_rows; ++local_row)
{
    std::span<const local_index_t> cols_view = my_graph.getRow(local_row); // These are sorted and unique
    tpetra_graph->insertLocalIndices(local_row, asTeuchosView(cols_view));
    ++local_row;
}

This results in a horrific performance hit, since, insertLocalIndices checks for duplicates etc. By horrific I mean it takes around 3 orders of magnitude more to insert the entries into the graph than it takes to compute it (i.e. precompute my_graph from the snippet above). Down the line, this means I'm taking more time initializing my algebraic problem than solving it. To be clear, I'm not knocking on insertLocalIndices, it's just not the right tool for the job.

Ideally, I'd like to call FECrsGraph::setAllIndices. Unfortunately, when I do, I get the following error:

Throw test that evaluated to true: ((this->lclIndsUnpacked_wdv.extent (0) != 0 || this->gblInds_wdv.extent (0) != 0))

Tpetra::FECrsGraph<int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::OpenMP, Kokkos::HostSpace> >::setAllIndices: You may not call this method if 1-D data structures are already allocated.

My specific questions are as follows:

How can I construct a FECrsGraph so that I can call setAllIndices on it?
If the answer to Q1 is "you can't", given the pattern above (insertion never overlaps), is it safe to parallelize the for loop from the snippet above? This would alleviate the performance issue in a big way. I've tested it and it seems to work, but obviously multi-threading is not an area where "hey this seems to work" is a sufficient proof of correctness.
If the answer to Q1 is "you can't", would it be possible to implement a constructor in FECrsGraph similar to the one cited above for CrsGraph? If it is possible from a technical perspective, but your team does not have the bandwidth to do it, would you please consider giving me some pointers so that I could contribute such a constructor myself?
As an alternative solution, would you consider adding to FECrsGraph an "expert level" member function insertLocalRowSortedUniqueEntries (or one with a less awful name), which would avoid checking for duplicates etc.? This would also resolve my performance problem, as it would essentially turn the above code snippet into a glorified memcpy. Again, I'd consider contributing if you don't have the bandwidth. For easy reference, the implementation details of insertion as they currently stand can be found below. As you can see it does all sorts of things to handle duplicates, including constructing an STL hashmap as a lookup table:

https://github.com/trilinos/Trilinos/blob/d1f042da58a3e1de8bb198a5bfc3821608db625b/packages/tpetra/core/src/Tpetra_Details_crsUtils.hpp#L410-L506

Thanks in advance for your time, I appreciate the work y'all do!

csiefer2 commented 1 year ago

@kubagalecki @brian-kelley is working on an "easy" interface for on-node graph assembly. That might be a good choice for what you're trying to do. It also might not be what you want, if you're doing that all that work yourself anyway.

I would also add that the FECrsGraph contains data for any off-rank entries in off-rank rows you want to insert as well as the on-rank ones -- that's special sauce for the "FE" part. If you have all of that and you're sure you got it right, then a setAllIndices function would be what you want. And if you have all the ownedPlusShared stuff done in one big data structure, you can use the base class setAllIndices function. Which should work (though I suspect we don't test it so it might not).

Update: We don't test it and it doesn't work. We can look into it.

kubagalecki commented 1 year ago

@csiefer2 thanks for getting back to me so quickly. The reason I need a FECrsGraph is that I want to construct a FECrsMatrix, which I happily use for off-node communication. However, I can more efficiently compute the graph (i.e. the owned and owned+shared row maps, the column map, and all column entries) in my own code. I just need setAllIndices to efficiently pass that information to the FECrsGraph. I'd greatly appreciate if you could look into getting it to work. Should I submit a separate bug report?

In the mean time, calling insertLocalIndices in parallel could really help me out. I'll ask the question in full for better google-ability:

Question: is insertLocalIndices thread-safe, assuming all insertions are done in different rows, and sufficient memory has been allocated? Additionally we can assume the column entries passed are sorted and unique, but I don't think that's relevant for thread-safety.

Again, thank you for your help, I really appreciate it!

Details of my use-case (please feel free to ignore)

I've taken the liberty of providing a more detailed description of what I'm doing below, please feel free to ignore it, but maybe you collect use-cases for admin-related purposes or something. I'm working on a finite element code. Without getting into too much detail, here are a few relevant features: - It's high order, meaning there will be a large number of non-zero entries in each row of the graph, possibly numbering in the thousands - It leverages thread-parallelism (hopefully GPUs in the future, though graph fill happens on host so that's irrelevant for this problem), so I have fewer MPI ranks owning larger portions of the graph. - Nodes in the mesh can have more than one DOF, and different nodes possibly have different numbers of DOFs. This is part of the reason that I can compute the row and column maps more efficiently than Tpetra, since a lot of the information can be communicated in terms of nodes, not DOFs. - My graph fill algorithm is more efficient than Tpetra's (for my case), because I've made different trade-offs (see, e.g., #4279). The algorithm is as follows, in case that's relevant: 0. Compute the row and column maps for the graph. 1. Overallocate memory for each row. 2. In parallel (in terms of elements), append the column entries to each row. Because we're only appending, this can be done in a lock-free fashion. At the end, the entries for the graph form a contiguous memory region. 3. Sort and deduplicate each row's entries at the end. This is trivially parallel. The entries are no longer contiguous in memory. 4. If needed (e.g. for `setAllIndices`), "condense" the deduplicated entries so that they form a contiguous memory region. 5. Pass the computed graph to FECrsGraph. Currently this is done by sequentially calling `insertLocalIndices` for each row 6. Create a FECrsMatrix and FEMultiVector based on the FECrsGraph 7. Do finite element stuff :) ### Small example benchmark *run on my 12-core desktop, the discrepancy gets worse for larger cases run on a cluster* Setup: - 6x6x6 cube of 6th order hex elements - static condensation of the internal nodes - 4 DOFs per each node - Resulting matrix is 94,612-by-94,612 with 143,849,104 non-zero elements - 1 MPI rank Results: Step(s) | Time --- |--- 1-3 | 0.243s 5 | 77.79s 7 | 106.28s As you can see, passing the graph data to FECrsGraph is by far the bottleneck. In fact, depending on the setup, it can sometimes take longer than the downstream algebraic solve.

csiefer2 commented 12 months ago

Question: is insertLocalIndices thread-safe, assuming all insertions are done in different rows, and sufficient memory has been allocated? Additionally we can assume the column entries passed are sorted and unique, but I don't think that's relevant for thread-safety.

Under the "no clashes within a row" assumption, probably. We don't test it though. You will need to enable thread safe RCPs if you're going to try to use them w/i an OpenMP parallel region.

csiefer2 commented 12 months ago

@brian-kelley 's fast graph assembly stuff might also be an option. And that actually gets tested.

trilinos / Trilinos