trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

MueLu: memory overflow in BlockedRepartition #11883

Open jhux2 opened 1 year ago

jhux2 commented 1 year ago

Bug Report

@trilinos/muelu

Test "BlockedRAPFactoryWithDiagonal" has a memory problem, according to the Trilinos memory sanitizer.

https://trilinos-cdash.sandia.gov/viewDynamicAnalysisFile.php?id=52112

github-actions[bot] commented 1 year ago

Automatic mention of the @trilinos/muelu team

GrahamBenHarper commented 1 year ago

The report points nicely to packages/muelu/src/Misc/MueLu_CoordinatesTransferFactory_def.hpp:246:44 in MueLu::CoordinatesTransferFactory >::Build(MueLu::Level&, MueLu::Level&) const::'lambda'(int)::operator()(int) const, which is the sum += line in this loop:

        for (size_t j = 0; j < dim; j++) {
          Kokkos::parallel_for("MueLu:CoordinatesTransferF:Build:coord", Kokkos::RangePolicy<local_ordinal_type, execution_space>(0, numAggs),
                                KOKKOS_LAMBDA(const LO i) {
                                  // A row in this graph represents all node ids in the aggregate
                                  // Therefore, averaging is very easy

                                  auto aggregate = aggGraph.rowConst(i);

                                  typename Teuchos::ScalarTraits<Scalar>::magnitudeType sum = 0.0; // do not use Scalar here (Stokhos)
                                  for (size_t colID = 0; colID < static_cast<size_t>(aggregate.length); colID++)
                                    sum += fineCoordsRandomView(aggregate(colID),j);

                                  coarseCoordsView(i,j) = sum / aggregate.length;
                                });
        }

I suppose there's an issue with blocking/striding/etc in CoordinatesTransferFactory?

Also, it looks like it immediately exits out of the unit tests once it detects that overflow. I have a feeling that fixing this one may reveal subsequent issues in the remaining unit tests that were skipped.
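One way to narrow this down would be to check the aggregate graph against the coordinate view before averaging. The sketch below is plain C++ without Kokkos, and it assumes (this is not confirmed by the sanitizer report) that the overflow comes from an aggregate holding a node id that is larger than the number of rows in the fine coordinates, as could happen if the ids are per-DOF while the coordinates are per-node. The names `Coords`, `firstBadAggregate`, and `averageCoords` are hypothetical stand-ins, not MueLu API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for fineCoordsRandomView: row-major, numNodes x dim.
struct Coords {
  std::size_t numNodes;
  std::size_t dim;
  std::vector<double> data;  // expected size: numNodes * dim
  double operator()(std::size_t node, std::size_t j) const {
    return data[node * dim + j];
  }
};

// Returns the index of the first aggregate that references a node id outside
// [0, numNodes), or -1 if every id is in range.
long firstBadAggregate(const std::vector<std::vector<int>>& aggs,
                       std::size_t numNodes) {
  for (std::size_t i = 0; i < aggs.size(); ++i) {
    for (int node : aggs[i]) {
      if (node < 0 || static_cast<std::size_t>(node) >= numNodes) {
        return static_cast<long>(i);
      }
    }
  }
  return -1;
}

// Same averaging as the quoted loop, but it refuses to read out of bounds.
// The result is row-major, aggs.size() x dim.
std::vector<double> averageCoords(const std::vector<std::vector<int>>& aggs,
                                  const Coords& fine) {
  assert(fine.data.size() == fine.numNodes * fine.dim);
  if (firstBadAggregate(aggs, fine.numNodes) != -1) {
    throw std::out_of_range("aggregate references a node id beyond the fine coordinates");
  }
  std::vector<double> coarse(aggs.size() * fine.dim, 0.0);
  for (std::size_t i = 0; i < aggs.size(); ++i) {
    if (aggs[i].empty()) continue;  // avoid dividing by zero; entry stays 0.0
    for (std::size_t j = 0; j < fine.dim; ++j) {
      double sum = 0.0;
      for (int node : aggs[i]) sum += fine(static_cast<std::size_t>(node), j);
      coarse[i * fine.dim + j] = sum / static_cast<double>(aggs[i].size());
    }
  }
  return coarse;
}
```

If an equivalent check on the real aggGraph and fineCoordsRandomView extents reports a bad aggregate in the blocked test, that would support the blocking/striding guess; if it does not, the out-of-bounds access probably has a different cause.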

github-actions[bot] commented 5 months ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

mayrmt commented 5 months ago

We probably want to keep this open.