Decoupling of assembler and cells

Psirus commented 6 years ago

What's the problem?

Currently, the assemble has functions like

GlobalDofVector BuildVector(const Group<CellInterface>& cells, CellInterface::VectorFunction f) const;

and in there does things like calling cell.Integrate(f). These things (cells, functions-to-be-integrated) are not really part of assembly. Ideally it would just get a vector (matrix) and the numbering and insert them it the appropriate place into the global matrix.

Problems:

Does more than it should: assembly + call cell.Integrate()
Unnecessary dependency on Group or CellInterface
Can't assemble matrices that don't originate from cells
Modifying the local matrix before assembly leads to duplicate code (BuildDiagonallyLumpedMatrix)

Proposal

Instead of writing the cells and functions directly at the assembler, the assembler would get a generator that will return vector (matrix) + numbering when asked. The assembler does not, and should not, care how these are generated. Working with functions and cells could be a derived class / implementation of a generator base / concept. If additional transformations are to be applied to a local output (mass lumping, condensation), these can be performed in the generator.

Code

The return type of such a generator will be assumed to be using Entry = std::pair<Eigen::VectorXd, Eigen::VectorXi>;. A simple implementation could be

class EntryGenerator
{
public:
    virtual Entry next() = 0;
    virtual bool isValid() = 0;
};

which would allow the assembler to do something like this

while(generator.isValid())
{
    Entry entry = generator.next();
    globalVec[entry.second] = entry.first;
}

This is nice because it is dead simple. Another approach, for fans of syntactic sugar and range-based for loops:

class EntryGenerator
{
public:
    virtual EntryIterator begin() = 0;
    virtual EntryIterator end() = 0;
    // ...
};

could result in nice syntax like this

for (Entry entry : generator)
{
    globalVec[entry.second] = entry.first;
}

However, this results in a bit of code for the iterators. A complete example can be seen here.

Questions

Do you like the idea?
What would an Entry look like?
Dead simple generator vs iterators?
Dynamic vs static polymorphism of generators?
<insert-your-concern-here>

pmueller2 commented 6 years ago

Good! If I understand it correctly, based on the current implementation of the assembler: the loops over cells and dof types are then done via generator.next(). And the entry the generator returns is the result of cell.integrate(f)[someDof] and cell.DofNumbering(someDof), yes?

joergfunger commented 6 years ago

I totally agree, the assembler should only assemble a matrix/vector without performing the loop over the cells. We should however discuss, where exactly this class is stored. IMO, this belongs to the problem (see for example TimeDependentProblem).

TTitscher commented 6 years ago

I think there can be nothing bad be found about this idea. Splitting tasks and dependencies is great! Some thoughts:

In your code example - which I honestly find quite complicated (will .isValid() and .next() be easier?) - the fact of the lazy evaluation is IMO not present (or I really misunderstand the code, sorry then). It seems like the entry CellEntryGenerator stores all its entries in a vector. This is not feasible for e.g. all dense element matrices. For a better understanding, could you maybe adapt your example to take a bunch of objects (-> our cell group) and a function (-> our MatrixFunction) and evaluates them, when they are needed?
In OldNuTo, this was the only loop that is OpenMP parallel. Do we want to think about that as well? It may require the iterator to be a random access iterator. No idea how that will work. But maybe: YAGNI when we focus on MPI parallelization.

Would it be nice to have this whole loop internal, like std::transform? We may end up with

GlobalSparseMatrix m;
auto ToGlobalMatrix = [&](Entry& e)
{
NuTo::Assemble(m, e);
// or 
NuTo::Assemble(m, MakeLumpled(e));
};
NuTo::LazyEvaluateLoop(cells.begin(), cells.end(), HessianFunction, ToGlobalMatrix);

(We could then add something like an execution policy for OpenMP)

Where would we put this code? In each main file? Only internally - e.g. wrapped in an Assembler class? This may also influence the design.

Psirus commented 6 years ago

The generator stores these just for the example. I didn't want to write groups, cells and functions. Seems this was confusing, sorry for that. Of course I'm not proposing to store all the local matrices.
Personally: YAGNI. If anyone is very interested in parallelization and sees a problem with this idea: I'm not set on these generators, I just want the decoupling.
Looks good as well. A more complete example would be interesting.

At first, I would just put it in the assembler:


// in main / EquationSystem / TimeDependentProblem / WhateverWrapperIsHipRightNow
CellEntryGenerator gen(cells, gradientFunction);
Assembler assembler(gen);

// in Assembler GlobalVector globalVec; for (Entry entry : mGenerator) { globalVec[entry.second] = entry.first; }


Refactoring it out can happen in a second step if necessary.

TTitscher commented 6 years ago

Another IMO huge advantage could be a decoupling from CellInterface. On first glance, we could have a TimeDependentCell with methods like Gradient(t, dt) and Hessian(t, dt). This cell does not necessarily implement CellInterface. With corresponding entry generators like GraidentGenerator(cells, t, dt) and HessianGenerator(cells, t, dt) we could still assemble those.

TTitscher commented 6 years ago

I added some template stuff to the prototype of @Psirus. The main advantage: The methods begin() and end() that are the same for all classes, are now defined in the common interface. Someone, who wants to create his own generator, only needs to implement Next(), IsValid(), Get(). Code can be found here.

TTitscher commented 6 years ago

I implemented a version of that in this branch. There is, e.g. a DofMatrixGenerator that is the base class of CellMatrixEntries (Naming...?). Things I noticed:

For now, we have to pass via nonconst reference, which is a bit bad.
- Why reference? Since DofMatrixGenerator is an abstract class, we cannot pass by value.
- Why nonconst? Because DofMatrixGenerator::Get() or DofMatrixGenerator::begin() are not const. Get(), for example, modifies the state. (We could make everything mutable...)
- Why is this a problem? We cannot pass temporaries in the assembler method and always need some extra line that declare the generator.
For now, we cannot reuse the generator. Once IsValid() is false, it is useless. Especially, in combination of having to pass it via reference, this is ugly. Introducing a Reset() may solve the problem. But when is that called? At the beginning of the assembly? Manually? Or have a Clone() method instead that returns a new valid instance?
For now, there is CellMatrixEntries, CellVectorEntries, CellLumpedMatrixEntries that all work on the CellInterface and share a lot of code. But I cannot think of an easy refactoring.
For now, we have all this template code that is only specialized for MatrixEntry and VectorEntry. So I feel like moving it from base to mechanics and specialize it for those types in a source file. Why mechanics? Because both entry types need mechanics/dofs/DofVector. [easy fix...]

TTitscher commented 6 years ago

The recent approach was to have some iterator class that performs the loop. So the assembler assembles only matrices and vectors but still performs the loop over all cells. I think, splitting that into features

loop over cells (or something else that generates element contributions)
adding an element contribution to a global matrix/vector

may simplify this a lot.

Point 1) may result in a very short (templates may be viable) class/function like

//! TFunction would return an equivalent of std::pair(Cell.Integrate(...), Cell.DofNumbering(...)
template <typename TBegin, typename TEnd, typename TFunction>
GlobalMatrix Assemble(TBegin begin, TEnd end, TFunction f, /*dof info, dof types...*/)
{
   MatrixAssembler m;
   for (;begin != end; ++begin)
      m.Add(f(*begin));
   return m.Get();  
}

Or, for simplicity, be specialized for certain iterator types and TFunction.

For point 2) we can even think of the old NuTo way of moving the assembly to the global types like

GlobalMatrix::Add(DofMatrix<double>& elemOutput, DofMatrix<int>& dofNumbers);
GlobalVector::Add(DofVector<double>& elemOutput, DofMatrix<int>& dofNumbers);
GlobalVector::AddLumped(DofMatrix<double>& elemOutput, DofMatrix<int>& dofNumbers);

Or we put that into classes like MatrixAssembler/VectorAssembler/LumpedAssembler (as in the Assemble(..) code above) . This may make reusing of exising sparsity patterns easier.

Psirus commented 6 years ago

We can split that into two, but I'm not sure how much we gain by that. Specifically, to the points in your post above:

the DofGenerator can not be a temporary
it can't be reused

still hold. So while I think splitting is fine, that wasn't "the problem", right?

As for 2): Reusing a sparsity pattern in the assembler seems a good idea. Making the vectors/matrices assemble themselves - I personally don't like that; should not be their concern imho.

TTitscher commented 6 years ago

I was thinking about the assembler that really assembles vectors/matrices. And I would like to implement that as soon as the RemoveJ/K team #164 finished. I am especially interested in reusing the sparsity patterns.

Basic interface

MatrixAssembler::MatrixAssembler(DofInfo = {});
void MatrixAssembler::Resize(DofInfo);
void MatrixAssembler::Add(DofMatrix<double> contribution, DofMatrix<int> numbering);
const DofMatrixSparse& MatrixAssembler::Get() const;

Internally, I would probably use vectors of Eigen::Triplet to efficiently build the matrix. In the Get() method, they would be transferred via setFromTriplets to the real sparse matrices (const attribute questionable). This would cover the current features and is decoupled from CellInterface. We can also think of an LumpedMatrixAssembler or have

void MatrixAssembler::AddLumped(DofVector<double>, DofVector<int>);

Reuse nonzeros

As a new feature, we can try to reuse the sparsity. So once the nonzeros are known - like after the first assembly run - we can assemble directly with coeffRef(), instead of the triplet lists. This is supposed to be slightly faster, more memory efficient and avoids the memory allocations. But I am not sure about the interface.

1) My idea is to have a bool mFinished flag together with a void Finalize/Finish/Complete/MakeCompressed() method. Naming help needed. So we expect the following flow:

MatrixAssembler assembler(); // mFinished = false;
assembler.Resize(dofInfo);

// first evaluation
for (auto cell : cells)
    assembler.Add(cell.Integrate(f), cell.DofNumbering()); // builds into triplets since mFinished == false;
// DoStuff(assembler.Get()); throws, since mFinished == false
// Or Get() internally calls Finish() automatically...
assembler.Finish(); // mFinished = true, converts triplets to matrix, deallocates triplets.
DoStuff(assembler.Get()); // fine now

// second/third/... evaluation
assembler.SetZero();
for (auto cell : cells)
    assembler.Add(cell.Integrate(f), cell.DofNumbering()); // builds directly into matrix since mFinished == true;
// no need to call assembler.Finish();
DoStuff(assembler.Get());

2) Another idea is to allocate nonzeros before using

for (auto cell : cells)
    assembler.AllocateNonzeros(cell.DofNumbering()); // into triplets;

3) Or we introduce a method that copies the nonzeros, e.g. an overload

MatrixAssembler::Resize(DofMatrixSparse other);

4) Or we even introduce two separate assemblers

DirectAssember assember(tripletListAssember.Get());

5) Or, since the coeffRef approach does not require members, we can provide free functions for that.

auto hessian = tripletAssembler.Get();
//...
hessian.SetZero();
for (auto cell : cells)
    Assemble(hessian, cell.Integrate(f), cell.DofNumbering());

But I would prefer the first (automatic using mFinished) or the last (no mFinished, assembler class == triplet lists, free function == coeffRef) solution.

Parallel assembly

The challenge is that the loop over the cells is now done outside of the assembler. Once the nonzeros are known, there is no reallocation involved. A simple mesh coloring (= MaximumIndependentSets) eliminates race conditions on parallel writes into the matrix. The TripletList assembly is not that trivial. We could allocate into numThreads separate assemblers and add up the matrices. Or we concatenate the separate triplet lists. Or we have a separate ParallelAssembler that deals with that. Or we postpone that.

Discussion

Do you like the proposed interface?
- Naming intuitive enough?
When/How to deal with parallel assembly?

nutofem / nuto