trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.21k stars 566 forks source link

Intrepid2 refactor getPhysicalEdge... has a huge overhead on cuda #1151

Closed bathmatt closed 7 years ago

bathmatt commented 7 years ago

So, many routines that get normals and edge tangents call down to

 CellTools<SpT>::
  setSubcellParametrization( subcellParamViewType      &subcellParam,
                             const ordinal_type         subcellDim,
                             const shards::CellTopology parentCell ) {

Which isn't such a big deal except that in this we create some kokkos views


    referenceNodeDataViewType
      v0("CellTools::setSubcellParametrization::v0", Parameters::MaxDimension),
      v1("CellTools::setSubcellParametrization::v1", Parameters::MaxDimension),
      v2("CellTools::setSubcellParametrization::v1", Parameters::MaxDimension),
      v3("CellTools::setSubcellParametrization::v1", Parameters::MaxDimension);

Now, on CudaUVM this takes quite a bit of time. Since we do this once for every element we need to get info from this is a huge amount of time. For a small test this is 2 minutes of the time, vs 6 seconds to fill and solve the matrix that comes out of the system.

Not sure on the best way to resolve this?

You call through to getReferenceVertex which takes a view.

How should we move forward? @kyungjoo-kim @eric-c-cyr @rppawlo @crtrott

bathmatt commented 7 years ago

One suggestion, specialize getReferenceVertex to take a kokkos view or a double* and make the view version to the error checking and call down to the double one and then make v0 a double v0[Parameters::MaxDimension]

eric-c-cyr commented 7 years ago

Alternatively you could have an "expert" usage getReferenceVertex that accesses the static data in a subview way. Basically trading off safety for performance. (I hope I'm understanding the code enough).

E.g. you would add or change the signature of:

  getReferenceNode(       Kokkos::DynRankView<cellNodeValueType,cellNodeProperties...> cellNode,
                    const shards::CellTopology  cell,
                    const ordinal_type          nodeOrd ) ;

to

  Kokkos::DynRankView</*can you add a const here? */ cellNodeValueType,cellNodeProperties...>
  getReferenceNode_expert(
                    const shards::CellTopology  cell,
                    const ordinal_type          nodeOrd ) ;

and then return the subview. This then works its way up through getReferenceVertex

eric-c-cyr commented 7 years ago

A more mild improvement (dimension dependent) would move the allocations under their appropriate condition statement.

eric-c-cyr commented 7 years ago

Also, I'm a little surprised that this date can't be hashed by cell type.

bathmatt commented 7 years ago

Maybe it should return a static const reference to a cell type.

On Sat, Mar 18, 2017 at 11:06 AM, Eric C. Cyr notifications@github.com wrote:

Also, I'm a little surprised that this date can't be hashed by cell type.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/trilinos/Trilinos/issues/1151#issuecomment-287559785, or mute the thread https://github.com/notifications/unsubscribe-auth/AOPDIKEnLJCPRBU5hiukG0QdGdQcekhFks5rnA76gaJpZM4MhCKE .

kyungjoo-kim commented 7 years ago

I expect this problem when I work on cell tools. At that time, I do not want to change API and just convert all functions using dynrankview. Maybe it is time to clean up the interface more useful. We first need to separate whether we want to use a function in a functor or front level (parallel for).

eric-c-cyr commented 7 years ago

@kyungjoo-kim Do we have a timeline for getting a fix?

kyungjoo-kim commented 7 years ago

We need to set up a meeting in person to discuss the problems that @bathmatt found so far. This is not the only problem and I would like to find out design flaws in APIs.

bathmatt commented 7 years ago

Please set this up with myself and @rppawlo

rppawlo commented 7 years ago

This is a large effort. @kyungjoo-kim will set up a meeting to discuss.

kyungjoo-kim commented 7 years ago

@bathmatt

Could you give me your psedo code that causes this performance degrade ?

What I see now from the code are :

Maybe the solution is simple. Do you think that this solve the problem ? Anyway I want to see a big picture how these functions are used in your code and will better be used in the future code.

bathmatt commented 7 years ago

It is from this function that they are called. We get a normal for face on a sideset, looping over the faces.

  Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normal, iv.jac.get_view(), *sideID, *(ir->topology));

File ./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp

kyungjoo-kim commented 7 years ago

@rppawlo @bathmatt

There are some problems in Panzer_STK_SurfaceNodeNormals.cpp.
    for ( ; sideID != localSideTopoIDs.end(); ++side,++sideID,++parentElement) {
// blah .... blah ...
      Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normal, iv.jac.get_view(), *sideID, *(ir->topology));
}

This is not an ideal case of use the front interface where it evaluate a "single" normal. All front interface calls internally parallel_for, which means that for a single normal evaluation many times of kernel launch are made.

This probably requires another serial interface that evaluates a normal for a single point or multiple points associated with a single cell. I can give you an implementation to see if this can remove the overhead.

Is this the only function that brings such overhead ? Other code that uses tangent or normal ?

rppawlo commented 7 years ago

There are some other calls to normals/tangents. Below is a quick search of panzer.

[rppawlo@gge panzer]$ find . -name '*pp' | xargs grep CellTool
./dof-mgr/src/Panzer_NodalFieldPattern.cpp:#include "Intrepid2_CellTools.hpp"
./dof-mgr/src/Panzer_IntrepidFieldPattern.cpp:#include "Intrepid2_CellTools.hpp"
./dof-mgr/src/Panzer_IntrepidFieldPattern.cpp:      Intrepid2::CellTools<PHX::Device> cellTools;
./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp:#include "Intrepid2_CellTools.hpp"
./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp:      Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normal, iv.jac.get_view(), *sideID, *(ir->topology));
./disc-fe/src/Panzer_IntegrationValues2.cpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/Panzer_IntegrationValues2.cpp:    Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/Panzer_IntegrationValues2.cpp:    Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/Panzer_IntegrationValues2.cpp:    Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/responses/Panzer_ResponseScatterEvaluator_Probe_impl.hpp:  typedef Intrepid2::CellTools<PHX::exec_space> CTD;
./disc-fe/src/Panzer_Workset_Builder_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/Panzer_PointValues2_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_GatherTangents_impl.hpp:      Intrepid2::CellTools<PHX::exec_space>::getReferenceEdgeTangent(refEdgeTan_local, i, parentCell);
./disc-fe/src/evaluators/Panzer_Normals_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_Normals_impl.hpp:    Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normals.get_view(),
./disc-fe/src/evaluators/Panzer_ProjectToEdges_impl.hpp:      Intrepid2::CellTools<PHX::exec_space>::getReferenceEdgeTangent(refEdgeTan_local, i, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToEdges_impl.hpp:        Intrepid2::CellTools<PHX::exec_space>::mapToReferenceSubcell(refQuadPts, quadPts, subcell_dim, p, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToEdges_impl.hpp:        Intrepid2::CellTools<PHX::exec_space>::setJacobian(jacobianSide, refQuadPts, physicalNodes, parentCell);
./disc-fe/src/evaluators/Panzer_GatherNormals_impl.hpp:    Intrepid2::CellTools<PHX::exec_space>::getReferenceFaceTangents(refTanU, refTanV, i, parentCell);
./disc-fe/src/evaluators/Panzer_GatherNormals_impl.hpp:  // then computing the normal: see Intrepid2_CellToolsDef.hpp.
./disc-fe/src/evaluators/Panzer_ProjectToFaces_impl.hpp:      Intrepid2::CellTools<PHX::exec_space>::getReferenceFaceTangents(refTanU, refTanV, i, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToFaces_impl.hpp:        Intrepid2::CellTools<PHX::exec_space>::mapToReferenceSubcell(refQuadPts, quadPts, subcell_dim, p, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToFaces_impl.hpp:        Intrepid2::CellTools<PHX::exec_space>::setJacobian(jacobianSide, refQuadPts, physicalNodes, parentCell);
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_EdgeBasis_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_EdgeBasis_impl.hpp:    Intrepid2::CellTools<PHX::exec_space>::getPhysicalEdgeTangents(edgeTan,
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_EdgeBasis_impl.hpp:      Intrepid2::CellTools<PHX::exec_space>::getReferenceEdgeTangent(refEdgeTan_local, i, parentCell);
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp:    Intrepid2::CellTools<PHX::exec_space>::getPhysicalFaceNormals(faceNormal,
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp:    Intrepid2::CellTools<ScalarT>::getPhysicalFaceNormals(faceNormal,
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp:      Intrepid2::CellTools<double>::getReferenceFaceNormal(refFaceNormal_local, i, parentCell);
./disc-fe/src/Panzer_BasisValues2.cpp:    Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/Panzer_PointValues2.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/Panzer_PointValues2.hpp:    Intrepid2::CellTools<PHX::exec_space> cell_tools;
[rppawlo@gge panzer]$ cd ..
kyungjoo-kim commented 7 years ago

@rppawlo

Except for ./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp , all panzer functions use the Intrepid2 appropriately. Then, I will give you a serial version that does not invoke any parallel for inside. This will remove the overhead.

Best Kyungjoo

rppawlo commented 7 years ago

Thanks @kyungjoo-kim !

kyungjoo-kim commented 7 years ago

@bathmatt

I tested it on cuda and see it working (I dont know how well it used to be work so I won't be able to confirm that this modification resolve your issue). Could you review the code and if this resolves this overhead issue ?

bathmatt commented 7 years ago

I'm compiling now, will try to let you know tomorrow.

bathmatt commented 7 years ago

Just ran all the panzer tests in 3.5 minutes, much better.. Thanks!