Closed bathmatt closed 7 years ago
One suggestion, specialize getReferenceVertex to take a kokkos view or a double* and make the view version to the error checking and call down to the double one and then make v0 a double v0[Parameters::MaxDimension]
Alternatively you could have an "expert" usage getReferenceVertex that accesses the static data in a subview way. Basically trading off safety for performance. (I hope I'm understanding the code enough).
E.g. you would add or change the signature of:
getReferenceNode( Kokkos::DynRankView<cellNodeValueType,cellNodeProperties...> cellNode,
const shards::CellTopology cell,
const ordinal_type nodeOrd ) ;
to
Kokkos::DynRankView</*can you add a const here? */ cellNodeValueType,cellNodeProperties...>
getReferenceNode_expert(
const shards::CellTopology cell,
const ordinal_type nodeOrd ) ;
and then return the subview. This then works its way up through getReferenceVertex
A more mild improvement (dimension dependent) would move the allocations under their appropriate condition statement.
Also, I'm a little surprised that this date can't be hashed by cell type.
Maybe it should return a static const reference to a cell type.
On Sat, Mar 18, 2017 at 11:06 AM, Eric C. Cyr notifications@github.com wrote:
Also, I'm a little surprised that this date can't be hashed by cell type.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/trilinos/Trilinos/issues/1151#issuecomment-287559785, or mute the thread https://github.com/notifications/unsubscribe-auth/AOPDIKEnLJCPRBU5hiukG0QdGdQcekhFks5rnA76gaJpZM4MhCKE .
I expect this problem when I work on cell tools. At that time, I do not want to change API and just convert all functions using dynrankview. Maybe it is time to clean up the interface more useful. We first need to separate whether we want to use a function in a functor or front level (parallel for).
@kyungjoo-kim Do we have a timeline for getting a fix?
We need to set up a meeting in person to discuss the problems that @bathmatt found so far. This is not the only problem and I would like to find out design flaws in APIs.
Please set this up with myself and @rppawlo
This is a large effort. @kyungjoo-kim will set up a meeting to discuss.
@bathmatt
Could you give me your psedo code that causes this performance degrade ?
What I see now from the code are :
Maybe the solution is simple. Do you think that this solve the problem ? Anyway I want to see a big picture how these functions are used in your code and will better be used in the future code.
It is from this function that they are called. We get a normal for face on a sideset, looping over the faces.
Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normal, iv.jac.get_view(), *sideID, *(ir->topology));
File ./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp
@rppawlo @bathmatt
There are some problems in Panzer_STK_SurfaceNodeNormals.cpp.
for ( ; sideID != localSideTopoIDs.end(); ++side,++sideID,++parentElement) {
// blah .... blah ...
Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normal, iv.jac.get_view(), *sideID, *(ir->topology));
}
This is not an ideal case of use the front interface where it evaluate a "single" normal. All front interface calls internally parallel_for, which means that for a single normal evaluation many times of kernel launch are made.
This probably requires another serial interface that evaluates a normal for a single point or multiple points associated with a single cell. I can give you an implementation to see if this can remove the overhead.
Is this the only function that brings such overhead ? Other code that uses tangent or normal ?
There are some other calls to normals/tangents. Below is a quick search of panzer.
[rppawlo@gge panzer]$ find . -name '*pp' | xargs grep CellTool
./dof-mgr/src/Panzer_NodalFieldPattern.cpp:#include "Intrepid2_CellTools.hpp"
./dof-mgr/src/Panzer_IntrepidFieldPattern.cpp:#include "Intrepid2_CellTools.hpp"
./dof-mgr/src/Panzer_IntrepidFieldPattern.cpp: Intrepid2::CellTools<PHX::Device> cellTools;
./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp:#include "Intrepid2_CellTools.hpp"
./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp: Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normal, iv.jac.get_view(), *sideID, *(ir->topology));
./disc-fe/src/Panzer_IntegrationValues2.cpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/Panzer_IntegrationValues2.cpp: Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/Panzer_IntegrationValues2.cpp: Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/Panzer_IntegrationValues2.cpp: Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/responses/Panzer_ResponseScatterEvaluator_Probe_impl.hpp: typedef Intrepid2::CellTools<PHX::exec_space> CTD;
./disc-fe/src/Panzer_Workset_Builder_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/Panzer_PointValues2_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_GatherTangents_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getReferenceEdgeTangent(refEdgeTan_local, i, parentCell);
./disc-fe/src/evaluators/Panzer_Normals_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_Normals_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getPhysicalSideNormals(normals.get_view(),
./disc-fe/src/evaluators/Panzer_ProjectToEdges_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getReferenceEdgeTangent(refEdgeTan_local, i, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToEdges_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::mapToReferenceSubcell(refQuadPts, quadPts, subcell_dim, p, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToEdges_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::setJacobian(jacobianSide, refQuadPts, physicalNodes, parentCell);
./disc-fe/src/evaluators/Panzer_GatherNormals_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getReferenceFaceTangents(refTanU, refTanV, i, parentCell);
./disc-fe/src/evaluators/Panzer_GatherNormals_impl.hpp: // then computing the normal: see Intrepid2_CellToolsDef.hpp.
./disc-fe/src/evaluators/Panzer_ProjectToFaces_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getReferenceFaceTangents(refTanU, refTanV, i, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToFaces_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::mapToReferenceSubcell(refQuadPts, quadPts, subcell_dim, p, parentCell);
./disc-fe/src/evaluators/Panzer_ProjectToFaces_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::setJacobian(jacobianSide, refQuadPts, physicalNodes, parentCell);
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_EdgeBasis_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_EdgeBasis_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getPhysicalEdgeTangents(edgeTan,
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_EdgeBasis_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getReferenceEdgeTangent(refEdgeTan_local, i, parentCell);
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp: Intrepid2::CellTools<PHX::exec_space>::getPhysicalFaceNormals(faceNormal,
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp: Intrepid2::CellTools<ScalarT>::getPhysicalFaceNormals(faceNormal,
./disc-fe/src/evaluators/Panzer_Dirichlet_Residual_FaceBasis_impl.hpp: Intrepid2::CellTools<double>::getReferenceFaceNormal(refFaceNormal_local, i, parentCell);
./disc-fe/src/Panzer_BasisValues2.cpp: Intrepid2::CellTools<PHX::Device::execution_space> cell_tools;
./disc-fe/src/Panzer_PointValues2.hpp:#include "Intrepid2_CellTools.hpp"
./disc-fe/src/Panzer_PointValues2.hpp: Intrepid2::CellTools<PHX::exec_space> cell_tools;
[rppawlo@gge panzer]$ cd ..
@rppawlo
Except for ./adapters-stk/src/Panzer_STK_SurfaceNodeNormals.cpp
, all panzer functions use the Intrepid2 appropriately. Then, I will give you a serial version that does not invoke any parallel for inside. This will remove the overhead.
Best Kyungjoo
Thanks @kyungjoo-kim !
@bathmatt
I tested it on cuda and see it working (I dont know how well it used to be work so I won't be able to confirm that this modification resolve your issue). Could you review the code and if this resolves this overhead issue ?
I'm compiling now, will try to let you know tomorrow.
Just ran all the panzer tests in 3.5 minutes, much better.. Thanks!
So, many routines that get normals and edge tangents call down to
Which isn't such a big deal except that in this we create some kokkos views
Now, on CudaUVM this takes quite a bit of time. Since we do this once for every element we need to get info from this is a huge amount of time. For a small test this is 2 minutes of the time, vs 6 seconds to fill and solve the matrix that comes out of the system.
Not sure on the best way to resolve this?
You call through to getReferenceVertex which takes a view.
How should we move forward? @kyungjoo-kim @eric-c-cyr @rppawlo @crtrott