Face gradient caching and FV memory cost

GiudGiud commented 3 years ago

Reason

As raised in #17874, INSFV uses a lot more memory than other NS solvers. Profiling shows that face gradient caching is the primary culprit. Following a discussion with @lindsayad @rwcarlsen and @snschune, we figured that we could heavily reduce face gradient caching by disabling it for velocities, since the cached gradients are only used by the diffusion kernel iirc.

Design

A boolean to enable or disable gradient caching in the input.

No caching, and a reference to a threaded vector in the MooseVariableFV and INSFVVelocityVariable.
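To make the proposed toggle concrete, here is a minimal sketch of a gradient provider that either caches per-face gradients in a map or recomputes them on every call. It is only an illustration under stated assumptions: `FaceGradientProvider`, `Gradient`, `cache_face_gradients` and the placeholder `compute` are hypothetical names, not the MOOSE classes or the parameter that ended up in the PR, and the per-thread storage mentioned above is left out.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for a face gradient; the real code stores AD vectors.
struct Gradient
{
  double x, y, z;
};

// Hypothetical variable class: `cache_face_gradients` plays the role of the
// proposed input-file boolean.
class FaceGradientProvider
{
public:
  explicit FaceGradientProvider(bool cache_face_gradients)
    : _cache_face_gradients(cache_face_gradients)
  {
  }

  // Return the gradient on a face, consulting the cache only when enabled.
  Gradient gradient(std::size_t face_id)
  {
    if (!_cache_face_gradients)
      return compute(face_id); // no storage: recompute on every call

    auto it = _cache.find(face_id);
    if (it == _cache.end())
      it = _cache.emplace(face_id, compute(face_id)).first;
    return it->second;
  }

  std::size_t cachedEntries() const { return _cache.size(); }
  std::size_t computeCalls() const { return _compute_calls; }

private:
  // Placeholder for the real gradient reconstruction.
  Gradient compute(std::size_t face_id)
  {
    ++_compute_calls;
    const double v = static_cast<double>(face_id);
    return {v, 2.0 * v, 3.0 * v};
  }

  const bool _cache_face_gradients;
  std::unordered_map<std::size_t, Gradient> _cache;
  std::size_t _compute_calls = 0;
};
```

With caching off the map never grows, so memory stays flat at the price of one recomputation per query; with caching on, each face is computed once and then held for as long as the map lives.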

Impact

Lower memory cost.

Potential speedup by removing unstructured access into maps to retrieve gradients.

lindsayad commented 3 years ago

Unstructured access into maps? Do you mean on average O(1) complexity calls to unordered map emplace?

I don’t think you need to say this is specific to INSFV. As @makeclean’s profile showed, most of the memory allocation actually came from FVDiffusion.

GiudGiud commented 3 years ago

It is O(1) but it's likely a cache miss as well. Started testing and not seeing a speedup on INSFV. I'll post CPU and memory profiles in the PR.

That's true. I'll update the title.

lindsayad commented 3 years ago

Yea I don’t expect to see a speedup. I would look pretty dumb if these caches made the code slower.

The memory usage is what we’re after improving here.

GiudGiud commented 3 years ago

Attached are the memory profiles for an 8k-element, 32k-DOF 3D square channel, using a navier-stokes oprof build.

no_caching_gradients.pdf no_caching.pdf full_caching.pdf

Not caching anything does not seem to hit performance too much, and reduces memory costs quite a bit.

Looking at the code, I realized that for the two_term_boundary_expansion, caching makes things a lot easier. Would be more work to come up with an uncached way of doing that.

lindsayad commented 3 years ago

This indeed looks like a great improvement!! Can you post CPU profiles or at least timings?

GiudGiud commented 3 years ago

Trying to do a better job at them rn. Timings with another level of refinement:

with full caching: 1 min 8
with no caching: 1 min 9

I'll post CPU profiles soon.
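For scale, the two timings quoted above can be turned into a relative difference. This is just the arithmetic, written as a small sketch; the helper name `relativeSlowdownPercent` is made up for illustration, and a single pair of runs like this may well be within noise.

```cpp
// Relative slowdown of `candidate_s` with respect to `baseline_s`, in percent.
constexpr double relativeSlowdownPercent(double baseline_s, double candidate_s)
{
  return 100.0 * (candidate_s - baseline_s) / baseline_s;
}

// Timings quoted above: 1 min 8 s with full caching, 1 min 9 s with no caching.
constexpr double full_caching_s = 60.0 + 8.0;
constexpr double no_caching_s = 60.0 + 9.0;

// (69 - 68) / 68, i.e. roughly 1.5 %.
constexpr double slowdown = relativeSlowdownPercent(full_caching_s, no_caching_s);
```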

GiudGiud commented 3 years ago

The graphs are attached if you like those better. This is an opt build with 2 levels of refinement. The sampling rate is the default 100 samples/s, so these are pretty small runs. Going to try oprof to see if I get more info. This is gperftools btw, so the columns are self samples, self %, a cumulative running sum, self+children samples and self+children %.
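For anyone who wants to post-process the text output, the column layout described above can be read with a few lines of C++. This is a rough sketch that assumes each row looks like `474 6.1% 6.1% 719 9.3% libMesh::DofMap::_dof_indices`; `ProfileRow` and `parseProfileRow` are names invented here and are not part of gperftools.

```cpp
#include <sstream>
#include <string>

// One row of the text profile, in the column order described above.
struct ProfileRow
{
  long self_samples = 0;
  double self_percent = 0.0;
  double running_percent = 0.0; // cumulative running sum of self %
  long total_samples = 0;       // self + children
  double total_percent = 0.0;
  std::string function;
};

// Parse one row; returns false if the line does not match the expected layout
// (for instance the "Total: N samples" header).
inline bool parseProfileRow(const std::string & line, ProfileRow & row)
{
  row.function.clear();
  std::istringstream in(line);
  char pct1 = 0, pct2 = 0, pct3 = 0;
  if (!(in >> row.self_samples >> row.self_percent >> pct1 >> row.running_percent >> pct2 >>
        row.total_samples >> row.total_percent >> pct3))
    return false;
  if (pct1 != '%' || pct2 != '%' || pct3 != '%')
    return false;
  // The function name is the rest of the line and may contain spaces,
  // as in "operator new".
  std::getline(in >> std::ws, row.function);
  return !row.function.empty();
}
```

Comparing `self_samples` with `total_samples` for a given function is what separates time spent in the function itself from time spent in its callees.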

Curious about the top ones. The Mat stuff is PETSc, right? Is it the preconditioner or the solver? EDIT: MatSolve_SeqAIJ_Inode is the PC; MatMult_SeqAIJ_Inode is the matrix multiplications in GMRES.

Text version below:

No cache
Total: 7725 samples
Columns: self samples, self %, running sum %, self+children samples, self+children %, function
474  6.1%  6.1%  719  9.3% libMesh::DofMap::_dof_indices
445  5.8% 11.9%  448  5.8% pthread_attr_setschedparam
337  4.4% 16.3%  337  4.4% MatSolve_SeqAIJ_Inode
336  4.3% 20.6%  336  4.3% MatMult_SeqAIJ_Inode
334  4.3% 24.9%  411  5.3% Attribute::Attribute
305  3.9% 28.9%  305  3.9% __nss_database_lookup
300  3.9% 32.8%  311  4.0% MatSetValues_SeqAIJ
291  3.8% 36.5%  291  3.8% MetaPhysicL::DynamicSparseNumberBase::operator+=
259  3.4% 39.9%  347  4.5% libMesh::SparsityPattern::Build::handle_vi_vj
238  3.1% 43.0%  377  4.9% __libc_malloc
226  2.9% 45.9%  447  5.8% std::_Hashtable::_M_find_before_node
123  1.6% 47.5%  159  2.1% MooseVariableFV::isInternalFace
111  1.4% 48.9%  200  2.6% libMesh::TypeVector::operator*
105  1.4% 50.3% 1311 17.0% MooseVariableFV::getDirichletBC
105  1.4% 51.6%  105  1.4% __pthread_mutex_unlock_usercnt (inline)
105  1.4% 53.0%  105  1.4% memcpy
102  1.3% 54.3%  195  2.5% __cxxabiv1::__si_class_type_info::__do_dyncast
100  1.3% 55.6% 2237 29.0% INSFVVelocityVariable::adGradSln
100  1.3% 56.9%  100  1.3% __GI___pthread_mutex_lock
 96  1.2% 58.1%  820 10.6% libMesh::DofMap::dof_indices
 92  1.2% 59.3%  206  2.7% __cxxabiv1::__vmi_class_type_info::__do_dyncast
 81  1.0% 60.4%  781 10.1% MooseVariableDataFV::computeAD
 80  1.0% 61.4%  498  6.4% __dynamic_cast
 72  0.9% 62.4%  522  6.8% TheWarehouse::queryID
 71  0.9% 63.3%  122  1.6% std::__cxx11::basic_string::_M_construct
 70  0.9% 64.2%   70  0.9% cfree
 67  0.9% 65.0%  799 10.3% INSFVMomentumAdvection::interpolate
 67  0.9% 65.9%   67  0.9% VecMAXPY_Seq
 58  0.8% 66.7%  123  1.6% std::vector::_M_realloc_insert
 54  0.7% 67.4%   54  0.7% VecMDot_Seq
 54  0.7% 68.1%  151  2.0% std::_Hashtable::_M_emplace
 49  0.6% 68.7%  335  4.3% MooseMesh::cacheVarIndicesByFace
 48  0.6% 69.3%   48  0.6% libMesh::::monomial_n_dofs
 47  0.6% 69.9%  132  1.7% Assembly::prepareNeighbor
 47  0.6% 70.5%   47  0.6% std::_Rb_tree_increment@ba9b9
 46  0.6% 71.1%   59  0.8% libMesh::PetscVector::operator
 46  0.6% 71.7%  428  5.5% operator new
 45  0.6% 72.3%  170  2.2% INSFVMomentumAdvection::rcCoeff

Cache face values, but not gradients
Total: 7516 samples
Columns: self samples, self %, running sum %, self+children samples, self+children %, function
517  6.9%  6.9%  519  6.9% pthread_attr_setschedparam
346  4.6% 11.5%  507  6.7% libMesh::DofMap::_dof_indices
330  4.4% 15.9%  330  4.4% MatMult_SeqAIJ_Inode
329  4.4% 20.3%  329  4.4% MatSolve_SeqAIJ_Inode
301  4.0% 24.3%  301  4.0% __nss_database_lookup
288  3.8% 28.1%  311  4.1% MatSetValues_SeqAIJ
287  3.8% 31.9%  389  5.2% Attribute::Attribute
265  3.5% 35.4%  346  4.6% libMesh::SparsityPattern::Build::handle_vi_vj
262  3.5% 38.9%  452  6.0% __libc_malloc
251  3.3% 42.3%  251  3.3% MetaPhysicL::DynamicSparseNumberBase::operator+=
234  3.1% 45.4%  413  5.5% std::_Hashtable::_M_find_before_node
173  2.3% 47.7%  399  5.3% std::_Hashtable::_M_emplace
159  2.1% 49.8%  185  2.5% MooseVariableFV::isInternalFace
133  1.8% 51.6%  213  2.8% libMesh::TypeVector::operator*
121  1.6% 53.2%  121  1.6% memcpy
109  1.5% 54.6%  204  2.7% __cxxabiv1::__vmi_class_type_info::__do_dyncast
102  1.4% 56.0%  202  2.7% __cxxabiv1::__si_class_type_info::__do_dyncast
 99  1.3% 57.3% 1974 26.3% INSFVVelocityVariable::adGradSln
 99  1.3% 58.6%   99  1.3% __GI___pthread_mutex_lock
 98  1.3% 59.9%  817 10.9% MooseVariableDataFV::computeAD
 82  1.1% 61.0%  512  6.8% __dynamic_cast
 77  1.0% 62.0% 1173 15.6% MooseVariableFV::getDirichletBC
 75  1.0% 63.0%  472  6.3% TheWarehouse::queryID
 70  0.9% 64.0%  881 11.7% INSFVMomentumAdvection::interpolate
 70  0.9% 64.9%   70  0.9% __pthread_mutex_unlock_usercnt (inline)
 70  0.9% 65.8%   70  0.9% cfree
 68  0.9% 66.7%  113  1.5% std::__cxx11::basic_string::_M_construct
 60  0.8% 67.5%   60  0.8% VecMAXPY_Seq
 59  0.8% 68.3%   59  0.8% VecMDot_Seq
 57  0.8% 69.1%  575  7.7% libMesh::DofMap::dof_indices
 50  0.7% 69.7%  194  2.6% INSFVMomentumAdvection::rcCoeff
 49  0.7% 70.4%  124  1.6% Assembly::prepareNeighbor
 49  0.7% 71.0%  326  4.3% MooseMesh::cacheVarIndicesByFace
 48  0.6% 71.7%  104  1.4% std::vector::_M_realloc_insert
 46  0.6% 72.3%  507  6.7% operator new
 45  0.6% 72.9%  162  2.2% MooseMesh::faceInfo
 44  0.6% 73.5%   44  0.6% std::_Rb_tree_increment
 41  0.5% 74.0%   41  0.5% __cxxabiv1::__class_type_info::__dyncast_result::__dyncast_result (inline)
 41  0.5% 74.6%   41  0.5% libMesh::Cell::dim
 40  0.5% 75.1%   40  0.5% load_bytes (inline)
 38  0.5% 75.6%   39  0.5% MatLUFactorNumeric_SeqAIJ_Inode
 34  0.5% 76.1%   47  0.6% Assembly::prepareJacobianBlock
 32  0.4% 76.5% 2659 35.4% MooseVariableFV::adGradSln

Cache face values and gradients
Using local file cpu.prof.
Total: 7554 samples
Columns: self samples, self %, running sum %, self+children samples, self+children %, function
528  7.0%  7.0%  530  7.0% pthread_attr_setschedparam
333  4.4% 11.4%  333  4.4% MatSolve_SeqAIJ_Inode
330  4.4% 15.8%  330  4.4% MatMult_SeqAIJ_Inode
319  4.2% 20.0%  478  6.3% libMesh::DofMap::_dof_indices
309  4.1% 24.1%  326  4.3% MatSetValues_SeqAIJ
305  4.0% 28.1%  415  5.5% Attribute::Attribute
295  3.9% 32.0%  295  3.9% __nss_database_lookup
289  3.8% 35.8%  494  6.5% __libc_malloc
257  3.4% 39.3%  257  3.4% MetaPhysicL::DynamicSparseNumberBase::operator+=
256  3.4% 42.6%  338  4.5% libMesh::SparsityPattern::Build::handle_vi_vj
217  2.9% 45.5%  394  5.2% std::_Hashtable::_M_find_before_node
180  2.4% 47.9%  423  5.6% std::_Hashtable::_M_emplace
143  1.9% 49.8%  173  2.3% MooseVariableFV::isInternalFace
137  1.8% 51.6%  231  3.1% libMesh::TypeVector::operator*
112  1.5% 53.1%  112  1.5% memcpy
111  1.5% 54.6%  218  2.9% __cxxabiv1::__vmi_class_type_info::__do_dyncast
101  1.3% 55.9% 1992 26.4% INSFVVelocityVariable::adGradSln
101  1.3% 57.2% 1227 16.2% MooseVariableFV::getDirichletBC
100  1.3% 58.6%  187  2.5% __cxxabiv1::__si_class_type_info::__do_dyncast
100  1.3% 59.9%  100  1.3% cfree
 99  1.3% 61.2%   99  1.3% __GI___pthread_mutex_lock
 90  1.2% 62.4%  788 10.4% MooseVariableDataFV::computeAD
 74  1.0% 63.4%  492  6.5% __dynamic_cast
 71  0.9% 64.3%   71  0.9% __pthread_mutex_unlock_usercnt (inline)
 68  0.9% 65.2%  549  7.3% libMesh::DofMap::dof_indices@3a7150
 67  0.9% 66.1%  101  1.3% std::__cxx11::basic_string::_M_construct
 59  0.8% 66.9%  141  1.9% Assembly::prepareNeighbor
 58  0.8% 67.6%  889 11.8% INSFVMomentumAdvection::interpolate
 57  0.8% 68.4%  437  5.8% TheWarehouse::queryID

cache_values_not_grad.pdf nocache.pdf full_caching.pdf

GiudGiud commented 3 years ago

Probably needs more samples. Some trends:

adGradSln (29 - 26 - 26%), so caching helps.
Interpolate is marginally faster when values are not cached (10 - 11.5 - 12%).

What is computeAD? I wonder what is under dynamic cast, since self and self+callees are quite different.

EDIT: the oprof build profile is a little hard to read too. The cumulative numbers are going down without anything claiming the samples as 'self'. I wonder if it's because I didn't install libunwind.

no_caching.pdf

lindsayad commented 3 years ago

Where are you getting your pprof from for postprocessing the profiles? The graphs aren't as pretty as when I make them 😄 Do you know if you're using https://github.com/google/pprof or the pprof from https://github.com/gperftools/gperftools?

lindsayad commented 3 years ago

At least the callees at the bottom of the stack claim some samples for themselves but I agree that in places it looks like the sample count is going down seemingly by magic.

The overall trend does seem to be though:

Not caching saves a lot in memory while fairly negligibly affecting the CPU performance.

lindsayad commented 2 years ago

Closed by #18012

idaholab / moose