sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

Possible memory leak in Albany #150

Closed jwfoulk closed 7 years ago

jwfoulk commented 7 years ago

Greetings,

I believe there is a memory leak in Albany. I see this with medium-size jobs (< 200,000 nodes) on machines with 256 Gb of memory. The job just dies as memory requirements go through the roof on initialization. This issue does not seem to affect the test suite, because the meshes might be small enough. Not sure. Obviously the tests run. Rather than focus on the large jobs, I tried to replicate some of the issues with valgrind on smaller problems. I was running two smaller problems and I received similar reporting from valgrind. To make it easier, I just chose a boundary-value problem that was in the tests repository.

tests/large/LCM/CrystalPlasticity/MultiSlipHard

valgrind --leak-check=full --leak-resolution=high --num-callers=40 /home/jwfoulk/LCM/albany-build-serial-gcc-debug/src/AlbanyT MultiSlipPlaneHard_Implicit.yaml &> memoryLeak.log

I'm attaching the log file.

I wish the information were more fine-grained. I'll keep looking at this - I just wanted to update Dan and Alejandro.

Cheers,

Jay

memoryLeak.txt

ibaned commented 7 years ago

Hmm, all of the leaks reported in this log are due to OpenMPI doing very illegal things with memory. As horrific as their code is, I don't think it has serious leaks. I recommend building with MPICH, even though that is sadly rarely offered at Sandia. Could we switch gears to the mesh simulation that actually ran out of large amounts of RAM?

djlittl commented 7 years ago

Hi Jay. It would be good to do a back-of-the-envelope calculation for the memory required to store one element's worth of state data. How many integration points per element, and how much state data (doubles) per integration point?

ibaned commented 7 years ago

Well, assuming "200,000 nodes" means finite element nodes in the mesh, then it should not be approaching 256GB. That would be over 1MB per finite element node, which is several orders of magnitude higher than any simulation I've seen.

lxmota commented 7 years ago

Jay, have you tried running this problem on Algol? I just want to make sure that this is not machine specific. Remember that Proxima was issuing some weird memory errors.

ibaned commented 7 years ago

Well, I guess you have everyone's attention :) Can you make the files available to us for the case that really did run out of memory?

lxmota commented 7 years ago

@ibaned Would our build system and modules work with MPICH?

ibaned commented 7 years ago

They are based on the existing modules, and I don't recall seeing an existing MPICH module on any machine. We also don't offer a choice right now, so if we do switch it would be a permanent change of default.

jwfoulk commented 7 years ago

I am running to another meeting but I will get back to everyone tonight regarding the data, machine, and the problem of interest. Many thanks for all the comments.

djlittl commented 7 years ago

I still think the napkin calculation is worthwhile, if for no other reason than to get an idea of what we should expect.

Are there 10 integration points in a composite tet? I'll go with that. I estimate 300 scalars per material point (the Sierra version has 117 doubles, times two for state old and state new, then beef it up a bit to 300 because Albany). And I'll go with 200,000 elements to get in the ballpark for a mesh with 200,000 nodes.

10 mat_pt/element * 200,000 elements * 300 doubles/mat_pt * 8 bytes/double = 4.8e9 bytes = 4.8 GB

And then there's some big multiplier for AD types, right?

(Corrections welcome on my math above.)
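
Here is the same napkin math as a small Python sketch (every input is the guess above, not a measured value):

# Back-of-the-envelope state-data estimate (assumed inputs from the comment above).
integration_pts_per_elem = 10   # assumed composite-tet integration point count
num_elements = 200_000          # ballpark element count for a ~200,000-node mesh
doubles_per_pt = 300            # ~117 Sierra doubles, x2 for old/new state, padded up
bytes_per_double = 8

state_bytes = integration_pts_per_elem * num_elements * doubles_per_pt * bytes_per_double
print(state_bytes, state_bytes / 1e9, "GB")   # 4.8e9 bytes = 4.8 GB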

ibaned commented 7 years ago

Albany being implicit versus Sierra being explicit makes a big difference. In our recent experiences at RPI, the dominant memory usage has always been the Krylov vectors for the GMRES solver, because we need a couple thousand of them in order to converge.

The AD types are allocated one workset at a time, so decreasing the workset size should ameliorate that cost.
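
To make that scaling concrete, here is a hedged sketch; the field count is a placeholder, and the derivative length assumes one derivative per element DOF plus the value (e.g. 8 nodes x 3 DOFs gives 25 doubles per DFad entry):

# Hypothetical per-workset Jacobian memory; only the scaling matters, not the exact number.
workset_size = 1000                 # the default discussed later in this thread
nodes_per_elem, dofs_per_node = 8, 3
fad_len = nodes_per_elem * dofs_per_node + 1    # value + one derivative per element DOF
scalars_per_elem = 500              # placeholder: total Fad field entries per element
bytes_per_double = 8

workset_bytes = workset_size * scalars_per_elem * fad_len * bytes_per_double
print(workset_bytes / 1e9, "GB per workset")    # ~0.1 GB with these guesses

Since these containers are sized by the workset, cutting the workset size cuts this term proportionally.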

lxmota commented 7 years ago

@jwfoulk With regards to machines, I would suggest:

The most severe leaks occur with hwloc, so the next step would be to disable it either on Algol or Proxima and run with valgrind again. But there would still be serious leaks from OpenMPI itself. Maybe we have a buggy version.

@djlittl There are 4 integration points for the composite tet. And I believe the CP model has a maximum of 48 internal variables, so the AD types will have that extra storage.

ibaned commented 7 years ago

I would expect to see these OpenMPI leaks everywhere. It's a long-standing, well-known issue with OpenMPI. However, they should be constant-size, one-time leaks, not something that can actually take down 200GB.

djlittl commented 7 years ago

@ibaned Right, I was just looking at the state data for the material model. I agree with your point about the solvers. (FWIW, Sierra runs both explicit and implicit.)

The more I think about it, it's pretty difficult to estimate what the memory requirements for the AD types are. I think @lxmota is correct that the maximum number of unknowns in the state update solve is 48, but there is also AD data coming in from Albany for the tangent matrix calculation (I think we end up with nested AD types). I don't think a multiplier of one zillion can be ruled out.

ibaned commented 7 years ago

Although I don't know a ton about the tangent computation, I do know it should also be on a per-workset basis. So that can still be controlled using workset size. Still, the historical statistics on symptoms like these are overwhelmingly in favor of a silly mistake or bug, rather than an unexpected consequence of code working exactly as expected.

djlittl commented 7 years ago

I can vouch for the fact that there's no way the CP model runs as expected :)

jwfoulk commented 7 years ago

First - I agree with Dave. We still have some work to do regarding the CP model. We'll get there.

The problem in question, with 128,000 elements having 8 integration points, is found on the CEE LAN at:

/gpfs1/jwfoulk/polycrystalIssue

3DPolycrystalMulti_1_Rate_Ten_Newton_LMR_Material.yaml
3DPolycrystalMulti_1_Rate_Ten_Newton_LMR.yaml
meshpt2xpt1xpt1_151blocks_1.g

I'm asking for the Cauchy stress (9), deformation gradient (9), CP residual, CP iterations, eqps, and output integration weights. By Dave's math, I believe that comes to 0.18 Gb for the internal variables and a bit more for nodal quantities (displacements, residuals).
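
For the record, here is that 0.18 Gb figure as a small Python sketch (22 scalars per integration point is my reading of the output list above; stored state only, no AD):

# Stored output state for the 128,000-element, 8-integration-point problem.
num_elements = 128_000
ips_per_elem = 8
# Cauchy stress (9) + deformation gradient (9) + CP residual + CP iterations
# + eqps + integration weights = 22 scalars per integration point
scalars_per_ip = 9 + 9 + 1 + 1 + 1 + 1
bytes_per_double = 8

state_bytes = num_elements * ips_per_elem * scalars_per_ip * bytes_per_double
print(state_bytes / 1e9, "GB")   # ~0.18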

Sorry for my late reply. Thanks again.

jwfoulk commented 7 years ago

Hi everyone,

I have some news. I found an interesting symptom that I thought I might relate.

First and foremost, I ran the above problem in serial. It took 37.2 Gb of memory. The solver converged, etc. This is well below the 256 Gb of memory on Proxima & Algol. I should note that the same issues occur on both machines.

I decided to then just run the problem on 2, 3, and 4 processors and monitor the memory usage (through my sophisticated tool - top).

1 processor: 37.2 Gb of memory
2 processors: 32.9 Gb and 35.8 Gb of memory
3 processors: 38.0 Gb, 34.1 Gb, and 33.2 Gb of memory
4 processors: 36.8 Gb, 36.5 Gb, 33.7 Gb, and 31.2 Gb of memory

As you can see, the memory is not reducing with additional processors (as one might expect). Albany is allocating > 30 Gb of memory for each processor no matter the number of processors. No wonder 32 processors crashed - the memory requirements would have been ~ 1 Tb.

I know we don't run performance tests, but has anyone seen this kind of behavior? I guess my next step is to do bisection, provided I can get the code to build.

-Jay

ikalash commented 7 years ago

Could this be due to AD? I have seen huge memory usage due to AD, especially when using DFAD (dynamic allocation of AD arrays). Switching to SLFAD can help in this case.

jwfoulk commented 7 years ago

Hi Irina,

I'm not sure. I just remember running these jobs a while back, and now I run out of memory. I have been bisecting the issue this morning. It looks like on 1/20/2017 the memory usage looks good. I'm using a little over 3 Gb per processor (for 4 processors). This is in contrast to > 30 Gb/processor (see above). It was "broken" on 2/8/2017, so I'm already within 3 weeks of finding it. I'll keep you updated. The process is not exactly rigorous - I'm just pulling Trilinos and Albany on the same day and hoping they build. We'll see...

I do run the simulations for a few hours to make sure everything gets invoked; I both cut the step and converge to equilibrium with the boundary value problem, which has 151 element blocks employing crystal plasticity.

Thanks for the note and your interest.

ibaned commented 7 years ago

Hi Jay, back from travel. It's good news that you identified a past commit that doesn't use as much memory! It sounds like you're converging on the time range too; please keep us posted. Thanks

jwfoulk commented 7 years ago

Hi everyone,

I believe I tracked down the commit that causes issues with memory.

Commit that works:

commit 0efac37ba020d2f8e6827b5cbca4bb50dd0de561
Author: Dan Ibanez <dan.a.ibanez@gmail.com>
Date:   Tue Jan 31 12:53:06 2017 -0700

progress on issue #34 in LCM surface-element

Commit that causes the memory issue in parallel:

commit 2eb2559f23cca3b7979d07f952687944ac17be7f
Author: Dan Ibanez <daibane@sandia.gov>
Date:   Tue Jan 31 21:16:57 2017 -0700

changing all default workset sizes to 1000

per issue #43

there is also now a single unique definition,
and the parameter type is explicitly chosen
instead of relying on the default literal type

In both cases, Albany was compiled with this flavor of Trilinos

commit 7abbaad5ecb6841cfbe4a732a0969c1f34e230a0
Author: Mark Hoemmen <mhoemme@sandia.gov>
Date:   Thu Feb 2 09:06:39 2017 -0700

I do not set the workset size in my input deck.

Hope that helps.

ibaned commented 7 years ago

Ack! My name is on these!

In all seriousness, it seems like you simply have a setup in which very large amounts of memory are allocated per workset element. The default used to be 50, and I changed it to 1000. So we have:

(30GB - 3GB) / (1000 elem - 50 elem) = 28MB per element

There are basically two things I can say here:

  1. 28MB per element is really very large. If I were you or someone else developing this model, I'd look very carefully to see if all that memory is indeed needed for the physics. There may simply be a lot of wasted memory.
  2. I still think 1000 is a better default. Even assuming your model really needs that much memory, it seems to be the exception and not the rule across uses of Albany. The default is just a constant right now, so we should choose something that works better for the common case, not the worst case.

So I basically recommend the following:

  1. Set the workset size explicitly, to 50 or even less.
  2. Someone ought to do some back of the envelope calculations on expected memory usage, and/or use tools like Valgrind Massif to study the sources of allocations.

As @ikalash points out, AD can be part of this story, and switching to SLFad may be a quick way to get some data.
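
To make the 28MB-per-element figure actionable, here is a rough extrapolation of the expected footprint at other workset sizes (just a two-point linear fit to the ~3GB and ~30GB observations, so treat the numbers as ballpark only):

# Linear memory model fitted to the two footprints reported in this thread.
gb_at_50, gb_at_1000 = 3.0, 30.0
per_elem_gb = (gb_at_1000 - gb_at_50) / (1000 - 50)   # ~0.028 GB = ~28 MB per workset element
base_gb = gb_at_50 - per_elem_gb * 50                 # fixed cost independent of workset size

for ws in (25, 50, 100, 1000):
    print(ws, round(base_gb + per_elem_gb * ws, 1), "GB")   # 25 -> ~2.3, 50 -> 3.0, 100 -> ~4.4, 1000 -> 30.0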

lxmota commented 7 years ago

I made some changes to the profile module to use valgrind to help debug this. The changes are in the commit referenced above. The actual lines for usage are missing from the commit message. Here is how to use this:

$ module load lcm/fedora # or lcm/sems or lcm/cluster
$ module load profile
$ cd ~/LCM
$ ./clean-config-build.sh trilinos 72 -V && ./clean-config-build.sh albany 72 -V
$ valgrind --tool=callgrind AlbanyT input.yaml
$ kcachegrind

jwfoulk commented 7 years ago

I just wanted to give everyone some practical data on the largish problem noted above. This data does point to the fact that workset sizes exceeding 50 can be problematic on clusters even with models having very few internal state variables (J2 plasticity). It also points to the fact that minisolver is probably not the issue, because one can run J2 constitutive models with and without minisolver and consume the same amount of memory.

For instance, if I run J2 without minisolver on 128,000 hex8 elements and monitor the memory usage, I find that workset sizes of 25, 50, 100, 200, and 400 result in memory footprints on each core of roughly 1.0 Gb, 1.3 Gb, 2.2 Gb, 4.0 Gb, and 7.7 Gb, respectively. Yes - the crystal plasticity nearly doubles the memory usage (1.7 Gb for a workset of 25), but those footprints are quite large for a standard constitutive model. Given that skybridge can support roughly 4 Gb/core (each node has 64 Gb of memory and 16 cores), one cannot practically exceed workset sizes of 100 with J2 plasticity.

Even in serial, the workset size dominates the memory footprint. Running that same model with J2 in serial, workset sizes of 25, 50, 100, 200, and 400 result in memory usage of 3.0 Gb, 3.4 Gb, 4.3 Gb, 6.1 Gb, and 9.8 Gb, respectively. This may point to the fact that AD might be the driver (and not the number of integration points, internal state variables, or even the assembled system). Clearly more analysis is needed.
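
For what it's worth, a quick sketch of the marginal cost implied by those serial numbers (a simple end-point slope, so only a rough indicator):

# Marginal memory per additional workset element, from the serial J2 runs above.
workset_sizes = [25, 50, 100, 200, 400]
memory_gb     = [3.0, 3.4, 4.3, 6.1, 9.8]

slope_gb = (memory_gb[-1] - memory_gb[0]) / (workset_sizes[-1] - workset_sizes[0])
print(round(slope_gb * 1000, 1), "MB per additional workset element")   # ~18 MB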

Perhaps this is obvious. That said, I wanted to post these findings because people have been having MPI errors on clusters, and I believe the root of those errors is probably memory usage.

I ran J2 on skybridge with a workset size of 50, and I also ran crystal plasticity on skybridge with a workset size of 25. In both cases, I kept the memory usage around 1.5 Gb/core and experienced no issues. Good news for the Intel builds. Thanks again to Dan for getting that rolling.

So - we still need to understand the root of the issue - I just wanted to give everyone some data. Based on these limited findings, a workset size of 1000 is not quite right. Maybe 50 is a place to begin, and people running more memory-intensive models (like crystal plasticity) will be required to manually set the workset size to 25.

ibaned commented 7 years ago

Thanks @jwfoulk for getting us another step closer to the answer! I was at USNCCM this week, but next week, when I can get at the CEE LAN, I'll run your input with Valgrind and try to track down exactly what component is eating all these gigabytes.

ibaned commented 7 years ago

Apparently @jwfoulk found similar issues with J2 and even elasticity. Awaiting input files for the elasticity case, as this is likely the easiest case to debug. If I get time in the short term before getting these files, I'll start with @jwfoulk's crystal case.

ibaned commented 7 years ago

Just ran Valgrind Massif on the crystal plasticity files provided by @jwfoulk.

It did allocate over 30GB, and the vast majority of memory is in Kokkos+DFad containers for the Jacobian computation (the 45% below is misleading; there is another trace that consumes another 30% which is almost identical):

->97.75% (33,166,331,688B) 0x123352F7: Kokkos::HostSpace::allocate(unsigned long) const (Kokkos_HostSpace.cpp:179)
| ->97.75% (33,166,317,856B) 0x123361C6: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, void>::SharedAllocationRecord(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, void (*)(Kokkos::Impl::SharedAllocationRecord<void, void>*)) (Kokkos_HostSpace.cpp:331)
| | ->94.38% (32,020,040,064B) 0x59BB6D6: Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Serial, double, true> >::allocate(Kokkos::HostSpace const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) (Kokkos_SharedAlloc.hpp:214)
| | | ->45.07% (15,292,081,056B) 0x61900DB: PHX::KokkosViewFactory<Sacado::Fad::DFad<double>, Kokkos::Serial>::buildView(PHX::FieldTag const&, std::vector<int, std::allocator<int> > const&) (KokkosExp_View_Fad.hpp:969)
| | | | ->45.07% (15,292,081,056B) 0x6199C0F: PHX::EvaluationContainer<PHAL::AlbanyTraits::Jacobian, PHAL::AlbanyTraits>::postRegistrationSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, PHX::FieldManager<PHAL::AlbanyTraits>&) (Phalanx_KokkosViewFactoryFunctor.hpp:35)
| | | |   ->45.07% (15,292,081,056B) 0x6174E88: void PHX::FieldManager<PHAL::AlbanyTraits>::postRegistrationSetupForType<PHAL::AlbanyTraits::Jacobian>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (Phalanx_FieldManager_Def.hpp:216)
| | | |     ->45.07% (15,292,081,056B) 0x6146D11: Albany::Application::postRegSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (Albany_Application.cpp:4747)                          
| | | |       ->45.07% (15,292,081,056B) 0x615BC4B: Albany::Application::computeGlobalJacobianImplT(double, double, double, double, Teuchos::RCP<Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const> const&, Teuchos::RCP<Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const> const&, Teuchos::RCP<Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, Teuchos::RCP<Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> > const&, Teuchos::RCP<Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> > const&) (Albany_Application.cpp:1831)
| | | |         ->45.07% (15,292,081,056B) 0x615F043: Albany::Application::computeGlobalJacobianT(double, double, double, double, Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const*, Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const*, Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, Tpetra::Vector<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>*, Tpetra::CrsMatrix<double, int, long long, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>&) (Albany_Application.cpp:2201)
| | | |           ->45.07% (15,292,081,056B) 0x61B24AB: Albany::ModelEvaluatorT::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const (Albany_ModelEvaluatorT.cpp:739)

The next step is to dig further into each of these allocations and see if they're in the expected ballpark.

I'd really prefer to have input files for J2 or elasticity though... This particular case has 151 element blocks, so it's really pushing the limits along that dimension.

ibaned commented 7 years ago

Okay, I think the key to all this is that @jwfoulk's case has 151 element blocks. Before computing the Jacobian, Albany will preallocate Kokkos DFad arrays for one workset... for every element block. Here is a detailed analysis of the Kokkos arrays used by one workset (of the first element block):

allocated (Displacement:Sacado::Fad::DFad<double>:<Cell,Node,Dim>(882,8,3), 882, 8, 3, 25
view bytes: 4233600
allocated (Velocity:Sacado::Fad::DFad<double>:<Cell,Node,Dim>(882,8,3), 882, 8, 3, 25
view bytes: 4233600
allocated (Acceleration:Sacado::Fad::DFad<double>:<Cell,Node,Dim>(882,8,3), 882, 8, 3, 25
view bytes: 4233600
allocated (Acceleration:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim>(882,8,3), 882, 8, 3, 25
view bytes: 4233600
allocated (Displacement Gradient:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (F:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (J:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (Time:Sacado::Fad::DFad<double>:<Dummy>(1), 1, 25
view bytes: 200
allocated (Delta Time:Sacado::Fad::DFad<double>:<Dummy>(1), 1, 25
view bytes: 200
allocated (CP_Residual:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (CP_Residual_Iter:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (Cauchy_Stress:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (Fp:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (Re:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (Velocity_Gradient:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (Velocity_Gradient_Plastic:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (eqps:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_1:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_10:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_11:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_12:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_2:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_3:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_4:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_5:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_6:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_7:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_8:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_9:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_1:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_10:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_11:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_12:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_2:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_3:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_4:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_5:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_6:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_7:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_8:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (gamma_dot_9:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_1:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_10:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_11:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_12:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_2:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_3:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_4:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_5:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_6:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_7:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_8:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_9:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_1:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_10:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_11:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_12:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_2:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_3:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_4:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_5:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_6:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_7:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_8:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (tau_hard_9:Sacado::Fad::DFad<double>:<Cell,QuadPoint>(882,8), 882, 8, 25
view bytes: 1411200
allocated (FirstPK:Sacado::Fad::DFad<double>:<Cell,QuadPoint,Dim,Dim>(882,8,3,3), 882, 8, 3, 3, 25
view bytes: 12700800
allocated (Displacement Residual:Sacado::Fad::DFad<double>:<Cell,Node,Dim>(882,8,3), 882, 8, 3, 25
view bytes: 4233600
allocated (Scatter:Sacado::Fad::DFad<double>:<Dummy>(0), 0, 25
view bytes: 0
allocated (F:Sacado::Fad::DFad<double>:<Cell,Dim,Dim>(0,3,3), 0, 3, 3, 25
view bytes: 0
allocated (strain:Sacado::Fad::DFad<double>:<Cell,Dim,Dim>(0,3,3), 0, 3, 3, 25
view bytes: 0
allocated (gradu:Sacado::Fad::DFad<double>:<Cell,Dim,Dim>(0,3,3), 0, 3, 3, 25
view bytes: 0
allocated (Itensor:Sacado::Fad::DFad<double>:<Dim,Dim>(3,3), 3, 3, 25
view bytes: 1800
workset needs 196159000 bytes

This workset has 882 cells and needs 196MB of memory. This means each cell has 222KB of state variable data. Now, taking into account that this is happening for all 151 element blocks, we get 196MB * 151 ≈ 30GB, and indeed 30GB was the listed memory usage of the process during the Jacobian computation.
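
As a sanity check, the per-view byte counts above are consistent with a DFad length of 25 doubles (the value plus one derivative per element DOF, 8 nodes x 3 DOFs); a small sketch reproducing a few of them and the 151-block total:

# Reproduce a few "view bytes" entries from the dump and the 151-block total.
cells, nodes, qps, dim = 882, 8, 8, 3
fad_len = 25                      # 1 value + 24 derivatives (8 nodes x 3 DOFs)
B = 8                             # bytes per double

print(cells * nodes * dim * fad_len * B)       # Displacement <Cell,Node,Dim>: 4233600
print(cells * qps * dim * dim * fad_len * B)   # Cauchy_Stress <Cell,QuadPoint,Dim,Dim>: 12700800
print(cells * qps * fad_len * B)               # eqps <Cell,QuadPoint>: 1411200

workset_bytes = 196_159_000                    # "workset needs 196159000 bytes" above
print(workset_bytes * 151 / 1e9, "GB")         # ~29.6 GB across 151 element blocks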

So, I think the #1 contributor to memory usage in this case is the use of way too many element blocks. My assumption is that this was the only way to get spatially varying material properties into Albany?

Assuming I'm right, I think the most correct solution would be to implement a more general system for material properties that doesn't force people to use tons of element blocks. The second most correct solution is to refactor Albany to only preallocate and compute on one element block at a time.

djlittl commented 7 years ago

If the only difference between the element blocks is the crystallographic orientation, then we already have a fix for this. Jay can put the orientations on the mesh (.g file) and can then treat the entire model as a single block.

I think the script that puts the orientations on the mesh currently only functions properly if all the blocks are crystal plasticity blocks (i.e., it needs work to support Schwarz coupling problems).

Copy-and-pasting from an old email:

The idea is that you’ll first create a genesis mesh (e.g., with Cubit) in which each grain is a separate block. This is what we’ve been doing. You’ll also create a text file that contains the rotation matrices for each block. You’ll then run a python script like so:

imprint_orientation_on_mesh.py Mesh.g rotation_matrices.txt --combine_blocks

This will create a new genesis file called Mesh_orientation.g that has the orientation matrices stored at each element as element attributes. The "--combine_blocks" option, as the name suggests, combines all the grains into a single block. This is the use case I envision most folks using: the material properties are the same across the entire polycrystal, thus you only need one block. The orientations are of course not the same over the entire polycrystal, but they're on the mesh now, so we don't need separate blocks for them. If you do want different material properties on different blocks, just don't use the "--combine_blocks" option, and the original blocks should be preserved.

Once the orientation is on your mesh, you can omit the basis function definitions in the material xml file and instead include the parameter:

<Parameter name="Read Lattice Orientation From Mesh" type="bool" value="true"/>

There is a test called OrientationOnMesh that exercises the new functionality.

ibaned commented 7 years ago

Excellent! I recommend you move forward with always using this --combine_blocks system, and let us know how it goes in terms of memory use.

lxmota commented 7 years ago

But there is still the case of the high memory usage for J2 and even elastic materials.

@jwfoulk could you provide your input files for the Neohookean case?

ibaned commented 7 years ago

@lxmota I suspect those cases still used 151 element blocks, but let's wait for confirmation on that.

jwfoulk commented 7 years ago

Hi everyone. Yes - the cases with J2 and even Neohookean do indeed employ 151 blocks. I wanted to use both the same mesh and the same number of blocks. It seems that Dan's math is correct. Great find, Dan - thank you. Dave is also correct in that for the CP model, we can combine blocks. I'm glad to experiment with a single block and get back to everyone soon.

I actually like keeping all the blocks separate because I can easily isolate grains and look at grain-averaged quantities. That said, let's just get more data and make the best decision. Going through one block at a time in parallel does make sense, in that one would not be computing on all blocks for each processor.

Ok. Let me do some work and get back to everyone soon. Thanks again. -Jay

ibaned commented 7 years ago

Note that there may be ways for you to keep many Exodus element blocks and not have all this preallocated memory in Albany. One way is to define a single Albany element block despite having many Exodus element blocks (we implement mappings like this for the PUMI discretization). There is also the "all element block have same physics" boolean which I know is supported in PUMI and is probably also supported by the STK discretization.

djlittl commented 7 years ago

One other option is that we could run as a single block, but then post-process the exodus output and map it back to the original multi-block structure. This wouldn't be too difficult, basically a reverse operation of what imprint_orientation_on_mesh.py is doing with the --combine_blocks option. Then you could look at individual grains in Paraview or whatever. Obviously the post-processing script would need access to the original multi-block genesis file in addition to the exodus output from LCM.

ibaned commented 7 years ago

Even if you just define a field over elements that assigns to them a grain number, you can use ParaView Threshold on that. So yea, lots of ideas here.

agsalin commented 7 years ago

Great forensics Dan.

Seems to me that setting Workset Size to something small (5) would solve the memory problem without needing to do anything with exodus or paraview. The workset size was meant to amortize some of the function-calling overhead by doing a chunk of work at a time (using a chunk of temporary memory), but I expect there are diminishing returns increasing from 5 to 50 for expensive models.

ibaned commented 7 years ago

Yea, we identified workset size as a quick workaround early on. There was still concern about why there was so much data per element, which I guess we now know.

I'd actually be interested in seeing plots of runtime versus workset size (we already know what memory use looks like). I suspect there are still benefits going from 5 to 50, especially considering that things like Phalanx registration and allocation are done at the workset level. It's not just function calls.

jwfoulk commented 7 years ago

Hi again - I just wanted to update everyone regarding results when combining all 151 blocks into a single block. To remind everyone of the results with crystal plasticity on 151 blocks, the memory usage was roughly 37 Gb, 34 Gb/processor, 34 Gb/processor, and 34 Gb/processor for 1, 2, 3, and 4 processors (with the default workset size of 1000). The same simulation with one block now consumes roughly 3 Gb, 1.3 Gb/processor, and 0.6 Gb/processor for 1, 4, and 32 processors (with a specified workset size of 1000).

The issue clearly lies with the number of element blocks and the fact that Albany allocates workset memory for each block on every processor. Clearly there are workarounds both in the simulation process (employing only 1 block) and in explicitly specifying the workset size. I will probably employ a smaller workset size in the near term.

Many thanks to Dan and to everyone for the help in diagnosing this issue. -Jay