sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code
Other
271 stars 90 forks source link

Rebalance Logic #590

Closed mperego closed 3 years ago

mperego commented 4 years ago

Rebalance is always turned on when Use Serial Mesh is true.

  if(rebalance || (useSerialMesh && comm->getSize() > 1)) {
     rebalanceAdaptedMeshT(params, comm);

https://github.com/SNLComputation/Albany/blob/master/src/disc/stk/Albany_GenericSTKMeshStruct.cpp#L639

In fact rebalance can change the number of worksets per process. Even if Workset Size is set ot -1, one could have multiple workset per process. I encountered this with extruded meshes using ExtrudedSTKMeshStruct. This could be a problem,e.g. for memoization. I propose to turn on rebalance only when Rebalance Mesh is true, without assuming that one always wants rebalance also when Use Serial Mesh is true.

@bartgol @gahansen

bartgol commented 4 years ago

If you have 'Use Serial Mesh' and you don't rebalance, don't you end up with the mesh only on one processor? That was my understanding: rebalance cause the (serial) mesh is all on rank 0.

mperego commented 4 years ago

No.. not sure how it works but turning off re-balancing doesn't make the run serial.

bartgol commented 4 years ago

So even with rebalancing off, dofs are divided among ranks?

mperego commented 4 years ago

Yes.. unless I'm going crazy (which is always a possibility)

bartgol commented 4 years ago

Well, if that's the case, then, as you said, we should change the code, and only do rebalance if explicitly requested.

gahansen commented 4 years ago

I think @agsalin converted Albany from an approach that I wrote long ago, where we used rebalance to "spread" the mesh across multiple processors after we input the serial mesh on PE 0. It looks like Andy converted the code to use Ioss's parallel I/O capabilities to do this instead, I suspect that we do not need or want to have the rebalance word in there any longer. Indeed, I suspect rebalance is only potentially useful if we either use adaptivity (via percept), or if the physics balance changes over time. I be glad to take a quick, deeper look and this and prune it (and percept, along with other adaptivity logic that is likely stale now). We can revisit a new rebalance later if needed.

bartgol commented 4 years ago

I would keep the rebalance cabability. It seems the only modification to do is a one line change in the GenericSTKMeshStruct, removing || (...) from the if statement.

mperego commented 4 years ago

@bartgol I tried to remove the || (...) from the if statement. The tests work except from yours on Hydrology. I get this error:

 p=1: *** Caught standard std::exception of type 'std::runtime_error' :
2:
2:  Deprecation Error: Kokkos::deep_copy extents of views don't match: MV::DualView(0) (1)
2:  Traceback functionality not available

I noticed that these are the only tests using Gmsh. Any clue? You should be able to reproduce the error adding to the discretization list

 Rebalance Mesh: false
bartgol commented 4 years ago

Uhm, right off the bat I have no clue. I will dig a bit. Thanks.

jewatkins commented 4 years ago

I just noticed this issue. This might have affected my results in #540 so mentioning it for future reference.

bartgol commented 4 years ago

Whoops, I forgot to follow up on this. Sorry. I will (try to) get to this tomorrow.

mperego commented 4 years ago

@bartgol sorry for bugging you again today.. do you have a chance to look at this issue? :-)

bartgol commented 4 years ago

...and I forgot again. Sorry. Yes, I'll look into it today.

bartgol commented 4 years ago

You're right. With Use Serial Mesh: true, the mesh is still repartitioned among ranks, even if I comment out the second half of the if statement. Not just that, but if I do run the rebalance, the code crashes during the call to buildCellSideNodeNumerationMap (I don't think I'm shuffling around the numeration maps in case of rebalancing, which is wrong).

I think Mauro's solution of enabling rebalance only if explicitly requested is fine. I would also put an error if the mesh has side_maps_present=true.

mperego commented 4 years ago

You're right. With Use Serial Mesh: true, the mesh is still repartitioned among ranks, even if I comment out the second half of the if statement. Not just that, but if I do run the rebalance, the code crashes during the call to buildCellSideNodeNumerationMap (I don't think I'm shuffling around the numeration maps in case of rebalancing, which is wrong).

I think Mauro's solution of enabling rebalance only if explicitly requested is fine. I would also put an error if the mesh has side_maps_present=true.

But what about the failure wiht the hydrology test I mentioned above?

bartgol commented 4 years ago

Ah, sorry, I missed that. Let me check.

mperego commented 4 years ago

Hi Luca, sorry for pinging you again on this. Did you find out if it is safe for the hydrology test to disable the default Rebalance for Use Serial Mesh: true ?

bartgol commented 4 years ago

I removed the || (...) in GenericSTKMeshStruct, and Hydrology works on my laptop, with or without the Rebalance Mesh: false option in the input file.

Additionally, I noticed that the Use Serial Mesh is present in a few STK mesh structs, but if we remove the || (...) part from the if statmentit, then it only makes sense for Ioss meshes. So we should probably purge it from the others...

mperego commented 4 years ago

@bartgol I opened PR #626. As you can see I had to turn on Rebalance for the Hydology tests. If I don't I get these errors:

2: The gmsh version is: 2.2
2: [GenericSTKMeshStruct] Processing field requirements...
2:   - Computing Node Scalar field 'ice_thickness' from mathematical expression(s): h*(1-(x^2+y^2)/R^2) (with h=0.5; R=25).
2:   - Computing Node Scalar field 'surface_height' from mathematical expression(s): h*(1-(x^2+y^2)/R^2) (with h=0.5; R=25).
2:   - Filling Node Scalar field 'surface_water_input' with constant value {5.47999999999999972e+01}.
2:   - Computing Node Vector field 'basal_velocity' from mathematical expression(s): 0*r; tmp*100*((r-R1)/(L-R1))^5 (with R1=5; L=22.5; r=(x^2+y^2)^0.5; tmp=(r>=R1 ? 1.0 : 0.0)).
2:   - Skipping field 'effective_pressure' since it's listed as output. Make sure there's an evaluator set to save it!
2:   - Skipping field 'water_thickness' since it's listed as output. Make sure there's an evaluator set to save it!
2:   - Skipping field 'hydraulic_potential' since it's listed as output. Make sure there's an evaluator set to save it!
2:   - Skipping field 'ice_overburden' since it's listed as output. Make sure there's an evaluator set to save it!
2:   - Skipping field 'water_discharge' since it's listed as output. Make sure there's an evaluator set to save it!
2:
2: p=1: *** Caught standard std::exception of type 'std::runtime_error' :
2:
2:  Deprecation Error: Kokkos::deep_copy extents of views don't match: MV::DualView(0) (1)
2:  Traceback functionality not available
2:
2:
2: p=2: *** Caught standard std::exception of type 'std::runtime_error' :
2:
2:  Deprecation Error: Kokkos::deep_copy extents of views don't match: MV::DualView(0) (1)
2:  Traceback functionality not available
2:
2:
2: p=3: *** Caught standard std::exception of type 'std::runtime_error' :
2:
2:  Deprecation Error: Kokkos::deep_copy extents of views don't match: MV::DualView(0) (1)
2:  Traceback functionality not available
2:
2: terminate called after throwing an instance of 'std::runtime_error'
2: terminate called after throwing an instance of '  what():  /ascldap/users/mperego/Workspace/trilinos/sources/trilinos-src-devel/packages/teuchos/comm/src/Teuchos_StackedTimer.cpp:57:
2:
2: Throw number = 1
2:
2: Throw test that evaluated to true: true
2:
2: Stopping timer Albany Total Time But top level running timer is Albany: Setup Time
2: terminate called after throwing an instance of 'std::runtime_error'
2: std::runtime_error'
2:   what():  [s1026095:241944] *** Process received signal ***
2:   what():  /ascldap/users/mperego/Workspace/trilinos/sources/trilinos-src-devel/packages/teuchos/comm/src/Teuchos_StackedTimer.cpp:57:
mperego commented 4 years ago

Here is debug output:

#1  0x00007fffd4f746eb in Kokkos::Impl::throw_runtime_exception (msg=...)
    at /ascldap/users/mperego/Workspace/trilinos/sources/trilinos-src-devel/packages/kokkos/core/src/impl/Kokkos_Error.cpp:71
#2  0x00007fffed3df68d in Kokkos::deep_copy<double*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u>, double const*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> >(Kokkos::View<double*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> > const&, Kokkos::View<double const*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> > const&, std::enable_if<(std::is_same<Kokkos::ViewTraits<double*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> >::specialize, void>::value&&std::is_same<Kokkos::ViewTraits<double const*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> >::specialize, void>::value)&&((((unsigned int)Kokkos::ViewTraits<double*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> >::rank)!=(0))||(((unsigned int)Kokkos::ViewTraits<double const*, Kokkos::LayoutLeft, Kokkos::Serial, Kokkos::MemoryTraits<1u> >::rank)!=(0))), void>::type*) (dst=..., src=...)
    at /net/sherlock.sandia.gov/storage/fast/projects/sems/install/rhel7-x86_64/sems/compiler/gcc/9.2.0/base/include/c++/9.2.0/bits/basic_string.h:6563
#3  0x00007fffed3d483c in Albany::GenericSTKMeshStruct::computeField(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Teu---Type <return> to continue, or q <return> to quit---
chos::ParameterList const&, Teuchos::RCP<Thyra::MultiVectorBase<double> >&, Teuchos::RCP<Thyra::VectorSpaceBase<double> const> const&, std::vector<stk::mesh::Entity, std::allocator<stk::mesh::Entity> > const&, bool, bool, bool, Teuchos::RCP<Teuchos::basic_FancyOStream<char, std::char_traits<char> > >) ()
    at /ascldap/users/mperego/Workspace/albany/sources/albany-src/src/disc/stk/Albany_GenericSTKMeshStruct.cpp:1420
mperego commented 4 years ago

Here's line 1420:

Kokkos::deep_copy(getNonconstDeviceData(field_mv->col(idim)),result_view_1d);
bartgol commented 4 years ago

Interesting. I wonder why I don't get the error on my laptop. I will double check. (next week, I'm off tomorrow).

bartgol commented 3 years ago

@mperego I think we fixed this, right?

mperego commented 3 years ago

@bartgol yes. thanks for pointing this out. Closing.