Open markcmiller86 opened 1 month ago
If I disable mfem, I get the same problem with conduit. The engine includes and links with conduit so any dependencies for conduit get linked into the engine.
@iulian787 and @vijaysm if I disable conduit, mfem and fsm in my build, I can get the engine to run and use MOAB in parallel correctly...
That is excellent news! Thanks for checking this thoroughly @markcmiller86
When you say MOAB parallel engine is working correctly, I assume this means there are no HDF5 property table errors that we were seeing before? I am still puzzled as to why the Ubuntu 22 version of engine_par is still failing (even though ldd verification showed that it was not linked against HDF5-serial).
> When you say MOAB parallel engine is working correctly, I assume this means there are no HDF5 property table errors that we were seeing before?
Yes, correct.
> I am still puzzled as to why the Ubuntu 22 version of engine_par is still failing (even though ldd verification showed that it was not linked against HDF5-serial).
I am too, but I believe Linux is more lenient about shared lib dependencies, and so I think the engine is still loading the serial hdf5 library. An strace would confirm that.
Was chatting with @brugger1 about this. One idea he has...build engine_par against hdf5_mpi instead of hdf5. That would fix the collision with the MOAB parallel database plugin.
But, it would mean we continue to link the engine (and mdserver) against hdf5, and that would mean nobody would be allowed to build a custom plugin using a different version of hdf5 than what VisIt was built with.
If VisIt was not on such an ancient version of hdf5, that is probably ok. But, because we are on such an ancient version of hdf5, it would likely conflict with any plugin developer using many of the newer features in newer versions of hdf5.
So, we would need to upgrade hdf5 in VisIt.
That said, I still think it would be best to get hdf5 out of the engine and mdserver if possible.
@markcmiller86 conduit and mfem were added as dependencies when avt/Blueprint and avt/MFEM were added. With those additions, we also get whatever dependencies conduit and mfem have.
Yes, that is right. So, it may mean we have to build those libs two different ways...one way for VisIt components and another way for database plugins...only the latter of which can depend on things like hdf5, netcdf, etc.
@markcmiller86 Both Conduit and MFEM are used outside of the database plugins. We have avt libs that provide MFEM to VTK and Conduit to VTK as a general service for the engine.
While Conduit's I/O that uses HDF5 does not need to link HDF5, I think that MFEM actually links Conduit relay, which does link HDF5.
In general, unless we have fully name-mangled serial hdf5 and mpi hdf5 libraries, I think we have an issue regardless of whether HDF5 is used only in DB plugins or also in the engine. Yes, we can disable plugins, but that approach won't work for an install that is meant to be widely used.
> Both Conduit and MFEM are used outside of the database plugins.
Sure...but I guess my question is...what do conduit and MFEM need to do in the way of I/O with HDF5 in the engine and mdserver? I don't think the engine or mdserver need to do any I/O with or without HDF5 and so the question remains...why do business this way? It can't be for the convenience of a 3rd party lib dependency that isn't designed to build without HDF5?
As an aside, as things are designed now, no one building a custom HDF5 plugin would be able to build against an hdf5 version other than 1.8.XX (1.8.14 is over 10 years old now). So, someone wishing to use newer features in HDF5 would simply not be able to use a newer HDF5. That particular issue is solved, of course, by updating to a newer HDF5 in VisIt. But, that only minimizes that particular issue. It doesn't fix it.
The answer is simple: MFEM does not have multiple libraries that partition features based on dependencies.
Those features are either on or off for a build of mfem.
Conduit has multiple libraries - (relay is the one with all the i/o deps), but if MFEM is using Conduit, it will also link those I/O libs.
We are going to explore whether building only an MPI-enabled HDF5 will work for all cases (engine_ser and engine_par).
Ok, I tried a simple test with my build of VisIt 3.4.1 on macOS where I have disabled MFEM and conduit. But, I do have things like Silo (which is using HDF5 in serial) and MOAB (using HDF5 in serial in a serial engine and in parallel in a parallel engine).
- ./bin/visit -np 4 (confirm I've got a parallel engine)
- Open multi_ucd3d.silo in the silo_hdf5_test_data directory (confirm it is indeed Silo/HDF5 data...it is)
- Plot ELEM_GLOBAL_ID...it plots fine...in particular, no hdf5 api trace errors on stdout/stderr...which we have seen when it has wound up confusing parallel HDF5 with serial HDF5
So, this works. And, that is because we DO NOT load plugin shared libraries using RTLD_GLOBAL, which would make the symbols from the loaded library visible in the global namespace of the calling executable. We DO NOT specify RTLD_LOCAL either, but it turns out if neither is specified, RTLD_LOCAL is the default behavior (on both macOS and Linux). See related ChatGPT discussion about this.
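To make the RTLD_LOCAL point concrete, here is a minimal sketch of opening a shared library with the flag spelled out. It is not VisIt's actual plugin loader; the helper name OpenPluginLocal is hypothetical, and only the POSIX calls dlopen and dlerror and the flags RTLD_NOW and RTLD_LOCAL are real.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper, not VisIt's actual plugin loader.
// Opens a shared library with RTLD_LOCAL spelled out, so its symbols are not
// made available for resolving references in libraries loaded later.
void *OpenPluginLocal(const char *path)
{
    // RTLD_NOW: resolve all symbols at load time so failures surface here.
    // RTLD_LOCAL: already the default when neither flag is given; stated
    // explicitly only to document the intent.
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr)
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return handle;
}
```

Passing a plugin path returns a handle to use with dlsym, or a null pointer on failure. On older glibc versions this may need -ldl at link time.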
Below, I use macOS lsof to report which libraries are actually loaded into the running engine_par process using one of the PIDs...You can see it has BOTH serial and parallel HDF5 libraries loaded.
ps -ef | grep engine_par
3640 89593 89592 0 4:24PM ?? 0:02.03 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
3640 89594 89592 0 4:24PM ?? 0:11.80 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
3640 89595 89592 0 4:24PM ?? 0:11.84 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
3640 89596 89592 0 4:24PM ?? 0:11.82 /Users/miller86/visit/visit/34rc/build/exe/engine_par -plugindir /Users/miller86/.visit/3.4.1/darwin-x86_64/plugins:/Users/miller86/visit/visit/34rc/build/plugins -visithome /Users/miller86/visit/visit/34rc/build -visitarchhome /Users/miller86/visit/visit/34rc/build/ -dv -host 127.0.0.1 -port 5600 -key 1859f91da393c08335dc
[scratlantis:5.5.0/darwin-x86_64/lib] miller86% lsof -p 89593 | grep hdf
engine_pa 89593 miller86 txt REG 1,4 3910464 1282989090 /Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/hdf5/1.8.14/darwin-x86_64/lib/libhdf5.9.dylib
engine_pa 89593 miller86 txt REG 1,4 5073584 1300738963 /Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/moab_mpi/5.5.0-hdf5-1.14.3/darwin-x86_64/lib/libMOAB_mpi.5.dylib
engine_pa 89593 miller86 txt REG 1,4 9282624 1300717741 /Users/miller86/visit/visit/34rc/release/build-mb-3.4.1-darwin-21-x86_64-release/thirdparty_shared/third_party/hdf5_mpi/1.14.3/darwin-x86_64/lib/libhdf5_mpi.310.dylib
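For the Ubuntu case, the same check can be made without lsof, since Linux exposes a process's mapped files in /proc/<pid>/maps. The following is a rough sketch under that assumption; the helper name MappedLibsMatching is hypothetical and is not part of VisIt.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: collect the distinct file paths in a Linux
// /proc/<pid>/maps stream whose path contains `needle` (e.g. "hdf5").
std::set<std::string> MappedLibsMatching(std::istream &maps,
                                         const std::string &needle)
{
    std::set<std::string> libs;
    std::string line;
    while (std::getline(maps, line))
    {
        // Each line is: address perms offset dev inode [pathname]
        std::istringstream fields(line);
        std::string addr, perms, offset, dev, inode, path;
        fields >> addr >> perms >> offset >> dev >> inode;
        std::getline(fields >> std::ws, path); // pathname may be absent
        if (!path.empty() && path.find(needle) != std::string::npos)
            libs.insert(path);
    }
    return libs;
}
```

Feeding it a std::ifstream opened on /proc/<pid>/maps for one of the engine_par PIDs, with "hdf5" as the needle, would list every HDF5 library mapped into that process. More than one distinct libhdf5 path in the result would indicate the same serial plus parallel mix shown in the lsof output.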
@iulian787 and @vijaysm one option we're considering here is to do away with serial/parallel builds of HDF5. We would build only parallel HDF5 and everything in VisIt that depended on HDF5 would be linked to that one, single parallel HDF5. The "serial" tools would have to be linked with -lmpi for example, but would, in theory anyway, never reference the MPI symbols in them.
What would you think of this?
@cyrush if we can do that with HDF5, why can't we do that with all of VisIt and get away from building all of VisIt with _par and _ser variants of everything? We just build everything parallel and agree we never reference the mpi symbols when running in serial?
> The "serial" tools would have to be linked with -lmpi for example, but would, in theory anyway, never reference the MPI symbols in them. What would you think of this?
Excellent! This was my suggestion a long time back. I do this all the time in my workflows, using MPI wrappers to build every library in my systems, whether it is serial code or MPI-aware.
Then we would just build MOAB+HDF5 without worrying about serial builds, with a guarantee that only MPI-aware HDF5 will ever be loaded by VisIt. It would also simplify builds in general and reduce distribution size :-)
> @cyrush if we can do that with HDF5, why can't we do that with all of VisIt and get away from building all of VisIt with _par and _ser variants of everything? We just build everything parallel and agree we never reference the mpi symbols when running in serial?
Hi Mark, In VisIt itself - we use a single source to produce both serial and MPI libs. Things aren't partitioned.
This is convenient for many filters that share logic and then add extra communication for the MPI case.
MPI support is controlled by compiler defines, which yield the serial and parallel libs. It would require refactoring and a runtime (instead of compile time) switch to be added.
It's possible to do, but would be a major change.
> MPI support is controlled by compiler defines, which yield the serial and parallel libs.
@cyrush...right...we simply adjust those #if PARALLEL code blocks to all be run-time conditionals and then we build only MPI-enabled object files and libs. This would probably halve our compile time, halve our distribution sizes and just generally simplify a lot of things in our CMake logic, build logic, plugin logic, etc.
Summarizing...
Ok, so inputs from @cyrush, @qkoziol, and @vijaysm all suggest the right way to proceed is to do away with building dependencies in different ways (e.g. with and without MPI) and just know that running a serial VisIt will never reference any MPI-enabled code blocks in VisIt itself or any dependencies. It means a serial VisIt engine is still linked with -lmpi, for example, to satisfy the linker.
Above, @cyrush mentioned another issue I hadn't really appreciated before digging into this in detail. In VisIt, we have a lot of code blocks of the form...
#if PARALLEL
// do something for parallel with MPI_Xxx() calls
#else
// do it the serial way
#endif
This really does mean you have to compile two different ways to get two different behaviors. To do the same thing in VisIt proper with the various libavtThisOrThat.so libraries as we are aiming to do with TPL libs, we would need all those code blocks to be chosen at run time, not compile time.
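As a rough illustration of what moving that choice to run time could look like, here is a minimal sketch. ParallelContext and GlobalSum are hypothetical names, not VisIt code, and the MPI layer is replaced by a stand-in callback so the example compiles without MPI. In a real MPI-enabled build the callback might wrap a call such as MPI_Allreduce.

```cpp
#include <functional>

// Hypothetical stand-in for the MPI layer (assumption, not VisIt code).
// `active` would be set once at startup, e.g. when the engine is launched
// in parallel; `allReduceSum` would wrap the real collective call.
struct ParallelContext
{
    bool active = false;
    std::function<double(double)> allReduceSum;
};

// One compiled body serves both modes: the run-time branch takes the
// place of the compile-time #if PARALLEL / #else / #endif block.
double GlobalSum(double localValue, const ParallelContext &ctx)
{
    if (ctx.active)
        return ctx.allReduceSum(localValue); // parallel path
    return localValue;                       // serial path, no communication
}
```

The trade-off is the one already noted in the thread: the object file always contains the parallel path, so even a serial executable has to link against the MPI library, although it never calls into it.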
We're kinda forced into this situation because libraries we want to use in VisIt proper such as MFEM have an indirect dependency on HDF5. So, the engine is going to wind up getting linked with an HDF5 regardless.
But, because VisIt's HDF5 is ancient (1.8.14), we really must upgrade that asap to the latest HDF5 on develop. We are ok leaving it at 1.8.14 on the RC.
Here is the work to complete for this ticket then...
- ... develop
- ... build_visit to build HDF5 and MOAB only once (with MPI if build_visit is building a parallel-enabled VisIt and without MPI if build_visit is building only a serial VisIt)
- ... build_visit TPLs that have followed this paradigm (e.g. ADIOS2)
I suggest moving all the way up to HDF5 1.14.4 🙂
Go to build dir for engine and do a make clean; make VERBOSE=1 >& junk.out and then grep junk.out for hdf5. You will get hits. But, you should NOT get hits for hdf5. hdf5 is used only in a database plugin lib. If I look at a link of the libengine_ser.dylib, I get all the items listed below.
We are also getting hits for conduit, conduit_relay and blueprint. See below. Again, those are only used in DB plugins and should not be linked into the engine.
I believe the reason this is happening is the use of MFEM in the engine. That is fine. But, we cannot use an MFEM in the engine that depends on I/O libs needed only in the plugins. We need to build MFEM differently for use in the engine.
Tagging @iulian787 and @vijaysm because this is impacting the MOAB plugin, which uses HDF5 in either serial or parallel, and the fact that the engine is loading serial hdf5 prevents the parallel MOAB plugin from operating correctly.