oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

[SMPI] Bad process data mapping #13

Closed mpoquet closed 5 years ago

mpoquet commented 7 years ago

The SMPI data mapping does not seem to work at all in Batsim.

How does it work currently?

Current SMPI code in Batsim is mostly based on the smpi_replay_multiple SimGrid example.

In SMPI, data is stored in these global variables:

What's the problem?

Some assumptions made in this example do NOT hold in Batsim:

Since this is not the case in Batsim, even the simplest SMPI Batsim example corrupts memory: the wrong process_data entries are read and written, which produces a "double free or corruption" error at the end of the SMPI job, although it should have crashed earlier.

Reproducing the error

Batsim version

cd ${BATSIM_ROOT_DIR}
git checkout e4edb7610a2d

Building Batsim

cd ${BATSIM_ROOT_DIR}
mkdir build && cd build
cmake ..
make

Generate the test

cd ${BATSIM_ROOT_DIR}/build
ctest -R smpi_batexec # Should crash with a beautiful "double free or corruption"

Make valgrind analyse the execution

cd /tmp/batsim_tests/smpi_batexec/results/filler_compute_small
sed -i 's/batsim \(.*\)/valgrind batsim -q \1/g' batsim_command.sh
./batsim_command.sh

Valgrind output

==2731== Memcheck, a memory error detector
==2731== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==2731== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==2731== Command: batsim -q -p /home/carni/proj/batsim/platforms/small_platform.xml -w /home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json -e /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/out -s /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/socket --mmax-workload --batexec
==2731== 
[0.000000] [batsim/INFO] Workload '2bf996' corresponds to workload file '/home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json'.
[0.000000] [batsim/INFO] The maximum number of machines to use is 4.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will be used.
[0.000000] [smpi_kernel/INFO] You did not set the power of the host running the simulation.  The timings will certainly not be accurate.  Use the option "--cfg=smpi/host-speed:<flops>" to set its value.Check http://simgrid.org/simgrid/latest/doc/options.html#options_smpi_bench for more information.
[0.000000] [batsim/INFO] Batsim's export prefix is '/tmp/batsim_tests/smpi_batexec/results/filler_compute_small/out'.
[0.000000] [batsim/INFO] The process 'workload_submitter_2bf996' has been created.
==2731== Invalid write of size 4
==2731==    at 0x51650A0: smpi_deployment_register_process (smpi_deployment.cpp:77)
==2731==    by 0x5165572: smpi_process_init (smpi_global.cpp:126)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid read of size 4
==2731==    at 0x5165855: smpi_process_remote_data (smpi_global.cpp:235)
==2731==    by 0x516557D: smpi_process_init (smpi_global.cpp:127)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid read of size 4
==2731==    at 0x5165B6B: smpi_process_mark_as_initialized (smpi_global.cpp:202)
==2731==    by 0x51929A0: smpi_replay_run (smpi_replay.cpp:948)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid read of size 4
==2731==    at 0x5165B95: smpi_process_mark_as_initialized (smpi_global.cpp:203)
==2731==    by 0x51929A0: smpi_replay_run (smpi_replay.cpp:948)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid read of size 4
==2731==    at 0x5165BE3: smpi_process_set_replaying (smpi_global.cpp:208)
==2731==    by 0x51929AA: smpi_replay_run (smpi_replay.cpp:949)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid read of size 4
==2731==    at 0x5165C10: smpi_process_set_replaying (smpi_global.cpp:209)
==2731==    by 0x51929AA: smpi_replay_run (smpi_replay.cpp:949)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid read of size 4
==2731==    at 0x51659F1: smpi_process_finalize (smpi_global.cpp:174)
==2731==    by 0x51932FD: smpi_replay_run (smpi_replay.cpp:1040)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259cc is 4 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
[Bourassa:1_0:(3) 20621958261.156479] [smpi_replay/INFO] Simulation time 20621958261.156479
==2731== Invalid read of size 4
==2731==    at 0x51658C3: smpi_process_destroy (smpi_global.cpp:161)
==2731==    by 0x5193325: smpi_replay_run (smpi_replay.cpp:1044)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731==  Address 0xc1259c8 is 0 bytes after a block of size 8 alloc'd
==2731==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5165476: xbt_malloc (sysdep.h:85)
==2731==    by 0x5165476: smpi_process_init (smpi_global.cpp:114)
==2731==    by 0x519299B: smpi_replay_run (smpi_replay.cpp:947)
==2731==    by 0x5AC3E1: smpi_replay_process(int, char**) (jobs_execution.cpp:28)
==2731==    by 0x4FC168C: simgrid::xbt::MainFunction<int (*)(int, char**)>::operator()() const (functional.hpp:48)
==2731==    by 0x4FC125C: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1740)
==2731==    by 0x500DB2D: std::function<void ()>::operator()() const (functional:2136)
==2731==    by 0x500DAE8: simgrid::kernel::context::Context::operator()() (Context.hpp:94)
==2731==    by 0x500CD6C: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==2731== 
==2731== Invalid free() / delete / delete[] / realloc()
==2731==    at 0x4C2C20A: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x50109AB: std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >::~pair() (stl_pair.h:147)
==2731==    by 0x5010978: void __gnu_cxx::new_allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > >(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >*) (new_allocator.h:124)
==2731==    by 0x5010937: void std::allocator_traits<std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > > >::destroy<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > >(std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > >&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >*) (alloc_traits.h:467)
==2731==    by 0x5010474: std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >, true> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >, true>*) (hashtable_policy.h:1971)
==2731==    by 0x5015D64: std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >, true> > >::_M_deallocate_nodes(std::__detail::_Hash_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >, true>*) (hashtable_policy.h:1984)
==2731==    by 0x5015CB4: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() (hashtable.h:1901)
==2731==    by 0x5015C3B: std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() (hashtable.h:1227)
==2731==    by 0x5015B64: std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::function<std::function<void ()> (std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >)> > > >::~unordered_map() (unordered_map.h:98)
==2731==    by 0x5015AD6: simgrid::simix::Global::~Global() (smx_private.h:45)
==2731==    by 0x501599A: std::default_delete<simgrid::simix::Global>::operator()(simgrid::simix::Global*) const (unique_ptr.h:76)
==2731==    by 0x50166CB: std::unique_ptr<simgrid::simix::Global, std::default_delete<simgrid::simix::Global> >::reset(simgrid::simix::Global*) (unique_ptr.h:344)
==2731==  Address 0xbf73ce0 is 96 bytes inside a block of size 216 alloc'd
==2731==    at 0x4C2B1EC: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2731==    by 0x5011025: SIMIX_global_init (smx_global.cpp:201)
==2731==    by 0x4FB38DE: MSG_init_nocheck (msg_global.cpp:53)
==2731==    by 0x56B635: initialize_msg(MainArguments const&, int, char**) (batsim.cpp:413)
==2731==    by 0x56C30C: main (batsim.cpp:535)
==2731== 
==2731== 
==2731== HEAP SUMMARY:
==2731==     in use at exit: 7,547 bytes in 135 blocks
==2731==   total heap usage: 39,247 allocs, 39,113 frees, 43,983,820 bytes allocated
==2731== 
==2731== LEAK SUMMARY:
==2731==    definitely lost: 2,584 bytes in 81 blocks
==2731==    indirectly lost: 1,600 bytes in 50 blocks
==2731==      possibly lost: 0 bytes in 0 blocks
==2731==    still reachable: 3,363 bytes in 4 blocks
==2731==         suppressed: 0 bytes in 0 blocks
==2731== Rerun with --leak-check=full to see details of leaked memory
==2731== 
==2731== For counts of detected and suppressed errors, rerun with: -v
==2731== ERROR SUMMARY: 17 errors from 9 contexts (suppressed: 0 from 0)

Debugging this

Instead of running valgrind, gdb can be run (or your preferred gdb interface):

cd /tmp/batsim_tests/smpi_batexec/results/filler_compute_small
sed -i 's/batsim \(.*\)/gdb --args batsim -q \1/g' batsim_command.sh
./batsim_command.sh

Some useful gdb breakpoints:

break workload.cpp:'Workload::register_smpi_applications'
break jobs_execution.cpp:smpi_replay_process

break smpi_global.cpp:smpi_global_init
break smpi_replay.cpp:smpi_replay_run
break smpi_deployment.cpp:SMPI_app_instance_register

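When executing this, we can see that the index of the first SMPI process is 2, whereas process_data has size 2, so the only valid indices are 0 and 1. This is consistent with the Valgrind report above, where the invalid accesses land just past an 8-byte block.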

(gdb) start
Temporary breakpoint 1 at 0x56c2a2: file /home/carni/proj/batsim/src/batsim.cpp, line 527.
Starting program: /usr/bin/batsim -q -p /home/carni/proj/batsim/platforms/small_platform.xml -w /home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json -e /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/out -s /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/socket --mmax-workload --batexec
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Temporary breakpoint 1, main (argc=12, argv=0x7fffffffdcc8) at /home/carni/proj/batsim/src/batsim.cpp:527
527     MainArguments main_args;

(gdb) break smpi_global.cpp:smpi_global_init
Breakpoint 2 at 0x7ffff7a4e694: file /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp, line 458.

(gdb) break smpi_replay.cpp:smpi_replay_run
Breakpoint 3 at 0x7ffff7a7a989: file /home/carni/proj/simgrid-martin/src/smpi/smpi_replay.cpp, line 947.

(gdb) continue
Continuing.
[0.000000] [batsim/INFO] Workload '2bf996' corresponds to workload file '/home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json'.
[0.000000] [batsim/INFO] The maximum number of machines to use is 4.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will be used.

Breakpoint 2, smpi_global_init () at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:458
458   int smpirun=0;
(gdb) until 561
smpi_global_init () at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:561
561   for (i = 0; i < process_count; i++) {

(gdb) p process_count
$1 = 2

(gdb) continue
Continuing.
[0.000000] [smpi_kernel/INFO] You did not set the power of the host running the simulation.  The timings will certainly not be accurate.  Use the option "--cfg=smpi/host-speed:<flops>" to set its value.Check http://simgrid.org/simgrid/latest/doc/options.html#options_smpi_bench for more information.

Breakpoint 3, smpi_replay_run (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_replay.cpp:947
warning: Source file is more recent than executable.
947   smpi_process_init(argc, argv);

(gdb) step
smpi_process_init (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:102
102   if (process_data == nullptr){

(gdb) until 113
smpi_process_init (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:113
113     if(index_to_process_data == nullptr){

(gdb) p index
$2 = 2
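
The mismatch seen in this session (a table sized for 2 entries, accessed with index 2) can be sketched in isolation. The following is only an illustration under stated assumptions: `IndexTable` and `bytes_past_end` are hypothetical names, not SimGrid code, and the 4-byte entry size is taken from the "size 4" accesses in the Valgrind output. A checked lookup turns the silent heap overrun into an immediate error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an index table sized from a process count.
// This is NOT SimGrid code; it only mirrors the sizes seen in the gdb session.
class IndexTable {
public:
    explicit IndexTable(std::size_t process_count) : slots_(process_count, -1) {}

    // Checked access: refuses any index outside [0, size()).
    std::int32_t& at(std::size_t global_index) {
        if (global_index >= slots_.size()) {
            throw std::out_of_range(
                "process index " + std::to_string(global_index) +
                " does not fit in a table of size " + std::to_string(slots_.size()));
        }
        return slots_[global_index];
    }

    std::size_t size() const { return slots_.size(); }

private:
    std::vector<std::int32_t> slots_;
};

// Number of bytes past the end of the table at which an unchecked access
// with an out-of-range index would land (assuming 4-byte entries).
std::size_t bytes_past_end(std::size_t table_size, std::size_t global_index) {
    assert(global_index >= table_size); // only meaningful for out-of-range indices
    return (global_index - table_size) * sizeof(std::int32_t);
}
```

With a table of size 2, `at(2)` throws, and `bytes_past_end(2, 2)` evaluates to 0, which lines up with the "0 bytes after a block of size 8" messages reported by Valgrind.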

mpoquet commented 7 years ago

I will try a workaround in which the index_to_process_data data structure is ignored and the SMPI process index is simply passed to the smpi_replay_run call.

Hacking argv could work, but it looks dirty to me as I don't fully understand how it is used. I think I will just duplicate some functions to add the extra arguments.
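
The idea of passing the SMPI index explicitly could look roughly like the sketch below. It is a minimal illustration, not the actual workaround: `ProcessData`, `SmpiStorage` and `replay_run_with_index` are hypothetical names, and the real SMPI structures hold far more state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-process record; the real SMPI structure holds much more.
struct ProcessData {
    bool initialized = false;
    bool replaying = false;
};

// Hypothetical dense storage with exactly one entry per SMPI process.
struct SmpiStorage {
    std::vector<ProcessData> data;
    explicit SmpiStorage(std::size_t smpi_process_count) : data(smpi_process_count) {}
};

// Variant of a "replay run" entry point that receives the SMPI index as an
// explicit argument instead of deriving it from a global process index.
void replay_run_with_index(SmpiStorage& storage, std::size_t smpi_index) {
    assert(smpi_index < storage.data.size()); // the caller must pass a dense index
    storage.data[smpi_index].initialized = true;
    storage.data[smpi_index].replaying = true;
}
```

The point of this design is that the caller, which knows how many SMPI processes a job has, chooses indices in the range 0 to N-1, so no translation table is needed.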

mpoquet commented 7 years ago

I confirm that SIMIX_process_count cannot be used as the number of SMPI processes. However, what this function returns (when called at the beginning of the first SMPI job) seems to be big enough to store all SMPI processes.

In example smpi, with workload compute2 (2 jobs, a total of 4 MPI executors), index_to_process_data is allocated with a size of 5 (too big, 4 expected). In example smpi_batexec and the same input workload, the size is 4 (ok, 4 expected).

With workload compute (2 MPI executors) and example smpi, the array size is 4 (too big, 2 expected). With the same workload but example smpi_batexec, the array size is 2 (ok, 2 expected).
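
The expected sizes quoted above are the total number of MPI executors over all jobs. A small sketch of that arithmetic, assuming a hypothetical `SmpiJob` type (not a Batsim type) that only records the number of executors per job:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical job description: only the number of MPI executors matters here.
struct SmpiJob {
    std::size_t nb_process;
};

// Total number of SMPI processes needed by a workload.
std::size_t expected_smpi_processes(const std::vector<SmpiJob>& jobs) {
    return std::accumulate(
        jobs.begin(), jobs.end(), std::size_t{0},
        [](std::size_t sum, const SmpiJob& job) { return sum + job.nb_process; });
}

// Number of unused slots when the table is sized from a larger global count.
std::size_t wasted_slots(std::size_t allocated, std::size_t expected) {
    assert(allocated >= expected); // the observed sizes were never too small
    return allocated - expected;
}
```

For the compute workload (one job with 2 executors) this gives 2, so an array of size 4 wastes 2 slots. For compute2, how the 4 executors are split between the 2 jobs is not stated above; assuming 2 per job, the total is 4 and an array of size 5 wastes 1 slot.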

mpoquet commented 7 years ago

Tried to enable SimGrid's option smpi/privatize-global-variables, as it seems cleaner than the hacky index_to_process_data.

XBT_INFO("SMPI will be used.");
MSG_config("smpi/privatize-global-variables", "1");
context.workloads.register_smpi_applications(); // todo: SMPI workflows
SMPI_init();

Unfortunately, when this option is enabled, executing smpi_batexec with the compute workload segfaults. The crash happens in SMPI_switch_data_segment, just after the index is retrieved from segment_index (which seems OK, as index got the value 0).

(gdb) run
Starting program: /usr/bin/batsim -p /home/carni/proj/batsim/platforms/small_platform.xml -w /home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json -e /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/out -s /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/socket --mmax-workload --batexec
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[0.000000] [batsim/INFO] Workload '2bf996' corresponds to workload file '/home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json'.
[0.000000] [workload/INFO] Loading JSON workload '/home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json'...
[0.000000] [profiles/INFO] base_dir = '/home/carni/proj/batsim/workload_profiles'
[0.000000] [profiles/INFO] Filenames of profile '1': [/home/carni/proj/batsim/workload_profiles/smpi/compute_only/actions0.txt, /home/carni/proj/batsim/workload_profiles/smpi/compute_only/actions1.txt]
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 1 jobs and 1 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] The maximum number of machines to use is 4.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will be used.
[0.000000] [workload/INFO] Registering SMPI applications of workload '2bf996'...
[0.000000] [workload/INFO] Registering app. instance='1', nb_process=2
[0.000000] [workload/INFO] SMPI applications of workload '2bf996' have been registered.
[0.000000] [smpi_kernel/INFO] You did not set the power of the host running the simulation.  The timings will certainly not be accurate.  Use the option "--cfg=smpi/host-speed:<flops>" to set its value.Check http://simgrid.org/simgrid/latest/doc/options.html#options_smpi_bench for more information.
[0.000000] [machines/INFO] Creating the machines from platform file '/home/carni/proj/batsim/platforms/small_platform.xml'...
[0.000000] [machines/INFO] The name of the master host is 'master_host'
[0.000000] [machines/INFO] The name of the parallel file system host is 'pfs_host'
[0.000000] [machines/INFO] There is not Pfs_Host (parallel filesystem host).
[0.000000] [machines/INFO] The machines have been created successfully. There are 4 computing machines.
[Bourassa:1_0:(3) 0.000000] [jobs_execution/INFO] Launching smpi_replay_run

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7a4d54d in smpi_process_init (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:121
#2  0x00007ffff7a7a99c in smpi_replay_run (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_replay.cpp:947
#3  0x00000000005ac3f2 in smpi_replay_process (argc=5, argv=0x91dad0) at /home/carni/proj/batsim/src/jobs_execution.cpp:28
#4  0x00007ffff78a968d in simgrid::xbt::MainFunction<int (*)(int, char**)>::operator() (this=0x91d750) at /home/carni/proj/simgrid-martin/include/xbt/functional.hpp:48
#5  0x00007ffff78a925d in std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (__functor=...)
    at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../include/c++/6.2.1/functional:1740
#6  0x00007ffff78f5b2e in std::function<void ()>::operator()() const (this=0x91d968)
    at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../include/c++/6.2.1/functional:2136
#7  0x00007ffff78f5ae9 in simgrid::kernel::context::Context::operator() (this=0x91d960) at /home/carni/proj/simgrid-martin/src/kernel/context/Context.hpp:94
#8  0x00007ffff78f4d6d in simgrid::kernel::context::RawContext::wrapper (arg=0x91d960) at /home/carni/proj/simgrid-martin/src/kernel/context/ContextRaw.cpp:304
#9  0x0000000000000000 in ?? ()
(gdb) up 1
#1  0x00007ffff7a4d54d in smpi_process_init (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:121
121       SMPI_switch_data_segment(index);
(gdb) list
116 
117     if(smpi_privatize_global_variables){
118       /* Now using segment index of the process  */
119       index = proc->segment_index;
120       /* Done at the process creation */
121       SMPI_switch_data_segment(index);
122     }
123 
124     MPI_Comm* temp_comm_world;
125     msg_bar_t temp_bar;
(gdb) p index
$1 = 0

mpoquet commented 7 years ago

Opened simgrid/simgrid#129, which should be simpler to work on.

mpoquet commented 7 years ago

Using mpoquet/simgrid@587483ebe788 should make SMPI jobs work in Batsim. Unfortunately, this fork currently breaks other SimGrid usages.

mpoquet commented 7 years ago

Make sure to use -Denable_compile_optimizations=ON when you compile the SG fork. Otherwise, Batsim segfaults when applications are registered...

mickours commented 5 years ago

Seems to be working now (87939463fc) with the fix I proposed to SimGrid (https://framagit.org/simgrid/simgrid/commit/35a389f7c71363e88bc1d4537390305fc24a959b).

mickours commented 5 years ago

@mpoquet Are we closing this?

mpoquet commented 5 years ago

Hmm, this issue is old, and the description at the beginning is no longer accurate.

The SimGrid CI robots detected some errors in your patch. Maybe we should wait for validation from the SimGrid community?

mpoquet commented 5 years ago

I just ran valgrind on an SMPI example (simple-smpimixed-small-fcfs) with Batsim commit 548cd4b (SimGrid f9b70a2) and the memory errors seem to have disappeared. Some leaks remain, but that is a separate issue.

Should we close this issue?

mickours commented 5 years ago

Yes!