Status: Closed (mpoquet closed this issue 5 years ago)
I will try a workaround in which the index_to_process_data data structure is ignored and the SMPI process index is simply passed to the smpi_replay_run call.
Hacking argv can be done, but it looks really dirty to me as I don't fully understand these arguments. I think I will just duplicate some functions to add the extra arguments.
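A minimal sketch of the "duplicate a function and add an argument" idea, not the actual SimGrid or Batsim code. `ProcessData` and `trace_for` are hypothetical names: the point is only that the caller passes the SMPI process index explicitly instead of relying on a lookup through a global table.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical per-process state; stands in for whatever SMPI keeps per rank.
struct ProcessData {
    std::string trace_file;
};

// Hypothetical duplicate of an entry point, with one extra argument:
// the SMPI process index is given by the caller instead of being
// derived from a global index-to-data mapping.
const std::string& trace_for(const std::vector<ProcessData>& data,
                             std::size_t smpi_index) {
    // at() throws std::out_of_range instead of reading past the end.
    return data.at(smpi_index).trace_file;
}
```

With two entries (for instance actions0.txt and actions1.txt), `trace_for(data, 1)` returns the second trace file, and an index of 2 raises an exception rather than silently touching unrelated memory.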
I confirm that SIMIX_process_count cannot be used as the number of SMPI processes. However, what this function returns (when called at the beginning of the first SMPI job) seems to be big enough to store all SMPI processes.
In example smpi, with workload compute2 (2 jobs, a total of 4 MPI executors), index_to_process_data is allocated with a size of 5 (too big, 4 expected). In example smpi_batexec and the same input workload, the size is 4 (ok, 4 expected).
With workload compute (2 MPI executors) and example smpi, the array size is 4 (too big, 2 expected). With the same workload but example smpi_batexec, the array size is 2 (ok, 2 expected).
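The expected sizes above are simply the sum of MPI executors over all jobs. A small sketch of that count, assuming a hypothetical `JobDesc` type (Batsim's real job structures are different) and assuming the 4 executors of compute2 are split as 2 + 2, which the measurements above do not state:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical job description: only the number of MPI executors matters here.
struct JobDesc {
    std::size_t nb_executors;
};

// Number of SMPI processes = sum of the executors of every job.
// This avoids using a global process count, which also includes
// non-SMPI processes and can therefore be too large.
std::size_t count_smpi_processes(const std::vector<JobDesc>& jobs) {
    return std::accumulate(
        jobs.begin(), jobs.end(), std::size_t{0},
        [](std::size_t acc, const JobDesc& job) { return acc + job.nb_executors; });
}
```

Under these assumptions, two jobs of 2 executors each give 4, which matches the expected size for compute2.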
I tried enabling SimGrid's smpi/privatize-global-variables option, as it seems cleaner than the hacky index_to_process_data.
XBT_INFO("SMPI will be used.");
MSG_config("smpi/privatize-global-variables", "1");
context.workloads.register_smpi_applications(); // todo: SMPI workflows
SMPI_init();
Unfortunately, when this option is enabled, executing smpi_batexec with the compute workload segfaults. This happens when SMPI_switch_data_segment is called, just after the index is retrieved from segment_index (which seems OK, as index has value 0).
(gdb) run
Starting program: /usr/bin/batsim -p /home/carni/proj/batsim/platforms/small_platform.xml -w /home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json -e /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/out -s /tmp/batsim_tests/smpi_batexec/results/filler_compute_small/socket --mmax-workload --batexec
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[0.000000] [batsim/INFO] Workload '2bf996' corresponds to workload file '/home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json'.
[0.000000] [workload/INFO] Loading JSON workload '/home/carni/proj/batsim/workload_profiles/test_smpi_compute_only.json'...
[0.000000] [profiles/INFO] base_dir = '/home/carni/proj/batsim/workload_profiles'
[0.000000] [profiles/INFO] Filenames of profile '1': [/home/carni/proj/batsim/workload_profiles/smpi/compute_only/actions0.txt, /home/carni/proj/batsim/workload_profiles/smpi/compute_only/actions1.txt]
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 1 jobs and 1 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] The maximum number of machines to use is 4.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will be used.
[0.000000] [workload/INFO] Registering SMPI applications of workload '2bf996'...
[0.000000] [workload/INFO] Registering app. instance='1', nb_process=2
[0.000000] [workload/INFO] SMPI applications of workload '2bf996' have been registered.
[0.000000] [smpi_kernel/INFO] You did not set the power of the host running the simulation. The timings will certainly not be accurate. Use the option "--cfg=smpi/host-speed:<flops>" to set its value.Check http://simgrid.org/simgrid/latest/doc/options.html#options_smpi_bench for more information.
[0.000000] [machines/INFO] Creating the machines from platform file '/home/carni/proj/batsim/platforms/small_platform.xml'...
[0.000000] [machines/INFO] The name of the master host is 'master_host'
[0.000000] [machines/INFO] The name of the parallel file system host is 'pfs_host'
[0.000000] [machines/INFO] There is not Pfs_Host (parallel filesystem host).
[0.000000] [machines/INFO] The machines have been created successfully. There are 4 computing machines.
[Bourassa:1_0:(3) 0.000000] [jobs_execution/INFO] Launching smpi_replay_run
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7a4d54d in smpi_process_init (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:121
#2 0x00007ffff7a7a99c in smpi_replay_run (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_replay.cpp:947
#3 0x00000000005ac3f2 in smpi_replay_process (argc=5, argv=0x91dad0) at /home/carni/proj/batsim/src/jobs_execution.cpp:28
#4 0x00007ffff78a968d in simgrid::xbt::MainFunction<int (*)(int, char**)>::operator() (this=0x91d750) at /home/carni/proj/simgrid-martin/include/xbt/functional.hpp:48
#5 0x00007ffff78a925d in std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (__functor=...)
at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../include/c++/6.2.1/functional:1740
#6 0x00007ffff78f5b2e in std::function<void ()>::operator()() const (this=0x91d968)
at /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/6.2.1/../../../../include/c++/6.2.1/functional:2136
#7 0x00007ffff78f5ae9 in simgrid::kernel::context::Context::operator() (this=0x91d960) at /home/carni/proj/simgrid-martin/src/kernel/context/Context.hpp:94
#8 0x00007ffff78f4d6d in simgrid::kernel::context::RawContext::wrapper (arg=0x91d960) at /home/carni/proj/simgrid-martin/src/kernel/context/ContextRaw.cpp:304
#9 0x0000000000000000 in ?? ()
(gdb) up 1
#1 0x00007ffff7a4d54d in smpi_process_init (argc=0x7ffff06e4e5c, argv=0x7ffff06e4e50) at /home/carni/proj/simgrid-martin/src/smpi/smpi_global.cpp:121
121 SMPI_switch_data_segment(index);
(gdb) list
116
117 if(smpi_privatize_global_variables){
118 /* Now using segment index of the process */
119 index = proc->segment_index;
120 /* Done at the process creation */
121 SMPI_switch_data_segment(index);
122 }
123
124 MPI_Comm* temp_comm_world;
125 msg_bar_t temp_bar;
(gdb) p index
$1 = 0
Opened simgrid/simgrid#129, which should be simpler to work on.
Using mpoquet/simgrid@587483ebe788 should make SMPI jobs work in Batsim. Unfortunately, this fork currently breaks other SimGrid usages.
Make sure to use -Denable_compile_optimizations=ON when you compile the SG fork. Otherwise, Batsim segfaults when applications are registered...
Seems to be working now (87939463fc) with the fix I propose to simgrid (https://framagit.org/simgrid/simgrid/commit/35a389f7c71363e88bc1d4537390305fc24a959b).
@mpoquet Are we closing this?
Mmh this issue is old, the description at the beginning is wrong now.
The SimGrid CI robots detected some errors in your patch, maybe we should wait until validation from the SG community?
I just launched valgrind on an SMPI example (simple-smpimixed-small-fcfs) with Batsim commit 548cd4b (SimGrid f9b70a2) and the memory errors seem to have disappeared. Some leaks remain, but they are a different issue.
Should we close this issue?
Yes!
The SMPI data mapping does not seem to work at all in Batsim.
How does it work currently?
Current SMPI code in Batsim is mostly based on the smpi_replay_multiple SimGrid example.
In SMPI, data is stored in these global variables:
What's the problem?
Some assumptions made in this example do NOT hold in Batsim:
Since this is not the case in Batsim, even the simplest SMPI Batsim example corrupts memory: the wrong process_data entries are read and written, which produces a "double free or corruption" error at the end of the SMPI job, although it should have crashed earlier.
Reproducing the error
Batsim version
Building Batsim
Generate the test
Make valgrind analyse the execution
Valgrind output
Debugging this
Instead of running valgrind, gdb can be run (or your preferred gdb interface):
Some useful gdb breakpoints:
When executing this, we can see that the index of the first SMPI process is 2, whereas process_data has size 2.
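The mismatch can be illustrated with a small sketch: an array of size 2 only accepts indices 0 and 1, so a process index of 2 is already out of bounds. The helper below is hypothetical (`to_dense_index` is not a SimGrid or Batsim function) and assumes that SMPI processes have contiguous global indices starting at a known first index, which may not hold in practice.

```cpp
#include <cstddef>
#include <stdexcept>

// Translate a global process index into a dense index in [0, nb_smpi_processes).
// Assumption: SMPI processes occupy the contiguous global range
// [first_smpi_index, first_smpi_index + nb_smpi_processes).
std::size_t to_dense_index(std::size_t global_index,
                           std::size_t first_smpi_index,
                           std::size_t nb_smpi_processes) {
    if (global_index < first_smpi_index ||
        global_index - first_smpi_index >= nb_smpi_processes) {
        // Fail loudly instead of indexing outside the array.
        throw std::out_of_range("process index is not an SMPI process of this run");
    }
    return global_index - first_smpi_index;
}
```

With the values observed here (first SMPI process index 2, array size 2), global indices 2 and 3 would map to slots 0 and 1, and any other index would be rejected instead of corrupting memory.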