oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0
30 stars 15 forks source link

Pernicious occult bug from hell #21

Closed mpoquet closed 3 years ago

mpoquet commented 7 years ago

An evil bug is hidden somewhere and may appear from time to time to terrorize poor developers.

Good point: the bug seems deterministic!

Bad point: it may corrupt many things, including the stack and SimGrid memory, which leads to a lot of debugging fun. Notably:

Thought the problem was coming from the conversion of utf8 strings from boost::locale since 6283f88 solved some problems. Unfortunately the bug reappeared when I tried to implement the dynamic submission of jobs while redis is enabled.

This is probably some kind of buffer overflow but I could not find its source. So far I tried to:

In some cases valgrind showed problems when reading the message in server.cpp:

==27321== Invalid read of size 8
==27321==    at 0x4EEEA8E: SIMIX_process_self (ActorImpl.cpp:67)
==27321==    by 0x505433C: MSG_task_receive_ext_bounded (msg_gos.cpp:276)
==27321==    by 0x505498B: MSG_task_receive_ext (msg_gos.cpp:239)
==27321==    by 0x607FDD: server_process(int, char**) (server.cpp:79)
==27321==    by 0x4F5A46D: operator() (functional.hpp:48)
==27321==    by 0x4F5A46D: std::_Function_handler<void (), simgrid::xbt::MainFunction<int (*)(int, char**)> >::_M_invoke(std::_Any_data const&) (functional:1731)
==27321==    by 0x4EAD901: operator() (functional:2127)
==27321==    by 0x4EAD901: operator() (Context.hpp:94)
==27321==    by 0x4EAD901: simgrid::kernel::context::RawContext::wrapper(void*) (ContextRaw.cpp:304)
==27321==  Address 0xcfab960 is 48 bytes inside a block of size 80 alloc'd
==27321==    at 0x4C2B58F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27321==    by 0x4EA55C1: new_context (Context.hpp:75)
==27321==    by 0x4EA55C1: simgrid::kernel::context::RawContextFactory::create_context(std::function<void ()>, void (*)(simgrid::simix::ActorImpl*), simgrid::simix::ActorImpl*) (ContextRaw.cpp:298)
==27321==    by 0x4EA6237: SIMIX_context_new (Context.cpp:32)
==27321==    by 0x4EF3E34: SIMIX_process_create (ActorImpl.cpp:291)
==27321==    by 0x4EF457F: operator() (ActorImpl.cpp:1012)
==27321==    by 0x4EF457F: kernelImmediate (simix.hpp:64)
==27321==    by 0x4EF457F: simcall_process_create(char const*, std::function<void ()>, void*, simgrid::s4u::Host*, double, s_xbt_dict*, int) (ActorImpl.cpp:1012)
==27321==    by 0x4F5F746: MSG_process_create_with_environment(char const*, std::function<void ()>, void*, simgrid::s4u::Host*, s_xbt_dict*) (msg_process.cpp:162)
==27321==    by 0x4F602BF: MSG_process_create_with_environment (msg_process.cpp:137)
==27321==    by 0x4F60EB0: MSG_process_create (msg_process.cpp:80)
==27321==    by 0x541493: start_initial_simulation_processes(MainArguments const&, BatsimContext*, bool) (batsim.cpp:559)
==27321==    by 0x541D5E: main (batsim.cpp:634)
mpoquet commented 7 years ago

Branch occult_bug_from_hell created to investigate this issue. Gitlab logs show some information about runs (after commit 1c69ac6). The dynamic_submit test reproduces the bug.

mpoquet commented 7 years ago

All the tests pass when executed from a manually installed docker environment:

However, they do not pass when the docker environment is automatically set by gitlab-ci... They do not pass either directly on my machine (without docker environment).

Any idea about the differences between a gitlab-ci docker environment and a manually installed one @mickours ?

mickours commented 7 years ago

Any idea about the differences between a gitlab-ci docker environment and a manually installed one @mickours ?

It should not have difference between manual and CI docker run.

I really don't understand this one we have to investigate...

mpoquet commented 7 years ago

Simplified the test and improved the submitter scheduler. Now (bdea7ad):

The instance work when the number of dynamic jobs to submit is 0. The instance crashes when the number of dynamic jobs is 1 or more.

Batsim logs of these two instances : https://gist.github.com/mpoquet/a9bc8236f392fb262c60b2e109dda114.

mpoquet commented 7 years ago

Some memory corruption seems to occur when Profile::from_json is called on server.cpp:285. Quick link to file: https://github.com/oar-team/batsim/blob/bdea7ad9ffef09c45750ee9345f51aa439fac580/src/server.cpp

GDB setup

(gdb) break server.cpp:250
(gdb) break server.cpp:285
(gdb) break server.cpp:288

(gdb) run
[...]

Thread 1 "batsim" hit Breakpoint 1, server_process (argc=0, argv=0xa938b0) at /home/carni/proj/batsim/src/server.cpp:252
(gdb) display workload
1: workload = (Workload *) 0xa8f670
(gdb) display *workload
2: *workload = {name = "dynamic", jobs = 0xbe7ad0, profiles = 0x9b5810}
(gdb) continue
Continuing.
[master_host:server:(2) 0.000000] [server/INFO] Parsing user-submitted job dynamic!0
[master_host:server:(2) 0.000000] /home/carni/proj/batsim/src/jobs.cpp:294: [jobs/DEBUG] Loaded job 0 from workload dynamic
[master_host:server:(2) 0.000000] [server/INFO] The profile of user-submitted job 'bouh' does not exist yet.

Thread 1 "batsim" hit Breakpoint 2, server_process (argc=0, argv=0xa938b0) at /home/carni/proj/batsim/src/server.cpp:285
1: workload = (Workload *) 0xa8f670
2: *workload = {name = "dynamic", jobs = 0xbe7ad0, profiles = 0x9b5810}
(gdb) continue
Continuing.

Thread 1 "batsim" hit Breakpoint 3, server_process (argc=0, argv=0xa938b0) at /home/carni/proj/batsim/src/server.cpp:288
1: workload = (Workload *) 0xa8f670
2: *workload = {name = <incomplete sequence \360\215\252>, jobs = 0x1, profiles = 0xe117e72d90129500}

Incriminated lines of codes

Profile * profile = Profile::from_json(job->profile,
                                       message->job_profile_description,
                                       "Invalid JSON profile received from the scheduler");
mpoquet commented 7 years ago

Investigated with @adfaure this afternoon.

We played with batsim parameters (platform, whether to enable energy...) then got lost in satanic lands.

The problem does not not occur when the platform has no energy information and when energy is disabled.

Reproduce

As 7d2fda1: ctest -R dynamic

[...]
2017-05-24 20:18:34,861 INFO: Number of successfully executed instances: 11
2017-05-24 20:18:34,862 WARNING: Number of skipped instances: 4
2017-05-24 20:18:34,862 WARNING: Information about these instances can be found in file /tmp/batsim_tests/dynamic_submit/instances/instances_info.csv
[...]

Analyzing data

library(dplyr)
instances = read.csv("/tmp/batsim_tests/dynamic_submit/instances/instances_info.csv")

# Remove garbage
instances = instances %>% select(-platform__master_node, -platform__filename,
                                 -instance_number, -instance_name, -instance_id,
                                 -explicit, -algo__name, -algo__sched_name,
                                 -energy__option)

# Sort
instances = instances %>% arrange(platform__name, nb_dyn_jobs)

works = instances %>% filter(status == 'done')
crashes = instances %>% filter(status != 'done')

print(works)

print(crashes)

Works:

   dynamic_submit_ack energy__name nb_dyn_jobs platform__name redis_enabled
1                true     disabled          10          small         false
2                true     disabled           1        cluster         false
3                true     disabled        1000          small         false
4                true     disabled        1000        cluster         false
5                true     disabled          50        cluster         false
6                true     disabled           0        cluster         false
7                true     disabled          10        cluster         false
8                true     disabled           0          homo1         false
9                true     disabled           1          small         false
10               true     disabled           0          small         false
11               true     disabled          50          small         false
   status
1    done
2    done
3    done
4    done
5    done
6    done
7    done
8    done
9    done
10   done
11   done

Crashes:

  dynamic_submit_ack energy__name nb_dyn_jobs platform__name redis_enabled
1               true     disabled           1          homo1         false
2               true     disabled          10          homo1         false
3               true     disabled          50          homo1         false
4               true     disabled        1000          homo1         false
   status
1 skipped
2 skipped
3 skipped
4 skipped
mpoquet commented 7 years ago

As commit 79aab79, dynamic job submissions seems to work, as long the energy platform is not used!

Working instances

   dynamic_submit_ack energy__name instance_id nb_dyn_jobs platform__name
1               false     disabled    514e656c           0        cluster
2               false     disabled    94c8bfc5           0        cluster
3                true     disabled    fc19341d           0        cluster
4                true     disabled    aaad1bd1           0        cluster
5               false     disabled    2ae757bc           1        cluster
6               false     disabled    31a26a10           1        cluster
7                true     disabled    3aed2a93           1        cluster
8                true     disabled    cc53682a           1        cluster
9               false     disabled    e25fd0e9          10        cluster
10              false     disabled    5097e220          10        cluster
11               true     disabled    3a62e530          10        cluster
12               true     disabled    4ffe31cb          10        cluster
13              false     disabled    79ac6c1b          50        cluster
14              false     disabled    66f5919d          50        cluster
15               true     disabled    366f2e31          50        cluster
16               true     disabled    525e0cc0          50        cluster
17              false     disabled    2337aa7b        1000        cluster
18              false     disabled    4c050031        1000        cluster
19               true     disabled    f8365fe6        1000        cluster
20               true     disabled    3f58b8f6        1000        cluster
21              false     disabled    433363a3           0          homo1
22              false     disabled    1f29e86d           0          homo1
23               true     disabled    8b42dba3           0          homo1
24               true     disabled    a0fc1585           0          homo1
25              false     disabled    6761908a           0          small
26              false     disabled    a3c9f15c           0          small
27               true     disabled    356cc2ad           0          small
28               true     disabled    f120324e           0          small
29              false     disabled    c6dad4c0           1          small
30              false     disabled    86498de1           1          small
31               true     disabled    a881961c           1          small
32               true     disabled    a0d125f9           1          small
33              false     disabled    771afe44          10          small
34              false     disabled    c8b6eaf6          10          small
35               true     disabled    3160af01          10          small
36               true     disabled    8962cf01          10          small
37              false     disabled    740a1d19          50          small
38              false     disabled    808b7b57          50          small
39               true     disabled    ef5553bd          50          small
40               true     disabled    c32c90f1          50          small
41              false     disabled    d9e2f16c        1000          small
42              false     disabled    d4343616        1000          small
43               true     disabled    a13f448b        1000          small
44               true     disabled    11a902e2        1000          small
   redis_enabled status
1          false   done
2           true   done
3          false   done
4           true   done
5          false   done
6           true   done
7          false   done
8           true   done
9          false   done
10          true   done
11         false   done
12          true   done
13         false   done
14          true   done
15         false   done
16          true   done
17         false   done
18          true   done
19         false   done
20          true   done
21         false   done
22          true   done
23         false   done
24          true   done
25         false   done
26          true   done
27         false   done
28          true   done
29         false   done
30          true   done
31         false   done
32          true   done
33         false   done
34          true   done
35         false   done
36          true   done
37         false   done
38          true   done
39         false   done
40          true   done
41         false   done
42          true   done
43         false   done
44          true   done

Crashing instances

   dynamic_submit_ack energy__name instance_id nb_dyn_jobs platform__name
1               false     disabled    e7f4f0e4           1          homo1
2               false     disabled    7befe2da           1          homo1
3                true     disabled    a864333e           1          homo1
4                true     disabled    19044c63           1          homo1
5               false     disabled    d1fd3b25          10          homo1
6               false     disabled    51400073          10          homo1
7                true     disabled    d838a772          10          homo1
8                true     disabled    09ba384f          10          homo1
9               false     disabled    c83ce282          50          homo1
10              false     disabled    193b91ed          50          homo1
11               true     disabled    05e00e14          50          homo1
12               true     disabled    e7a0343a          50          homo1
13              false     disabled    91ba7ab2        1000          homo1
14              false     disabled    4d3f1a70        1000          homo1
15               true     disabled    59934f75        1000          homo1
16               true     disabled    693cf74a        1000          homo1
   redis_enabled  status
1          false skipped
2           true skipped
3          false skipped
4           true skipped
5          false skipped
6           true skipped
7          false skipped
8           true skipped
9          false skipped
10          true skipped
11         false skipped
12          true skipped
13         false skipped
14          true skipped
15         false skipped
16          true skipped
mpoquet commented 7 years ago

Added a guard in 6a46cb3: enabling dynamic submissions with an energy platform is now forbidden.

mickours commented 5 years ago

now that we have migrated to S4u, maybe we can try to enable this again?

mpoquet commented 5 years ago

Yep, we can try to remove the guard and see if the problem still happens (but it may be a bit hard to reproduce).

mpoquet commented 3 years ago

Closing this issue as this seems fixed.