Closed mpoquet closed 3 years ago
Branch occult_bug_from_hell created to investigate this issue. Gitlab logs show some information about runs (after commit 1c69ac6). The dynamic_submit test reproduces the bug.
All the tests pass when executed from a manually installed docker environment:
However, they do not pass when the docker environment is automatically set by gitlab-ci... They do not pass either directly on my machine (without docker environment).
Any idea about the differences between a gitlab-ci docker environment and a manually installed one @mickours ?
Any idea about the differences between a gitlab-ci docker environment and a manually installed one @mickours ?
It should not have difference between manual and CI docker run.
I really don't understand this one we have to investigate...
Simplified the test and improved the submitter scheduler. Now (bdea7ad):
The instance work when the number of dynamic jobs to submit is 0. The instance crashes when the number of dynamic jobs is 1 or more.
Batsim logs of these two instances : https://gist.github.com/mpoquet/a9bc8236f392fb262c60b2e109dda114.
Some memory corruption seems to occur when Profile::from_json is called on server.cpp:285. Quick link to file: https://github.com/oar-team/batsim/blob/bdea7ad9ffef09c45750ee9345f51aa439fac580/src/server.cpp
(gdb) break server.cpp:250
(gdb) break server.cpp:285
(gdb) break server.cpp:288
(gdb) run
[...]
Thread 1 "batsim" hit Breakpoint 1, server_process (argc=0, argv=0xa938b0) at /home/carni/proj/batsim/src/server.cpp:252
(gdb) display workload
1: workload = (Workload *) 0xa8f670
(gdb) display *workload
2: *workload = {name = "dynamic", jobs = 0xbe7ad0, profiles = 0x9b5810}
(gdb) continue
Continuing.
[master_host:server:(2) 0.000000] [server/INFO] Parsing user-submitted job dynamic!0
[master_host:server:(2) 0.000000] /home/carni/proj/batsim/src/jobs.cpp:294: [jobs/DEBUG] Loaded job 0 from workload dynamic
[master_host:server:(2) 0.000000] [server/INFO] The profile of user-submitted job 'bouh' does not exist yet.
Thread 1 "batsim" hit Breakpoint 2, server_process (argc=0, argv=0xa938b0) at /home/carni/proj/batsim/src/server.cpp:285
1: workload = (Workload *) 0xa8f670
2: *workload = {name = "dynamic", jobs = 0xbe7ad0, profiles = 0x9b5810}
(gdb) continue
Continuing.
Thread 1 "batsim" hit Breakpoint 3, server_process (argc=0, argv=0xa938b0) at /home/carni/proj/batsim/src/server.cpp:288
1: workload = (Workload *) 0xa8f670
2: *workload = {name = <incomplete sequence \360\215\252>, jobs = 0x1, profiles = 0xe117e72d90129500}
Profile * profile = Profile::from_json(job->profile,
message->job_profile_description,
"Invalid JSON profile received from the scheduler");
Investigated with @adfaure this afternoon.
We played with batsim parameters (platform, whether to enable energy...) then got lost in satanic lands.
The problem does not not occur when the platform has no energy information and when energy is disabled.
As 7d2fda1: ctest -R dynamic
[...]
2017-05-24 20:18:34,861 INFO: Number of successfully executed instances: 11
2017-05-24 20:18:34,862 WARNING: Number of skipped instances: 4
2017-05-24 20:18:34,862 WARNING: Information about these instances can be found in file /tmp/batsim_tests/dynamic_submit/instances/instances_info.csv
[...]
library(dplyr)
instances = read.csv("/tmp/batsim_tests/dynamic_submit/instances/instances_info.csv")
# Remove garbage
instances = instances %>% select(-platform__master_node, -platform__filename,
-instance_number, -instance_name, -instance_id,
-explicit, -algo__name, -algo__sched_name,
-energy__option)
# Sort
instances = instances %>% arrange(platform__name, nb_dyn_jobs)
works = instances %>% filter(status == 'done')
crashes = instances %>% filter(status != 'done')
print(works)
print(crashes)
Works:
dynamic_submit_ack energy__name nb_dyn_jobs platform__name redis_enabled
1 true disabled 10 small false
2 true disabled 1 cluster false
3 true disabled 1000 small false
4 true disabled 1000 cluster false
5 true disabled 50 cluster false
6 true disabled 0 cluster false
7 true disabled 10 cluster false
8 true disabled 0 homo1 false
9 true disabled 1 small false
10 true disabled 0 small false
11 true disabled 50 small false
status
1 done
2 done
3 done
4 done
5 done
6 done
7 done
8 done
9 done
10 done
11 done
Crashes:
dynamic_submit_ack energy__name nb_dyn_jobs platform__name redis_enabled
1 true disabled 1 homo1 false
2 true disabled 10 homo1 false
3 true disabled 50 homo1 false
4 true disabled 1000 homo1 false
status
1 skipped
2 skipped
3 skipped
4 skipped
As commit 79aab79, dynamic job submissions seems to work, as long the energy platform is not used!
dynamic_submit_ack energy__name instance_id nb_dyn_jobs platform__name
1 false disabled 514e656c 0 cluster
2 false disabled 94c8bfc5 0 cluster
3 true disabled fc19341d 0 cluster
4 true disabled aaad1bd1 0 cluster
5 false disabled 2ae757bc 1 cluster
6 false disabled 31a26a10 1 cluster
7 true disabled 3aed2a93 1 cluster
8 true disabled cc53682a 1 cluster
9 false disabled e25fd0e9 10 cluster
10 false disabled 5097e220 10 cluster
11 true disabled 3a62e530 10 cluster
12 true disabled 4ffe31cb 10 cluster
13 false disabled 79ac6c1b 50 cluster
14 false disabled 66f5919d 50 cluster
15 true disabled 366f2e31 50 cluster
16 true disabled 525e0cc0 50 cluster
17 false disabled 2337aa7b 1000 cluster
18 false disabled 4c050031 1000 cluster
19 true disabled f8365fe6 1000 cluster
20 true disabled 3f58b8f6 1000 cluster
21 false disabled 433363a3 0 homo1
22 false disabled 1f29e86d 0 homo1
23 true disabled 8b42dba3 0 homo1
24 true disabled a0fc1585 0 homo1
25 false disabled 6761908a 0 small
26 false disabled a3c9f15c 0 small
27 true disabled 356cc2ad 0 small
28 true disabled f120324e 0 small
29 false disabled c6dad4c0 1 small
30 false disabled 86498de1 1 small
31 true disabled a881961c 1 small
32 true disabled a0d125f9 1 small
33 false disabled 771afe44 10 small
34 false disabled c8b6eaf6 10 small
35 true disabled 3160af01 10 small
36 true disabled 8962cf01 10 small
37 false disabled 740a1d19 50 small
38 false disabled 808b7b57 50 small
39 true disabled ef5553bd 50 small
40 true disabled c32c90f1 50 small
41 false disabled d9e2f16c 1000 small
42 false disabled d4343616 1000 small
43 true disabled a13f448b 1000 small
44 true disabled 11a902e2 1000 small
redis_enabled status
1 false done
2 true done
3 false done
4 true done
5 false done
6 true done
7 false done
8 true done
9 false done
10 true done
11 false done
12 true done
13 false done
14 true done
15 false done
16 true done
17 false done
18 true done
19 false done
20 true done
21 false done
22 true done
23 false done
24 true done
25 false done
26 true done
27 false done
28 true done
29 false done
30 true done
31 false done
32 true done
33 false done
34 true done
35 false done
36 true done
37 false done
38 true done
39 false done
40 true done
41 false done
42 true done
43 false done
44 true done
dynamic_submit_ack energy__name instance_id nb_dyn_jobs platform__name
1 false disabled e7f4f0e4 1 homo1
2 false disabled 7befe2da 1 homo1
3 true disabled a864333e 1 homo1
4 true disabled 19044c63 1 homo1
5 false disabled d1fd3b25 10 homo1
6 false disabled 51400073 10 homo1
7 true disabled d838a772 10 homo1
8 true disabled 09ba384f 10 homo1
9 false disabled c83ce282 50 homo1
10 false disabled 193b91ed 50 homo1
11 true disabled 05e00e14 50 homo1
12 true disabled e7a0343a 50 homo1
13 false disabled 91ba7ab2 1000 homo1
14 false disabled 4d3f1a70 1000 homo1
15 true disabled 59934f75 1000 homo1
16 true disabled 693cf74a 1000 homo1
redis_enabled status
1 false skipped
2 true skipped
3 false skipped
4 true skipped
5 false skipped
6 true skipped
7 false skipped
8 true skipped
9 false skipped
10 true skipped
11 false skipped
12 true skipped
13 false skipped
14 true skipped
15 false skipped
16 true skipped
Added a guard in 6a46cb3: enabling dynamic submissions with an energy platform is now forbidden.
now that we have migrated to S4u, maybe we can try to enable this again?
Yep, we can try to remove the guard and see if the problem still happens (but it may be a bit hard to reproduce).
Closing this issue as this seems fixed.
An evil bug is hidden somewhere and may appear from time to time to terrorize poor developers.
Good point: the bug seems deterministic!
Bad point: it may corrupt many things, including the stack and SimGrid memory, which leads to a lot of debugging fun. Notably:
Thought the problem was coming from the conversion of utf8 strings from boost::locale since 6283f88 solved some problems. Unfortunately the bug reappeared when I tried to implement the dynamic submission of jobs while redis is enabled.
This is probably some kind of buffer overflow but I could not find its source. So far I tried to:
In some cases valgrind showed problems when reading the message in server.cpp: