Closed adfaure closed 7 years ago
Cannot reproduce the issue.
Your scheduler does not seem to handle homogeneous MSG jobs; the provided workload is probably wrong.
Yaml to reproduce:
base_output_directory: /tmp/batsim_tests/issue36
base_variables:
  batsim_dir: ${base_working_directory}

implicit_instances:
  implicit:
    sweep:
      platform:
        - {"name":"cluster", "filename":"${batsim_dir}/platforms/cluster_issue36.xml", "master_host":"master_host0"}
      workload:
        - {"name":"issue36", "filename": "${batsim_dir}/workload_profiles/issue36.json"}
        #- {"name":"medium", "filename": "${batsim_dir}/workload_profiles/batsim_paper_workload_example.json"}
        #- {"name":"tiny", "filename": "${batsim_dir}/workload_profiles/one_delay_job.json"}
      algo:
        - {"name":"nodegrp", "algo_name":"nodegrp"}
    generic_instance:
      timeout: 60
      working_directory: ${base_working_directory}
      output_directory: ${base_output_directory}/results/${algo[name]}_${workload[name]}_${platform[name]}
      batsim_command: batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --config ${output_directory}/batsim.conf -m ${platform[master_host]}
      sched_command: cd /home/carni/proj/rust/schedulers && RUST_BACKTRACE=1 cargo run --bin ${algo[algo_name]}
      commands_before_execution:
        # Generate Batsim config file
        - |
          #!/usr/bin/env bash
          cat > ${output_directory}/batsim.conf << EOF
          {
            "job_submission": {
              "forward_profiles": true,
              "from_scheduler": {
                "enabled": true,
                "acknowledge": true
              }
            }
          }
          EOF

commands_before_instances:
  - ${batsim_dir}/test/is_batsim_dir.py ${base_working_directory}
  - ${batsim_dir}/test/clean_output_dir.py ${base_output_directory}
Output:
2017-06-07 18:48:34,176 INFO: Variables = {'algo': {'algo_name': 'nodegrp', 'name': 'nodegrp'}, 'base_output_directory': '/tmp/batsim_tests/issue36', 'base_working_directory': '/home/carni/proj/batsim', 'batsim_dir': '${base_working_directory}', 'instance_id': '402b9074', 'instance_number': 0, 'platform': {'filename': '${batsim_dir}/platforms/cluster_issue36.xml', 'master_host': 'master_host0', 'name': 'cluster'}, 'workload': {'filename': '${batsim_dir}/workload_profiles/issue36.json', 'name': 'issue36'}, 'working_directory': '/home/carni/proj/batsim', 'output_directory': '/tmp/batsim_tests/issue36/results/nodegrp_issue36_cluster'}
2017-06-07 18:48:34,176 INFO: Working directory: /home/carni/proj/batsim
2017-06-07 18:48:34,176 INFO: Output directory: /tmp/batsim_tests/issue36/results/nodegrp_issue36_cluster
2017-06-07 18:48:34,176 INFO: Executing command 'command0'
2017-06-07 18:48:34,184 INFO: command0 finished
2017-06-07 18:48:34,186 INFO: Batsim command: "batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --config ${output_directory}/batsim.conf -m ${platform[master_host]}"
2017-06-07 18:48:34,186 INFO: Sched command: "cd /home/carni/proj/rust/schedulers && RUST_BACKTRACE=1 cargo run --bin ${algo[algo_name]}"
2017-06-07 18:48:34,186 INFO: Waiting for socket 'tcp://localhost:28000' to be usable
2017-06-07 18:48:34,193 INFO: Socket tcp://localhost:28000 is now usable
2017-06-07 18:48:34,194 INFO: Running Batsim and Sched
2017-06-07 18:48:35,721 ERROR: Sched finished (returncode=101)
2017-06-07 18:48:35,727 ERROR: Sched stderr:
Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
Running `target/debug/nodegrp`
thread 'main' panicked at 'ErrorImpl { code: Message("missing field `delay`"), line: 1, column: 260 } full str: {"now":0.002400,"events":[{"timestamp":0.001800,"type":"JOB_SUBMITTED","data":{"job_id":"038390!35383","job":{"profile":"8","res":2,"id":"038390!35383","subtime":0,"walltime":1800.000000},"profile":{"com":0.000000,"type":"msg_par_hg","cpu":800000000.000000}}}]}', /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:430
stack backtrace:
0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
1: std::sys_common::backtrace::_print
2: std::panicking::default_hook::{{closure}}
3: std::panicking::default_hook
4: std::panicking::rust_panic_with_hook
5: std::panicking::begin_panic
6: std::panicking::begin_panic_fmt
7: batsim::batsim::read_batsim_message
at /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:430
8: batsim::batsim::Batsim::get_next_message
at /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:265
9: batsim::batsim::Batsim::run_simulation
at /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:297
10: nodegrp::main
at ./nodegroup/src/main.rs:21
11: std::panicking::try::do_call
12: __rust_maybe_catch_panic
13: std::rt::lang_start
14: main
15: __libc_start_main
16: _start
2017-06-07 18:48:35,738 ERROR: Killing remaining processes {1824, 1778, 1780, 1823}
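For context, the panic above is a deserialization failure in the scheduler, not a protocol problem: the attached profile is of type `msg_par_hg` (fields `com`/`cpu`), so a parser that unconditionally reads a `delay` field blows up. A minimal sketch of dispatching on the profile `type` before touching type-specific fields (Python for illustration only; the actual scheduler is Rust/serde, and the sample message below is adapted and simplified from the log):

```python
import json

# A JOB_SUBMITTED event shaped like the one in the log above, with a
# "msg_par_hg" profile (fields "com"/"cpu", no "delay" field).
message = '''{"now": 0.0024, "events": [{"timestamp": 0.0018,
"type": "JOB_SUBMITTED", "data": {"job_id": "w!1",
"job": {"profile": "8", "res": 2, "id": "w!1", "subtime": 0, "walltime": 1800.0},
"profile": {"com": 0.0, "type": "msg_par_hg", "cpu": 8e8}}}]}'''

def parse_profile(profile):
    """Dispatch on the profile 'type' before reading type-specific fields."""
    kind = profile["type"]
    if kind == "delay":
        return ("delay", profile["delay"])
    if kind == "msg_par_hg":
        return ("msg_par_hg", profile["com"], profile["cpu"])
    raise ValueError(f"unknown profile type: {kind}")

for event in json.loads(message)["events"]:
    if event["type"] == "JOB_SUBMITTED":
        print(parse_profile(event["data"]["profile"]))
```

In Rust/serde terms, the equivalent fix is an internally tagged enum over the profile types instead of a single struct with a mandatory `delay` field.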
Oh yes indeed, I forgot to push Batsim's latest revision...
It should be reproducible now :)
The scheduler seems to fall into an infinite loop after displaying the following message :(.
[SUBMIT_JOB { timestamp: 33407.0024, data: SubmitJob { job_id: "rej!37296", job: Job { id: "rej!37296", res: 1, profile: "11398", subtime: 32467, walltime: 86400 }, profile: Some(MsgParHg { com: 0, cpu: 1139800000000 }) } }]
The scheduler is still going into an infinite loop on my laptop :(.
[...]
[master_host0:Scheduler REQ-REP:(890) 33407.002400] [network/INFO] Sending '{"now":33407.002400,"events":[{"timestamp":33407.002400,"type":"JOB_KILLED","data":{"job_ids":["038390!37296"]}}]}'
[master_host0:Scheduler REQ-REP:(890) 33407.002400] [network/INFO] Received '{"now":33407.0024,"events":[{"type":"SUBMIT_JOB","timestamp":33407.0024,"data":{"job_id":"rej!37296","job":{"id":"rej!37296","res":1,"profile":"11398","subtime":32467.0,"walltime":86400.0},"profile":{"type":"msg_par_hg","com":0.0,"cpu":1139800000000.0}}},{"type":"EXECUTE_JOB","timestamp":33407.0006,"data":{"job_id":"038390!37297","alloc":"1-1"}}]}'
[master_host0:server:(2) 33407.003000] [server/INFO] Server received a message of type JOB_SUBMITTED_BY_DP:
[master_host0:server:(2) 33407.003000] [server/INFO] Parsing user-submitted job rej!37296
[master_host0:server:(2) 33407.003000] [server/INFO] The profile of user-submitted job '11398' does not exist yet.
[master_host0:server:(2) 33407.003600] [server/INFO] Server received a message of type SCHED_EXECUTE_JOB:
[a1:job_038390!37297:(891) 33407.003600] [jobs_execution/INFO] Creating task 'phg 37297'10832''
[a1:job_038390!37297:(891) 33407.003600] [jobs_execution/INFO] Executing task 'phg 37297'10832''
[master_host0:server:(2) 33407.004200] [server/INFO] Server received a message of type SCHED_READY:
[master_host0:Scheduler REQ-REP:(892) 33407.004200] [network/INFO] Sending '{"now":33407.004200,"events":[{"timestamp":33407.003000,"type":"JOB_SUBMITTED","data":{"job_id":"rej!37296","job":{"id":"rej!37296","res":1,"profile":"11398","subtime":32467.000000,"walltime":86400.000000},"profile":{"type":"msg_par_hg","com":0.000000,"cpu":1139800000000.000000}}}]}'
This is bizarre ...
Anyway, I think I found the issue. It happens if a job is killed multiple times before Batsim acknowledges the kill, because Batsim seems to always decrement the number of running jobs.
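A sketch of the scheduler-side workaround described above (names are hypothetical, not Batsim's or the scheduler's actual API): track which kills are still awaiting Batsim's JOB_KILLED acknowledgment, and suppress duplicate KILL_JOB requests for the same job in the meantime.

```python
class KillGuard:
    """Avoid sending KILL_JOB twice for a job whose first kill is still in flight."""

    def __init__(self):
        self.pending_kills = set()  # job_ids with an unacknowledged kill

    def request_kill(self, job_id):
        """Return True if a KILL_JOB event should actually be sent."""
        if job_id in self.pending_kills:
            return False  # a kill is already in flight; don't send another
        self.pending_kills.add(job_id)
        return True

    def on_job_killed(self, job_id):
        """Called when Batsim's JOB_KILLED acknowledgment arrives."""
        self.pending_kills.discard(job_id)
```

With this guard, Batsim only ever sees one kill per job between acknowledgments, so its running-job counter is decremented at most once.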
I confirm: I added the check to my scheduler, and it seems to work now.
Thanks for the information, I'll try to reproduce this behaviour.
Can reproduce the issue with:
base_output_directory: /tmp/batsim_tests/issue36
base_variables:
  batsim_dir: ${base_working_directory}

implicit_instances:
  implicit:
    sweep:
      platform:
        - {"name":"small", "filename":"${batsim_dir}/platforms/small_platform.xml", "master_host":"master_host"}
      workload:
        - {"name":"tiny", "filename": "${batsim_dir}/workload_profiles/one_delay_job.json"}
      algo:
        - {"name":"killer", "sched_name":"killer"}
      delay_before_kill: [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
      nb_kills_per_job: [0, 1, 2, 3]
    generic_instance:
      timeout: 60
      working_directory: ${base_working_directory}
      output_directory: ${base_output_directory}/results/${instance_id}
      batsim_command: batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out -m ${platform[master_host]}
      sched_command: batsched -v ${algo[sched_name]} --variant_options_filepath ${output_directory}/sched_input.json
      commands_before_execution:
        # Generate sched input
        - |
          #!/usr/bin/env bash
          cat > ${output_directory}/sched_input.json << EOF
          {
            "nb_kills_per_job": ${nb_kills_per_job},
            "delay_before_kill": ${delay_before_kill}
          }
          EOF

commands_before_instances:
  - ${batsim_dir}/test/is_batsim_dir.py ${base_working_directory}
  - ${batsim_dir}/test/clean_output_dir.py ${base_output_directory}
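If I recall the sweep semantics correctly, the `sweep` section above is expanded as a Cartesian product over the parameter lists, so 11 `delay_before_kill` values times 4 `nb_kills_per_job` values yield 44 instances. An illustrative sketch of that enumeration:

```python
from itertools import product

# Parameter lists from the sweep above
delay_before_kill = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
nb_kills_per_job = [0, 1, 2, 3]

# Each combination becomes one experiment instance
instances = [
    {"delay_before_kill": d, "nb_kills_per_job": n}
    for d, n in product(delay_before_kill, nb_kills_per_job)
]
print(len(instances))  # 44 instances
```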
Instances fail when jobs are killed more than once:
Should be fixed in 1817fb5. Can you confirm?
Hi, it's fixed, thanks!
To reproduce the bug
You can run my scheduler:
and batsim
The cluster:
And you can use this workload
https://gist.github.com/adfaure/ea2868ce9c152d590573bb778d767b7e