oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Assertion data->nb_running_jobs >= 0 failed #36

Closed adfaure closed 7 years ago

adfaure commented 7 years ago

To reproduce the bug

You can run my scheduler:

mkdir rust ; cd rust

git clone https://gitlab.inria.fr/adfaure/procset.rs
git clone https://gitlab.inria.fr/adfaure/bat-rust rustbatsim
git clone https://gitlab.inria.fr/adfaure/schedulers

# Activate logs
export RUST_LOG=nodegrp=trace

cd schedulers; cargo run --bin nodegrp

and batsim

./batsim -p ../platforms/clusterxxx.xml -m master_host0   -w ../../red-sched/traces/curie_1w_43659000.json --config-file ../../rustbs/schedulers/configurations/default.json

The cluster:

<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
<platform version="4">

<AS id="AS0" routing="Full">
    <cluster id="my_cluster_1" prefix="a" suffix="" radical="0-50"
        speed="1Gf" bw="125MBps" lat="50us" bb_bw="2.25GBps"
        bb_lat="500us" />

    <cluster id="my_cluster_2" prefix="master_host" suffix="" radical="0-0"
        speed="1Gf" bw="125MBps" lat="50us" bb_bw="2.25GBps"
        bb_lat="500us" />

    <link id="backbone" bandwidth="1.25GBps" latency="500us" />

    <ASroute src="my_cluster_1" dst="my_cluster_2" gw_src="amy_cluster_1_router"
        gw_dst="master_hostmy_cluster_2_router">
        <link_ctn id="backbone" />
    </ASroute>
</AS>
</platform>

And you can use this workload: https://gist.github.com/adfaure/ea2868ce9c152d590573bb778d767b7e

{
  "redis": {
    "enabled": false,
    "hostname": "127.0.0.1",
    "port": 6379,
    "prefix": "default"
  },
  "job_submission": {
    "forward_profiles": true,
    "from_scheduler": {
      "enabled": true,
      "acknowledge": true
    }
  }
}

mpoquet commented 7 years ago

Cannot reproduce the issue.

Your scheduler does not seem to handle homogeneous MSG jobs, the provided workload is probably wrong.
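For illustration only, here is a minimal Python sketch of what handling both profile kinds could look like: the parser dispatches on the profile's "type" field instead of requiring a `delay` field on every profile. The class names and the `parse_profile` helper are hypothetical and are not taken from the Rust scheduler; the "delay" profile layout is an assumption, while the `msg_par_hg` fields (`cpu`, `com`) are the ones visible in the JOB_SUBMITTED message in the output below.

```python
import json
from dataclasses import dataclass


@dataclass
class DelayProfile:
    # Assumed layout of a delay profile: a single "delay" field.
    delay: float


@dataclass
class MsgParHgProfile:
    # Fields as they appear in the JOB_SUBMITTED event of this thread.
    cpu: float
    com: float


def parse_profile(raw: dict):
    """Build a profile object based on the "type" tag (hypothetical helper)."""
    kind = raw.get("type")
    if kind == "delay":
        return DelayProfile(delay=float(raw["delay"]))
    if kind == "msg_par_hg":
        return MsgParHgProfile(cpu=float(raw["cpu"]), com=float(raw["com"]))
    raise ValueError(f"unsupported profile type: {kind!r}")


# Profile payload copied from the failing message in this thread.
data = json.loads(
    '{"profile":{"com":0.000000,"type":"msg_par_hg","cpu":800000000.000000}}'
)
profile = parse_profile(data["profile"])
```

With this kind of dispatch, an unknown profile type fails with an explicit error rather than a generic "missing field" message.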

Yaml to reproduce:

base_output_directory: /tmp/batsim_tests/issue36

base_variables:
  batsim_dir: ${base_working_directory}

implicit_instances:
  implicit:
    sweep:
      platform :
        - {"name":"cluster", "filename":"${batsim_dir}/platforms/cluster_issue36.xml", "master_host":"master_host0"}
      workload :
        - {"name":"issue36", "filename": "${batsim_dir}/workload_profiles/issue36.json"}
        #- {"name":"medium", "filename": "${batsim_dir}/workload_profiles/batsim_paper_workload_example.json"}
        #- {"name":"tiny", "filename": "${batsim_dir}/workload_profiles/one_delay_job.json"}
      algo:
        - {"name":"nodegrp", "algo_name":"nodegrp"}
    generic_instance:
      timeout: 60
      working_directory: ${base_working_directory}
      output_directory: ${base_output_directory}/results/${algo[name]}_${workload[name]}_${platform[name]}
      batsim_command: batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --config ${output_directory}/batsim.conf -m ${platform[master_host]}
      sched_command: cd /home/carni/proj/rust/schedulers && RUST_BACKTRACE=1 cargo run --bin ${algo[algo_name]}

      commands_before_execution:
        # Generate Batsim config file
        - |
              #!/usr/bin/env bash
              cat > ${output_directory}/batsim.conf << EOF
              {
                "job_submission": {
                  "forward_profiles": true,
                  "from_scheduler":{
                    "enabled": true,
                    "acknowledge": true
                  }
                }
              }
              EOF

commands_before_instances:
  - ${batsim_dir}/test/is_batsim_dir.py ${base_working_directory}
  - ${batsim_dir}/test/clean_output_dir.py ${base_output_directory}

Output:

2017-06-07 18:48:34,176 INFO: Variables = {'algo': {'algo_name': 'nodegrp', 'name': 'nodegrp'}, 'base_output_directory': '/tmp/batsim_tests/issue36', 'base_working_directory': '/home/carni/proj/batsim', 'batsim_dir': '${base_working_directory}', 'instance_id': '402b9074', 'instance_number': 0, 'platform': {'filename': '${batsim_dir}/platforms/cluster_issue36.xml', 'master_host': 'master_host0', 'name': 'cluster'}, 'workload': {'filename': '${batsim_dir}/workload_profiles/issue36.json', 'name': 'issue36'}, 'working_directory': '/home/carni/proj/batsim', 'output_directory': '/tmp/batsim_tests/issue36/results/nodegrp_issue36_cluster'}
2017-06-07 18:48:34,176 INFO: Working directory: /home/carni/proj/batsim
2017-06-07 18:48:34,176 INFO: Output directory: /tmp/batsim_tests/issue36/results/nodegrp_issue36_cluster
2017-06-07 18:48:34,176 INFO: Executing command 'command0'
2017-06-07 18:48:34,184 INFO: command0 finished
2017-06-07 18:48:34,186 INFO: Batsim command: "batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --config ${output_directory}/batsim.conf -m ${platform[master_host]}"
2017-06-07 18:48:34,186 INFO: Sched command: "cd /home/carni/proj/rust/schedulers && RUST_BACKTRACE=1 cargo run --bin ${algo[algo_name]}"
2017-06-07 18:48:34,186 INFO: Waiting for socket 'tcp://localhost:28000' to be usable
2017-06-07 18:48:34,193 INFO: Socket tcp://localhost:28000 is now usable
2017-06-07 18:48:34,194 INFO: Running Batsim and Sched
2017-06-07 18:48:35,721 ERROR: Sched finished (returncode=101)
2017-06-07 18:48:35,727 ERROR: Sched stderr:
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
     Running `target/debug/nodegrp`
thread 'main' panicked at 'ErrorImpl { code: Message("missing field `delay`"), line: 1, column: 260 } full str: {"now":0.002400,"events":[{"timestamp":0.001800,"type":"JOB_SUBMITTED","data":{"job_id":"038390!35383","job":{"profile":"8","res":2,"id":"038390!35383","subtime":0,"walltime":1800.000000},"profile":{"com":0.000000,"type":"msg_par_hg","cpu":800000000.000000}}}]}', /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:430
stack backtrace:
   0: std::sys::imp::backtrace::tracing::imp::unwind_backtrace
   1: std::sys_common::backtrace::_print
   2: std::panicking::default_hook::{{closure}}
   3: std::panicking::default_hook
   4: std::panicking::rust_panic_with_hook
   5: std::panicking::begin_panic
   6: std::panicking::begin_panic_fmt
   7: batsim::batsim::read_batsim_message
             at /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:430
   8: batsim::batsim::Batsim::get_next_message
             at /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:265
   9: batsim::batsim::Batsim::run_simulation
             at /home/carni/proj/rust/rustbatsim/src/libbatsim/batsim.rs:297
  10: nodegrp::main
             at ./nodegroup/src/main.rs:21
  11: std::panicking::try::do_call
  12: __rust_maybe_catch_panic
  13: std::rt::lang_start
  14: main
  15: __libc_start_main
  16: _start

2017-06-07 18:48:35,738 ERROR: Killing remaining processes {1824, 1778, 1780, 1823}

adfaure commented 7 years ago

Oh yes, indeed: I forgot to push batsim's latest revision...

adfaure commented 7 years ago

It should be reproducible now :)

mpoquet commented 7 years ago

The scheduler seems to fall into an infinite loop after displaying the following message :(.

[SUBMIT_JOB { timestamp: 33407.0024, data: SubmitJob { job_id: "rej!37296", job: Job { id: "rej!37296", res: 1, profile: "11398", subtime: 32467, walltime: 86400 }, profile: Some(MsgParHg { com: 0, cpu: 1139800000000 }) } }]

adfaure commented 7 years ago

I still have this behavior. Did you manage to reproduce it?

mpoquet commented 7 years ago

The scheduler is still going into an infinite loop on my laptop :(.

Batsim output

[...]
[master_host0:Scheduler REQ-REP:(890) 33407.002400] [network/INFO] Sending '{"now":33407.002400,"events":[{"timestamp":33407.002400,"type":"JOB_KILLED","data":{"job_ids":["038390!37296"]}}]}'
[master_host0:Scheduler REQ-REP:(890) 33407.002400] [network/INFO] Received '{"now":33407.0024,"events":[{"type":"SUBMIT_JOB","timestamp":33407.0024,"data":{"job_id":"rej!37296","job":{"id":"rej!37296","res":1,"profile":"11398","subtime":32467.0,"walltime":86400.0},"profile":{"type":"msg_par_hg","com":0.0,"cpu":1139800000000.0}}},{"type":"EXECUTE_JOB","timestamp":33407.0006,"data":{"job_id":"038390!37297","alloc":"1-1"}}]}'
[master_host0:server:(2) 33407.003000] [server/INFO] Server received a message of type JOB_SUBMITTED_BY_DP:
[master_host0:server:(2) 33407.003000] [server/INFO] Parsing user-submitted job rej!37296
[master_host0:server:(2) 33407.003000] [server/INFO] The profile of user-submitted job '11398' does not exist yet.
[master_host0:server:(2) 33407.003600] [server/INFO] Server received a message of type SCHED_EXECUTE_JOB:
[a1:job_038390!37297:(891) 33407.003600] [jobs_execution/INFO] Creating task 'phg 37297'10832''
[a1:job_038390!37297:(891) 33407.003600] [jobs_execution/INFO] Executing task 'phg 37297'10832''
[master_host0:server:(2) 33407.004200] [server/INFO] Server received a message of type SCHED_READY:
[master_host0:Scheduler REQ-REP:(892) 33407.004200] [network/INFO] Sending '{"now":33407.004200,"events":[{"timestamp":33407.003000,"type":"JOB_SUBMITTED","data":{"job_id":"rej!37296","job":{"id":"rej!37296","res":1,"profile":"11398","subtime":32467.000000,"walltime":86400.000000},"profile":{"type":"msg_par_hg","com":0.000000,"cpu":1139800000000.000000}}}]}'

adfaure commented 7 years ago

This is bizarre ...

Anyway, I think I found the issue. It happens when a job is killed multiple times before Batsim acknowledges the kill, because Batsim seems to always decrement the number of running jobs.

I can confirm: I added the check to my scheduler, and it seems to work now.
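A scheduler-side check of this kind could look roughly like the Python sketch below. It is not the actual code of the Rust scheduler: the `KillTracker` class and its method names are hypothetical. The idea is only to remember which kills are still unacknowledged, so that a second kill request for the same job is never sent.

```python
class KillTracker:
    """Tracks running jobs and kill requests not yet acknowledged (hypothetical)."""

    def __init__(self):
        self.running = set()        # jobs currently executing
        self.pending_kills = set()  # kills sent but not yet acknowledged

    def job_started(self, job_id):
        self.running.add(job_id)

    def request_kill(self, job_id):
        """Return True only if a kill request should actually be sent."""
        if job_id not in self.running or job_id in self.pending_kills:
            return False  # unknown job, or a kill is already in flight
        self.pending_kills.add(job_id)
        return True

    def job_killed(self, job_id):
        """Call when the simulator acknowledges the kill (JOB_KILLED event)."""
        self.pending_kills.discard(job_id)
        self.running.discard(job_id)


tracker = KillTracker()
tracker.job_started("038390!37296")
first = tracker.request_kill("038390!37296")   # kill is sent
second = tracker.request_kill("038390!37296")  # duplicate is suppressed
tracker.job_killed("038390!37296")
third = tracker.request_kill("038390!37296")   # job is no longer running
```

This only works around the problem on the scheduler side; the simulator would still need its own guard so that one job can never decrement the running-job counter twice.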

mpoquet commented 7 years ago

Thanks for the information, I'll try to reproduce this behaviour.

mpoquet commented 7 years ago

Can reproduce the issue with:

base_output_directory: /tmp/batsim_tests/issue36
base_variables:
  batsim_dir: ${base_working_directory}
implicit_instances:
  implicit:
    sweep:
      platform :
        - {"name":"small", "filename":"${batsim_dir}/platforms/small_platform.xml", "master_host":"master_host"}
      workload :
        - {"name":"tiny", "filename": "${batsim_dir}/workload_profiles/one_delay_job.json"}
      algo:
        - {"name":"killer", "sched_name":"killer"}
      delay_before_kill: [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
      nb_kills_per_job: [0,1,2,3]
    generic_instance:
      timeout: 60
      working_directory: ${base_working_directory}
      output_directory: ${base_output_directory}/results/${instance_id}
      batsim_command: batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out -m ${platform[master_host]}
      sched_command: batsched -v ${algo[sched_name]} --variant_options_filepath ${output_directory}/sched_input.json
      commands_before_execution:
        # Generate sched input
        - |
              #!/usr/bin/env bash
              cat > ${output_directory}/sched_input.json << EOF
              {
                "nb_kills_per_job": ${nb_kills_per_job},
                "delay_before_kill": ${delay_before_kill}
              }
              EOF
commands_before_instances:
  - ${batsim_dir}/test/is_batsim_dir.py ${base_working_directory}
  - ${batsim_dir}/test/clean_output_dir.py ${base_output_directory}

Instances fail when jobs are killed more than once.
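To give an idea of the size of this test, the sketch below enumerates the parameter combinations in Python, assuming the sweep is expanded as a Cartesian product of its lists (an assumption about the experiment tool, not something stated in this thread). The variable names mirror the YAML keys; `suspect` is a hypothetical name for the combinations that kill a job more than once.

```python
from itertools import product

# Values copied from the sweep section of the YAML above.
delay_before_kill = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
nb_kills_per_job = [0, 1, 2, 3]

# One instance per (delay, number of kills) pair; platform, workload and
# algo each have a single value, so they do not multiply the count.
instances = [
    {"delay_before_kill": d, "nb_kills_per_job": k}
    for d, k in product(delay_before_kill, nb_kills_per_job)
]

# Combinations in which a job is killed more than once.
suspect = [i for i in instances if i["nb_kills_per_job"] > 1]

print(len(instances), len(suspect))
```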

mpoquet commented 7 years ago

Should be fixed in 1817fb5. Can you confirm?

adfaure commented 7 years ago

Hi, it is fixed. Thanks!