oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Batsim deadlocks on kill #37

Closed: mpoquet closed this issue 7 years ago

mpoquet commented 7 years ago

It seems that Batsim deadlocks under some conditions when jobs are killed.

Versions

Yaml to reproduce:

(not all of these files are available in the repo)

# If needed, the output directory of this script can be specified within this file
base_output_directory: /tmp/batsim_tests/issue37

base_variables:
  batsim_dir: ${base_working_directory}

implicit_instances:
  implicit:
    sweep:
      platform :
        - {"name":"cluster", "filename":"${batsim_dir}/platforms/cluster_issue36.xml", "master_host":"master_host0"}
      workload :
        - {"name":"tiny", "filename": "${batsim_dir}/workload_profiles/one_delay_job.json"}
      algo:
        - {"name":"killer", "sched_name":"killer"}
    generic_instance:
      timeout: 60
      working_directory: ${base_working_directory}
      output_directory: ${base_output_directory}/results/${algo[name]}_${workload[name]}_${platform[name]}
      batsim_command: batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --config ${output_directory}/batsim.conf -m ${platform[master_host]}
      sched_command: batsched -v ${algo[sched_name]} --variant_options_filepath ${output_directory}/sched_input.json

      commands_before_execution:
        # Generate Batsim config file
        - |
              #!/usr/bin/env bash
              cat > ${output_directory}/batsim.conf << EOF
              {
                "job_submission": {
                  "forward_profiles": true,
                  "from_scheduler":{
                    "enabled": true,
                    "acknowledge": true
                  }
                }
              }
              EOF
        # Generate sched input
        - |
              #!/usr/bin/env bash
              cat > ${output_directory}/sched_input.json << EOF
              {
                "nb_kills_per_job": 1,
                "delay_before_kill": 10
              }
              EOF

commands_before_instances:
  - ${batsim_dir}/test/is_batsim_dir.py ${base_working_directory}
  - ${batsim_dir}/test/clean_output_dir.py ${base_output_directory}

mpoquet commented 7 years ago

Does not seem to depend on the platform nor the workload.

mpoquet commented 7 years ago

Ahem... Dynamic submissions should not be allowed with this scheduler, yet the batsim.conf generated by the script above enables them ("from_scheduler": "enabled": true). This is why the deadlock occurs.
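
For anyone hitting the same thing, here is a minimal sketch of the config-generation step with scheduler-side (dynamic) submissions turned off. It assumes that setting `"enabled": false` under `"from_scheduler"` is enough to avoid the situation described above; that has not been verified in this thread. The fallback to a temporary directory for `output_directory` is only there so the snippet runs on its own.

```shell
#!/usr/bin/env bash
# Sketch only: same batsim.conf as in the reproduction script, except that
# submissions from the scheduler are disabled.
# Assumption: output_directory is normally provided by the test script;
# fall back to a fresh temporary directory when it is not set.
output_directory="${output_directory:-$(mktemp -d)}"

cat > "${output_directory}/batsim.conf" << EOF
{
  "job_submission": {
    "forward_profiles": true,
    "from_scheduler": {
      "enabled": false,
      "acknowledge": true
    }
  }
}
EOF

echo "Wrote ${output_directory}/batsim.conf"
```

The only change from the reproduction script is the `"enabled"` value; every other key is kept as it was.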