oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

Using the PFS might lead to segfaults #31

mpoquet closed this issue 7 years ago

mpoquet commented 7 years ago

Investigating.

mpoquet commented 7 years ago

I can reproduce the issue. Data to reproduce: https://gist.github.com/mpoquet/b9a904acdad3e5b43351d9e9acbb9eff

Command

${BATSIM_DIR}/tools/experiments/execute_instances.py issue31.yaml

Output

2017-05-23 14:51:02,392 INFO: Base working directory: /home/carni/proj/test/batissue31
2017-05-23 14:51:02,392 INFO: Base output directory: /tmp/batissue31
2017-05-23 14:51:02,397 INFO: Worker (localhost,0) got 65686a9a ({'platform': {'name': 'small_pfs0', 'filename': '${test_dir}/small_platform_pfs0.xml'}, 'workload': {'name': 'simple_pfs0', 'filename': '${test_dir}/simple_workload_pfs0.json'}, 'explicit': False, 'instance_name': 'implicit'})
2017-05-23 14:51:02,397 INFO: Worker (localhost,0) runs 65686a9a
2017-05-23 14:51:02,613 ERROR: Worker (localhost,0) finished 65686a9a (returncode=2).
2017-05-23 14:51:02,613 INFO: 

----- begin of instance 65686a9a log -----
2017-05-23 14:51:02,619 ERROR: Instance 65686a9a stdout:
2017-05-23 14:51:02,519 INFO: Variables = {'base_output_directory': '/tmp/batissue31', 'base_working_directory': '/home/carni/proj/test/batissue31', 'instance_id': '65686a9a', 'instance_number': 0, 'platform': {'filename': '${test_dir}/small_platform_pfs0.xml', 'name': 'small_pfs0'}, 'socket_port': '$((${instance_number} + 28000))', 'test_dir': '${base_working_directory}', 'workload': {'filename': '${test_dir}/simple_workload_pfs0.json', 'name': 'simple_pfs0'}, 'working_directory': '/home/carni/proj/test/batissue31', 'output_directory': '/tmp/batissue31/results/_simple_pfs0_small_pfs0'}
2017-05-23 14:51:02,519 INFO: Working directory: /home/carni/proj/test/batissue31
2017-05-23 14:51:02,519 INFO: Output directory: /tmp/batissue31/results/_simple_pfs0_small_pfs0
2017-05-23 14:51:02,520 INFO: Batsim command: "batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --mmax-workload --batexec"
2017-05-23 14:51:02,521 INFO: Running Batsim
2017-05-23 14:51:02,585 ERROR: Batsim finished (returncode=139)
2017-05-23 14:51:02,594 ERROR: Batsim stderr:
[0.000000] [batsim/INFO] Workload '8a9444' corresponds to workload file '/home/carni/proj/test/batissue31/simple_workload_pfs0.json'.
[0.000000] [workload/INFO] Loading JSON workload '/home/carni/proj/test/batissue31/simple_workload_pfs0.json'...
[0.000000] [workload/INFO] JSON workload parsed sucessfully. Read 2 jobs and 2 profiles.
[0.000000] [workload/INFO] Checking workload validity...
[0.000000] [workload/INFO] Workload seems to be valid.
[0.000000] [batsim/INFO] The maximum number of machines to use is 4.
[0.000000] [batsim/INFO] Checking whether SMPI is used or not...
[0.000000] [batsim/INFO] SMPI will NOT be used.
[0.000000] [machines/INFO] Creating the machines from platform file '/home/carni/proj/test/batissue31/small_platform_pfs0.xml'...
[0.000000] [machines/INFO] The name of the master host is 'master_host'
[0.000000] [machines/INFO] The name of the parallel file system host is 'pfs_host'
[0.000000] [xbt_cfg/INFO] Switching to the L07 model to handle parallel tasks.
[0.000000] [machines/INFO] There is not Pfs_Host (parallel filesystem host).
[0.000000] [machines/INFO] The machines have been created successfully. There are 4 computing machines.
[0.000000] [batsim/INFO] Batsim's export prefix is '/tmp/batissue31/results/_simple_pfs0_small_pfs0/out'.
[0.000000] [batsim/INFO] The process 'workload_submitter_8a9444' has been created.
[Bourassa:job8a9444!1:(2) 0.000000] [jobs_execution/INFO] Creating task 'p 1'1''
[Bourassa:job8a9444!1:(2) 0.000000] [jobs_execution/INFO] Executing task 'p 1'1''
Segmentation fault.
/tmp/batissue31/results/_simple_pfs0_small_pfs0/batsim_command.sh: line 7: 30005 Segmentation fault (core dumped) batsim -p ${platform[filename]} -w ${workload[filename]} -e ${output_directory}/out --mmax-workload --batexec

2017-05-23 14:51:02,623 INFO: ----- end of instance 65686a9a log -----

2017-05-23 14:51:02,623 INFO: Worker (localhost,0) finished
2017-05-23 14:51:02,628 INFO: Number of successfully executed instances: 0
2017-05-23 14:51:02,628 WARNING: Number of skipped instances: 1
2017-05-23 14:51:02,628 WARNING: Information about these instances can be found in file /tmp/batissue31/instances/instances_info.csv
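
For readers wondering about `returncode=139` in the log above: shells conventionally report a process killed by a signal as 128 plus the signal number, and 139 - 128 = 11 is SIGSEGV on Linux. A minimal sketch of that decoding follows; the helper name `describe_returncode` is hypothetical and is not part of Batsim or its tools.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a shell-style exit status.

    Assumption: the status comes from a POSIX shell, where values above 128
    conventionally mean "killed by signal (status - 128)".
    """
    if returncode > 128:
        signum = returncode - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # 139 is the Batsim return code from the log; 2 is the worker's.
    print(describe_returncode(139))
    print(describe_returncode(2))
```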

mpoquet commented 7 years ago

The wrong hosts were used to compute the job, resulting in an inconsistency between the number of hosts and the hosts themselves, which led to a SimGrid (SG) crash. Fixed in 0ab9b70.
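
To illustrate the kind of inconsistency described above, here is a minimal sketch of a guard that rejects an allocation whose host list does not match the requested host count, so the mismatch surfaces as a clear error instead of a crash further down. This is not the code from 0ab9b70: the function `check_allocation` and the host names are hypothetical, and Batsim itself is not written in Python.

```python
from typing import Sequence


def check_allocation(requested_hosts: int, allocated_hosts: Sequence[str]) -> None:
    """Raise ValueError if the host list does not match the requested count.

    Hypothetical helper for illustration only; it is not part of Batsim.
    """
    if requested_hosts != len(allocated_hosts):
        raise ValueError(
            f"job requested {requested_hosts} hosts "
            f"but {len(allocated_hosts)} were allocated: {list(allocated_hosts)}"
        )


if __name__ == "__main__":
    # Consistent allocation: passes silently.
    check_allocation(2, ["host0", "host1"])

    # Inconsistent allocation: an extra (hypothetical) PFS host slipped in.
    try:
        check_allocation(2, ["host0", "host1", "pfs_host"])
    except ValueError as err:
        print(f"rejected: {err}")
```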

I am adding a PFS test, then closing the issue.

mpoquet commented 7 years ago

Added. Using commit 8a59395 should work.