oar-team / batsim

Batsim: Infrastructure simulator for job and I/O scheduling
GNU Lesser General Public License v3.0

XBT_INFO blackhole #1

Closed mpoquet closed 7 years ago

mpoquet commented 8 years ago

Description

Batsim gets stuck in a black hole under certain circumstances:

Batsim gets stuck before opening the socket, so the experiment script waits for timeout.

Typically, Batsim gets stuck while it is reading the workload. The number of jobs read depends on the number of characters printed by XBT_INFO during each job...

When Batsim is executed this way, Batsim's stdout and stderr are not given back to the Python execution tools, which makes debugging this issue quite annoying.
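
The thread does not state the root cause. One mechanism that would match these symptoms (a hang whose position depends on how many characters were printed, in a context where the launcher does not read the output) is a full, unread pipe. The sketch below is only an illustration of that assumption: it uses the standard `subprocess` module rather than Execo, and `child_finishes` and `CHILD` are hypothetical names, not part of Batsim or its scripts.

```python
import subprocess
import sys

# Child program: write argv[1] bytes to stdout, then exit.
CHILD = "import sys; sys.stdout.write('x' * int(sys.argv[1])); sys.stdout.flush()"


def child_finishes(n_bytes, drain, timeout=2.0):
    """Return True if a child that prints n_bytes exits within `timeout` seconds.

    drain=True  : the parent reads the child's stdout until EOF.
    drain=False : the parent never reads the pipe (the assumed failing setup).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(n_bytes)],
        stdout=subprocess.PIPE,
    )
    try:
        if drain:
            proc.communicate(timeout=timeout)  # consumes the output
        else:
            proc.wait(timeout=timeout)  # leaves the output in the pipe
        return True
    except subprocess.TimeoutExpired:
        return False
    finally:
        # Clean up whether or not the child is still blocked on write().
        proc.kill()
        proc.stdout.close()
        proc.wait()


# A small output fits in the pipe buffer, so the child exits even if nobody reads:
#   child_finishes(1024, drain=False)
# A large output fills the buffer and the child blocks in write() until killed:
#   child_finishes(1 << 20, drain=False)
# Reading the pipe lets the same child finish:
#   child_finishes(1 << 20, drain=True)
```

The pipe capacity is platform-dependent, so the exact number of bytes at which the child blocks will vary between systems.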

mpoquet commented 8 years ago

Probably related to issue #2, but this behaviour might also reveal some kind of buffer overflow which only occurs in this execution context.

pfdutot commented 8 years ago

Tried d437b38 as written here, and cannot reproduce the bug:

pfdutot@malihini:/tmp/test/batsim$ ./test/run_tests.sh 
+ rm -rf test/out/instance_examples
+ rm -rf test/out/unique
+ rm -rf test/out/no_energy
+ rm -rf test/out/space_sharing
+ rm -rf test/out/energy
+ server_launched_by_me=0
++ ps faux
++ grep -v grep
++ wc -l
++ grep redis-server
+ r=1
+ '[' 1 -eq 0 ']'
+ tools/experiments/execute_one_instance.py -od test/out/instance_examples/pftiny ./tools/experiments/instance_examples/pybatsim_filler_tiny.yaml
Traceback (most recent call last):
  File "tools/experiments/execute_one_instance.py", line 729, in <module>
    main()
  File "tools/experiments/execute_one_instance.py", line 635, in main
    batsim_command)
  File "tools/experiments/execute_one_instance.py", line 105, in retrieve_info_from_instance
    cwd = working_directory)
  File "/usr/local/lib/python2.7/dist-packages/execo/process.py", line 957, in __init__
    super(Process, self).__init__(cmd, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'cwd'

mpoquet commented 8 years ago

It seems that, in the Execo version you used, the Process class does not handle the cwd argument. Can you try to use https://github.com/mickours/execo instead of the main Execo repository?

pfdutot commented 8 years ago

Now, using commit 92d7a95856a6e3d1df624d3c969d68181a3a9b1e, I cannot reproduce the bug: all tests run correctly.

Actually, using valgrind I find one case where it fails.

mpoquet commented 7 years ago

Did you remove the -q option from the batsim commands?

The problem still appears on commit 95b5e644 on my machine and on Travis :(. However, for some reason (fixing #2?), Batsim's output is now shown in Travis's output!

The number of jobs that can be read before the black hole depends on the number of characters printed in the different XBT_LOG calls.

This issue seems to be deterministic: the black hole always appears after the same number of jobs (on a fixed commit) on my machine, and this number matches the one observed on Travis.

If Batsim and the scheduler are run manually, the problem does not appear on my machine. Can you try it on yours?

I will try investigating with valgrind, as it can now be executed by the Python script!

How to run the processes manually

Setup and command generation

cd ${BATSIM_ROOT_DIR}
git checkout e26e9f0af
./test/run_test.sh # Will fail; wait for the failure to happen

Run Batsim

./test/out/instance_examples/pfmedium/batsim_command.sh

Run the scheduler (in another terminal)

./test/out/instance_examples/pfmedium/sched_command.sh
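
For repeated runs, the two manual steps above could also be driven from a single script. The following is a minimal sketch, not part of the Batsim tools: `run_pair`, the log file names, and the timeout value are hypothetical. It sends each process's output to a file instead of a pipe, so neither process depends on a parent reading its output.

```python
import subprocess


def run_pair(first_cmd, second_cmd, first_log, second_log, timeout=None):
    """Run two commands concurrently, logging each one's stdout and stderr to a file.

    Returns a tuple with the two exit codes. Raises subprocess.TimeoutExpired
    if a process is still running after `timeout` seconds.
    """
    procs = []
    with open(first_log, "wb") as out1, open(second_log, "wb") as out2:
        try:
            for cmd, out in ((first_cmd, out1), (second_cmd, out2)):
                # stderr is merged into stdout so each log holds the full output.
                procs.append(
                    subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT)
                )
            return tuple(p.wait(timeout=timeout) for p in procs)
        finally:
            # Do not leave a stuck process behind on timeout or error.
            for p in procs:
                if p.poll() is None:
                    p.kill()
                    p.wait()


# Possible usage from the Batsim root directory (log names and timeout are
# illustrative only):
#   run_pair(
#       ["./test/out/instance_examples/pfmedium/batsim_command.sh"],
#       ["./test/out/instance_examples/pfmedium/sched_command.sh"],
#       "batsim.log", "sched.log", timeout=600,
#   )
```

If one of the processes hangs, the logs written up to that point remain on disk, which shows how far each side got.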

mpoquet commented 7 years ago

Fixed by commit bb00d9a224, many thanks!

Travis build after the merge: https://travis-ci.org/oar-team/batsim/builds/160408955. Travis build after removing the '-q' options from the test scripts: https://travis-ci.org/oar-team/batsim/builds/160410385.