mpoquet closed 7 years ago
Probably related to issue #2, but this behaviour might also reveal some kind of buffer overflow that only occurs in this execution context.
Tried d437b38 as written here, and cannot reproduce the bug:
pfdutot@malihini:/tmp/test/batsim$ ./test/run_tests.sh
+ rm -rf test/out/instance_examples
+ rm -rf test/out/unique
+ rm -rf test/out/no_energy
+ rm -rf test/out/space_sharing
+ rm -rf test/out/energy
+ server_launched_by_me=0
++ ps faux
++ grep -v grep
++ wc -l
++ grep redis-server
+ r=1
+ '[' 1 -eq 0 ']'
+ tools/experiments/execute_one_instance.py -od test/out/instance_examples/pftiny ./tools/experiments/instance_examples/pybatsim_filler_tiny.yaml
Traceback (most recent call last):
  File "tools/experiments/execute_one_instance.py", line 729, in <module>
    main()
  File "tools/experiments/execute_one_instance.py", line 635, in main
    batsim_command)
  File "tools/experiments/execute_one_instance.py", line 105, in retrieve_info_from_instance
    cwd = working_directory)
  File "/usr/local/lib/python2.7/dist-packages/execo/process.py", line 957, in __init__
    super(Process, self).__init__(cmd, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'cwd'
It seems that, in the Execo version you used, the Process class does not handle the cwd argument. Can you try to use https://github.com/mickours/execo instead of the main Execo repository?
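The traceback reports the error inside process.py rather than at the call site. A minimal Python 3 sketch of that pattern follows. ProcessBase and Process here are hypothetical stand-ins written for illustration, not Execo's actual classes: a subclass accepts arbitrary keywords and forwards them to a parent constructor that does not know `cwd`.

```python
class ProcessBase:
    """Hypothetical stand-in for a base class that has no `cwd` parameter."""

    def __init__(self, cmd):
        self.cmd = cmd


class Process(ProcessBase):
    """Hypothetical subclass that forwards unknown keywords to its parent."""

    def __init__(self, cmd, **kwargs):
        # `cwd` is accepted here via **kwargs, then rejected one level up.
        super().__init__(cmd, **kwargs)


def try_cwd():
    """Return the TypeError message raised when `cwd` is passed, or None."""
    try:
        Process("true", cwd="/tmp")
    except TypeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(try_cwd())
```

Because the subclass swallows the keyword with **kwargs, inspecting its signature alone would not reveal the missing support. The failure only shows up when the parent constructor is actually called.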
Now, using commit 92d7a95856a6e3d1df624d3c969d68181a3a9b1e, I cannot reproduce the bug: all tests run correctly.
Actually, using valgrind I find one case where it fails.
Did you remove the -q option from the batsim commands?
The problem still appears on commit 95b5e644 on my machine and on Travis :(. However, for some reason (fixing #2?), Batsim's output is now shown in Travis's output!
The number of jobs that can be read before the black hole depends on the number of characters printed in the different XBT_LOG calls.
This issue seems to be deterministic: on a fixed commit, the black hole always appears after the same number of jobs on my machine, and that number matches the one observed on Travis.
If Batsim and the scheduler are run manually, the problem does not appear on my machine. Can you try it on yours?
I will try investigating with valgrind, as it can now be executed by the python script!
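A hang that appears after a fixed amount of printed text would be consistent with a process writing to a pipe that nobody reads. This is only one possible explanation and is not confirmed anywhere in this thread. The Python 3 sketch below fills an undrained pipe in non-blocking mode and counts how many bytes it accepts before a blocking writer would stall. The function names and the per-job character counts are assumptions made for illustration.

```python
import os


def bytes_before_block():
    """Fill a pipe that nobody reads; return how many bytes it accepted."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full pipe raises instead of hanging forever.
    os.set_blocking(write_fd, False)
    chunk = b"x" * 1024
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # The pipe is full; a blocking writer would be stuck here.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)


def jobs_before_block(chars_per_job):
    """Number of whole 'job' log lines that fit before the writer stalls."""
    return bytes_before_block() // chars_per_job


if __name__ == "__main__":
    for chars in (50, 100, 200):  # hypothetical log sizes per job
        print(chars, "chars/job ->", jobs_before_block(chars), "jobs")
```

Under this assumption, longer log lines per job mean fewer jobs are processed before the stall, and the count stays the same from run to run on a given system.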
git checkout e26e9f0af
cd ${BATSIM_ROOT_DIR}
./test/run_tests.sh # Will fail; wait for the failure to happen
./test/out/instance_examples/pfmedium/batsim_command.sh
./test/out/instance_examples/pfmedium/sched_command.sh
Fixed by commit bb00d9a224, many thanks!
Travis build after the merge: https://travis-ci.org/oar-team/batsim/builds/160408955. Travis build after removing the '-q' options from the test scripts: https://travis-ci.org/oar-team/batsim/builds/160410385.
Description
Batsim gets stuck in a black hole under certain circumstances.
Steps to reproduce
./test/run_tests.sh
Problems
Batsim gets stuck before opening the socket, so the experiment script waits until it times out.
Typically, Batsim gets stuck while it is reading the workload. The number of jobs read depends on the number of characters printed by XBT_INFO during each job...
When Batsim is executed this way, its stdout and stderr are not passed back to the Python execution tools, which makes debugging this issue quite annoying.