XBT_INFO blackhole

mpoquet commented 8 years ago

Description

Batsim gets stuck in a black hole under certain circumstances:

Batsim must be executed by the experiment tools (it does not happen when Batsim is executed more directly)
Batsim must be executed in a non-quiet mode (it does not happen in quiet mode)
This bug seems deterministic, but it only occurs on some input files!

Steps to reproduce

Use the data_storage branch, commit d437b38a64eee7a6fa8c9be for example
Remove the -q option from batsim_command in ./tools/experiments/instance_examples/pybatsim_filler_medium.yaml
Run the tests: ./test/run_tests.sh

Problems

Batsim gets stuck before opening the socket, so the experiment script waits for timeout.

Typically, Batsim gets stuck while it is reading the workload. The number of jobs read depends on the number of characters printed by XBT_INFO during each job...

When Batsim is executed this way, its stdout and stderr are not passed back to the Python execution tools, which makes debugging this issue quite annoying.
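
One possible explanation for these symptoms, stated here as an assumption that nothing in this thread confirms, is that the launcher connects the child's stdout to a pipe that nobody reads. Once the pipe is full, the next write blocks, which would look like a hang in the middle of workload loading. The Python sketch below reproduces that generic mechanism with a stand-in child process (it is neither Batsim nor the experiment tools); `CHILD_CODE` and the function name are made up for the illustration.

```python
import subprocess
import sys

# Stand-in for a chatty program: prints one log line per "job".
# Hypothetical child, NOT Batsim; the line format is arbitrary.
CHILD_CODE = (
    "import sys\n"
    "for i in range(20000):\n"
    "    sys.stdout.write('loaded job %d\\n' % i)\n"
    "    sys.stdout.flush()\n"
)


def child_blocks_when_pipe_is_not_read(timeout_s=2.0):
    """Start the child with stdout on a pipe and deliberately do not read it.

    Returns (stuck, returncode, line_count):
      stuck      -- True if the child was still running after timeout_s
      returncode -- exit status once the pipe has finally been drained
      line_count -- number of lines the child printed in total
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
    )
    try:
        # Nobody reads proc.stdout here, so the child blocks in write()
        # as soon as the OS pipe buffer is full.
        proc.wait(timeout=timeout_s)
        stuck = False
    except subprocess.TimeoutExpired:
        stuck = True

    # communicate() drains the pipe, which lets the child finish.
    out, _ = proc.communicate()
    return stuck, proc.returncode, len(out.splitlines())
```

If this is indeed the mechanism, either draining the pipe continuously or redirecting the output to a file should make the hang disappear.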

mpoquet commented 8 years ago

Probably related to issue #2, but this behaviour might also reveal some kind of buffer overflow which only occurs in this execution context.

pfdutot commented 8 years ago

I tried d437b38 as described here, and cannot reproduce the bug:

pfdutot@malihini:/tmp/test/batsim$ ./test/run_tests.sh 
+ rm -rf test/out/instance_examples
+ rm -rf test/out/unique
+ rm -rf test/out/no_energy
+ rm -rf test/out/space_sharing
+ rm -rf test/out/energy
+ server_launched_by_me=0
++ ps faux
++ grep -v grep
++ wc -l
++ grep redis-server
+ r=1
+ '[' 1 -eq 0 ']'
+ tools/experiments/execute_one_instance.py -od test/out/instance_examples/pftiny ./tools/experiments/instance_examples/pybatsim_filler_tiny.yaml
Traceback (most recent call last):
  File "tools/experiments/execute_one_instance.py", line 729, in <module>
    main()
  File "tools/experiments/execute_one_instance.py", line 635, in main
    batsim_command)
  File "tools/experiments/execute_one_instance.py", line 105, in retrieve_info_from_instance
    cwd = working_directory)
  File "/usr/local/lib/python2.7/dist-packages/execo/process.py", line 957, in __init__
    super(Process, self).__init__(cmd, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'cwd'

mpoquet commented 8 years ago

It seems that, in the Execo version you used, the Process class does not handle the cwd argument. Can you try to use https://github.com/mickours/execo instead of the main Execo repository?

pfdutot commented 8 years ago

Now, using commit 92d7a95856a6e3d1df624d3c969d68181a3a9b1e, I cannot reproduce the bug: all tests run correctly.

Actually, using valgrind, I do find one case where it fails.

mpoquet commented 7 years ago

Did you remove the -q option from the batsim commands?

The problem still appears on commit 95b5e644 on my machine and on Travis :(. However, for some reason (fixing #2?), Batsim's output is now shown in the Travis log!

The number of jobs that can be read before the black hole depends on the number of characters printed in the different XBT_LOG calls.

In commit 95b5e644 (travis log: travis build 95b5e644), Batsim stops after loading 623 jobs.
In commit e26e9f0af (travis log: travis build e26e9f0af), Batsim stops after loading 338 jobs.

This issue seems to be deterministic: on a fixed commit, the black hole always appears after the same number of jobs on my machine, and this number matches the one observed on Travis.
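
A stall point that is fixed for a given commit but moves with the number of printed characters is what a bounded output buffer would produce. As an illustration only (the thread does not establish that a pipe is involved), the sketch below counts how many fixed-size log lines fit into an OS pipe before a write would block. `lines_before_pipe_is_full` is a hypothetical helper, the line lengths are arbitrary, and `os.set_blocking` is assumed to be available for pipes (Unix).

```python
import os


def lines_before_pipe_is_full(line: bytes) -> int:
    """Count how many copies of `line` an unread OS pipe accepts.

    The write end is made non-blocking, so a full pipe raises
    BlockingIOError instead of hanging the caller.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    count = 0
    try:
        while True:
            try:
                written = os.write(write_fd, line)
            except BlockingIOError:
                break  # pipe is full: a blocking writer would hang here
            if written < len(line):
                break  # partial write: the pipe is full as well
            count += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return count


# Arbitrary example line lengths: longer log lines fill the pipe sooner.
short_line = b"x" * 19 + b"\n"
long_line = b"x" * 199 + b"\n"
print(lines_before_pipe_is_full(short_line), lines_before_pipe_is_full(long_line))
```

The exact counts depend on the operating system's pipe capacity, but for a given system and line length they do not change from run to run.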

If Batsim and the scheduler are run manually, the problem does not appear on my machine. Can you try it on yours?

I will try investigating with valgrind, as it can now be executed by the python script!

How to run the processes manually

Setup and commands' generation

cd ${BATSIM_ROOT_DIR}
git checkout e26e9f0af
./test/run_tests.sh # Will fail, wait for the failure to happen

Run Batsim

./test/out/instance_examples/pfmedium/batsim_command.sh

Run the scheduler (in another terminal)

./test/out/instance_examples/pfmedium/sched_command.sh
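
The manual run can also be scripted. This is a minimal sketch, assuming the two generated command scripts exist at the paths shown above: it starts both processes and sends their output to log files rather than pipes, so neither process can stall on an unread pipe. `run_pair` and the log file names are hypothetical and not part of the Batsim tools.

```python
import subprocess
from pathlib import Path


def run_pair(first_cmd, second_cmd, log_dir, timeout_s=60):
    """Run two commands side by side, logging each one's output to a file.

    Returns the pair of exit codes. Raises subprocess.TimeoutExpired if
    either command is still running after timeout_s seconds; in that case
    both processes are killed before the exception propagates.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    with open(log_dir / "first.log", "wb") as first_log, \
            open(log_dir / "second.log", "wb") as second_log:
        # Files, unlike pipes, never fill up and block the writer.
        first = subprocess.Popen(
            first_cmd, stdout=first_log, stderr=subprocess.STDOUT)
        second = subprocess.Popen(
            second_cmd, stdout=second_log, stderr=subprocess.STDOUT)
        try:
            return first.wait(timeout=timeout_s), second.wait(timeout=timeout_s)
        finally:
            # Clean up anything still running (for example after a timeout).
            for proc in (first, second):
                if proc.poll() is None:
                    proc.kill()
                    proc.wait()
```

For this issue, the call would be `run_pair(["./test/out/instance_examples/pfmedium/batsim_command.sh"], ["./test/out/instance_examples/pfmedium/sched_command.sh"], "/tmp/batsim_logs")`, where the log directory is an arbitrary choice.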

mpoquet commented 7 years ago

Fixed by commit bb00d9a224, many thanks!

Travis build after the merge: https://travis-ci.org/oar-team/batsim/builds/160408955. Travis build after removing the '-q' options from the test scripts: https://travis-ci.org/oar-team/batsim/builds/160410385.

oar-team / batsim

XBT_INFO blackhole #1
