amv closed this 10 years ago
I think this is getting close to ready for merging. I am running the following command in the Vagrant machine and seem to be getting fully passing runs more than 9 times out of 10:
for N in $(seq 200); do
pkill -9 gearmand; pkill -9 node; sleep 5;
pkill gearmand; pkill node; sleep 5;
echo "running $N...";
make test 2>&1 |tee $( printf "test.make.%04d.log" $N );
done
http://media.dicole.com/gearsloth/long_make_test_run_1.zip
The above zip contains the outputs of the N=200 test runs from the command above. It shows mostly passing tests, plus a few examples of how the tests fail, mostly when the system as a whole ends up in a weird state.
You can find the failing runs easily by looking at the file sizes of the logs. They are runs 19, 20, 21, 104, 105 and 157. Of these, 157 is a weird one as it does not match the common pattern.
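A minimal sketch of how the size check could be automated, assuming the logs are named `test.make.NNNN.log` as in the loop above. The helper name `outlier_logs` and the 10 percent threshold are made up for illustration; it just prints logs whose size is far from the median size.

```shell
# Hypothetical helper: print logs whose byte size differs from the median
# size by more than a given percentage. The threshold is an assumption.
outlier_logs() {
  pct="$1"; shift
  # Build "size<TAB>name" lines, sorted numerically by size.
  sizes=$(for f in "$@"; do
    printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done | sort -n)
  count=$(printf '%s\n' "$sizes" | wc -l | tr -d ' ')
  # Pick the middle line of the sorted list as the median size.
  median=$(printf '%s\n' "$sizes" | sed -n "$(( (count + 1) / 2 ))p" | cut -f1)
  # Print "name<TAB>size" for every log that deviates too much.
  printf '%s\n' "$sizes" | awk -F'\t' -v m="$median" -v p="$pct" '
    { d = $1 - m; if (d < 0) d = -d; if (d * 100 > m * p) print $2 "\t" $1 }'
}
```

Usage would be something like `outlier_logs 10 test.make.*.log`. Passing runs may still vary a bit in size, so the threshold probably needs tuning.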
The common pattern seems to be that for periods of up to 5 minutes, the whole system is in a state where processes cannot connect properly to gearmand (so the tests fail in setup()). I have seen that the gearmand processes do spin up, but after the first failure there are several gearmand processes running, all trying to listen on the same port. Killing these processes and all node processes (successfully) does not seem to solve the system-wide problem, as the next test runs still behave in the same manner. This would indicate that the problem lies within the virtual container.
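One possible guard against leftover processes, sketched under the assumption that `pgrep` is available in the box: instead of a fixed `sleep 5` after `pkill`, poll until no process with that name remains. The helper name `wait_until_gone` and the default timeout are hypothetical. Given the observation above, this likely would not fix the system-wide state, but it would at least rule out stray gearmand processes before each run.

```shell
# Hypothetical helper: wait until no process with the exact name is left,
# or give up after a timeout in seconds (default 30, an arbitrary choice).
wait_until_gone() {
  name="$1"
  timeout="${2:-30}"
  elapsed=0
  # pgrep -x matches the process name exactly and exits 0 while one exists.
  while pgrep -x "$name" > /dev/null; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "$name still running after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

In the loop this could look like `pkill -9 gearmand; wait_until_gone gearmand 30 || echo "skipping run $N"`.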
One hypothesis would be that the virtual container is deprived of some resource which is freed periodically. Unfortunately there are no indicators of anything like this in the system logs, so one would have to start monitoring changes in /proc to find out which resource exhibits odd behaviour during the problem states.
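A rough sketch of what such monitoring could look like: copy a handful of /proc files into a labelled directory before each run, then compare a passing run against a failing one. The helper name `snapshot_proc` and the default file list are assumptions; I do not know which resource, if any, is actually involved.

```shell
# Hypothetical helper: copy a set of /proc files into a labelled snapshot
# directory so a "good" and a "bad" state can be compared with diff -r.
snapshot_proc() {
  outdir="$1"; label="$2"; shift 2
  # Default file list is a guess at which resources might matter.
  [ "$#" -gt 0 ] || set -- /proc/sys/fs/file-nr /proc/net/sockstat \
                           /proc/meminfo /proc/loadavg
  dest="$outdir/$label"
  mkdir -p "$dest"
  for f in "$@"; do
    # Flatten the path into a file name, e.g. proc_net_sockstat.
    # Unreadable or missing files are skipped silently.
    [ -r "$f" ] && cat "$f" > "$dest/$(printf '%s' "$f" | tr '/' '_' | sed 's/^_//')"
  done
  echo "$dest"
}
```

Calling `snapshot_proc /tmp/procsnap "run$N"` at the top of the loop and later running `diff -r /tmp/procsnap/run18 /tmp/procsnap/run19` would show what changed between a passing and a failing run.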
Any ideas on how to defend against the weird states would be welcome, but all in all the situation is now much better than before, with only 6 runs out of 200 experiencing problems (when run in the Vagrant box on a 2013 MacBook Air).
As a summary the main culprits that caused the test failures were:
EDIT: This comment contained a false hypothesis and is now redacted :D