meetings / gearsloth

Gearman job persistence and delayed execution system
MIT License

Fixing problems in make test #63

Closed amv closed 10 years ago

amv commented 10 years ago

EDIT: This comment contained a false hypothesis and is now redacted :D

amv commented 10 years ago

I think this is starting to be ready for merging. I am running the following command in the vagrant machine and seem to be getting fully passing runs more than 9 times out of 10:

for N in $(seq 200); do
    pkill -9 gearmand; pkill -9 node; sleep 5;
    pkill gearmand; pkill node; sleep 5;
    echo "running $N...";
    make test 2>&1 | tee $( printf "test.make.%04d.log" $N );
done
amv commented 10 years ago

http://media.dicole.com/gearsloth/long_make_test_run_1.zip

The above zip contains the outputs of the previously mentioned N=200 test runs. It showcases mostly passing tests and also a couple of examples of how the tests fail, mostly when the system as a whole is in a weird state.

You can find the failing runs easily by looking at the file sizes of the logs: 19, 20, 21, 104, 105 and 157. Of these, 157 is the odd one out, as it does not match the common pattern.

The common pattern seems to be that for periods lasting up to 5 minutes, the whole system is in a state where processes cannot connect properly to gearmand, so the tests fail in setup(). I have seen that the gearmand processes do spin up, but after the first failure there are several gearmand processes running, all trying to listen on the same port. Killing these processes and all node processes (successfully) does not seem to solve the system-wide problem, as the next test runs still behave the same way. This would indicate that the problem lies within the virtual container.

One hypothesis would be that the virtual container is deprived of some resource that is freed periodically. Unfortunately there are no indications of anything like this in the system logs, so one would have to start monitoring changes in /proc to find out which resource behaves oddly during the problem states.

Any ideas on how to defend against these weird states would be welcome, but all in all the situation is now much better than before, with only 6 runs out of 200 experiencing problems (when run in the Vagrant box on a 2013 MacBook Air).

As a summary, the main culprits that caused the test failures were:

  1. The ejector e2e test failed to disconnect the ejectors it used, which were then left lingering and caused later, unrelated tests to randomly fail with ECONNRESET errors from net.js. Further inspection is still warranted to determine why the on('error') handlers of the ejectors are not present when the objects are left lingering.
  2. The sqlite3 initialization code did not defend against multiple simultaneously spawned processes trying to ensure that the database table is created, which caused processes to exit. The processes also allowed other code to execute before the initialization phase was actually done, so the system seemed to work for a second, until the CREATE TABLE lock wait timeout was reached and the process exited.
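The gist of the second fix can be sketched as a "ready gate": operations requested before initialization has really completed are queued, and flushed only once the CREATE TABLE round-trip has finished. The names below are illustrative, not the actual gearsloth API, and in the real code this would sit in front of sqlite3 calls using `CREATE TABLE IF NOT EXISTS` plus a retry on SQLITE_BUSY:

```javascript
// Gate that buffers operations until an async initialize() step completes.
function InitGate(initialize) {
  var self = this;
  this.ready = false;
  this.pending = [];
  initialize(function (err) {
    if (err) throw err;            // real code would propagate the error instead
    self.ready = true;
    // Flush everything that was requested while initialization was running.
    self.pending.forEach(function (op) { op(); });
    self.pending = [];
  });
}

// Run op immediately if initialized, otherwise defer it.
InitGate.prototype.run = function (op) {
  if (this.ready) op();
  else this.pending.push(op);
};
```

With this shape, concurrently spawned processes no longer "seemingly work for one second": nothing touches the table until the initialization callback has actually fired.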