project-bluebird / bluebird

Server to communicate with NATS air traffic simulator and Bluesky

NUM_BLUEBIRD != number of bluebird instances that end up running #75

Closed tallamjr closed 4 years ago

tallamjr commented 5 years ago

It appears that run-containers.sh from the nats repo (https://github.com/alan-turing-institute/nats/blob/turing-away-day/away-day/run-containers.sh) could previously be run with NUM_BLUEBIRD=3, and this would indeed launch 3 instances of bluebird with the following port mapping:

0.0.0.0:5002->5001/tcp             bluebird-2
0.0.0.0:5001->5001/tcp             bluebird-1
0.0.0.0:5000->5001/tcp             bluebird-0
0.0.0.0:9000-9001->9000-9001/tcp   bluesky

However, after testing today, this is no longer possible: one instance always ends up being 'killed'. To find the commit that introduced this issue, git bisect was used with the following script:

#!/bin/bash

# Build the image associated with this commit hash
docker build . --tag turinginst/bluebird:away-day

# Run 'run-containers.sh', which launches NUM_BLUEBIRD bluebird instances plus Bluesky
( cd ../../nats/away-day/ && ./run-containers.sh )

function tear_down() {
    # Shutdown docker instances after tests
    ( cd ../../nats/away-day/ && ./stop-and-remove.sh )
}
echo "-------------------------------------"
docker ps -a
sleep 5
# See if port 5000 gets used
# PORT_NUM_5000=`lsof -ti:5000`
# echo "Port 5000 open on: $PORT_NUM_5000"
# lsof -ti:5000 > /dev/null
# exitCode=$?
# if [[ $exitCode != 0 ]]; then
#     echo "Port 5000 not open..."
#     # tear_down
#     exit $exitCode
# fi

# After a few seconds, one or more of the containers may have died
echo "-------------------------------------"
docker ps -a
# Count the number of running containers.
# This should be equal to NUM_BLUEBIRD + 1, i.e. NUM_BLUEBIRD bluebird instances + 1 Bluesky instance
# NUM_BLUEBIRD is read from line 3 of run-containers.sh (expected form: NUM_BLUEBIRD=<n>)
NUM_BLUEBIRD=$(awk -F "=" 'NR==3{print $2}' ../../nats/away-day/run-containers.sh)
NUM_DOCKER_IMAGES=$(( $(docker ps -q | wc -l) ))
NUM_BLUEBIRD_INSTANCES=$(( NUM_DOCKER_IMAGES - 1 ))

echo "NUM_BLUEBIRD = $NUM_BLUEBIRD"
echo "NUM_DOCKER_IMAGES = $NUM_DOCKER_IMAGES"
echo "NUM_BLUEBIRD_INSTANCES = $NUM_BLUEBIRD_INSTANCES"

if [[ "$NUM_BLUEBIRD" != "$NUM_BLUEBIRD_INSTANCES" ]]; then
    echo "They are not equal"
    tear_down
    exit 1
fi
tear_down

The output for a "good" commit is shown below (note that NUM_BLUEBIRD and NUM_BLUEBIRD_INSTANCES are equal):

-------------------------------------
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS                  PORTS                              NAMES
696f105e269e        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   1 second ago        Up Less than a second   0.0.0.0:5002->5001/tcp             bluebird-2
537e2ce1cc2a        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   2 seconds ago       Up 1 second             0.0.0.0:5001->5001/tcp             bluebird-1
f4afcd10ae6c        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   3 seconds ago       Up 2 seconds            0.0.0.0:5000->5001/tcp             bluebird-0
7c051fc3661d        turinginst/bluesky:1.2.2       "python BlueSky.py -…"   4 seconds ago       Up 2 seconds            0.0.0.0:9000-9001->9000-9001/tcp   bluesky
-------------------------------------
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS                              NAMES
696f105e269e        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   6 seconds ago       Up 5 seconds        0.0.0.0:5002->5001/tcp             bluebird-2
537e2ce1cc2a        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   7 seconds ago       Up 6 seconds        0.0.0.0:5001->5001/tcp             bluebird-1
f4afcd10ae6c        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   8 seconds ago       Up 7 seconds        0.0.0.0:5000->5001/tcp             bluebird-0
7c051fc3661d        turinginst/bluesky:1.2.2       "python BlueSky.py -…"   9 seconds ago       Up 7 seconds        0.0.0.0:9000-9001->9000-9001/tcp   bluesky
NUM_BLUEBIRD = 3
NUM_DOCKER_IMAGES = 4
NUM_BLUEBIRD_INSTANCES = 3
696f105e269e
537e2ce1cc2a
f4afcd10ae6c
7c051fc3661d

Below is example output from a "bad" commit:

-------------------------------------
CONTAINER ID        IMAGE                          COMMAND                  CREATED                  STATUS                  PORTS                              NAMES
bbc554295da9        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   Less than a second ago   Up Less than a second   0.0.0.0:5002->5001/tcp             bluebird-2
3acaed1e4e74        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   2 seconds ago            Up 1 second             0.0.0.0:5001->5001/tcp             bluebird-1
8a017ee78025        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   3 seconds ago            Up 2 seconds            0.0.0.0:5000->5001/tcp             bluebird-0
a58b77a64e14        turinginst/bluesky:1.2.2       "python BlueSky.py -…"   4 seconds ago            Up 2 seconds            0.0.0.0:9000-9001->9000-9001/tcp   bluesky
-------------------------------------
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS                              NAMES
bbc554295da9        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   6 seconds ago       Up 5 seconds        0.0.0.0:5002->5001/tcp             bluebird-2
3acaed1e4e74        turinginst/bluebird:away-day   "/bin/sh -c 'python …"   7 seconds ago       Up 6 seconds        0.0.0.0:5001->5001/tcp             bluebird-1
a58b77a64e14        turinginst/bluesky:1.2.2       "python BlueSky.py -…"   9 seconds ago       Up 7 seconds        0.0.0.0:9000-9001->9000-9001/tcp   bluesky
NUM_BLUEBIRD = 3
NUM_DOCKER_IMAGES = 3
NUM_BLUEBIRD_INSTANCES = 2
They are not equal
bbc554295da9
3acaed1e4e74
a58b77a64e14

Note that they are no longer equal: bluebird-0 is missing from the second listing. The regression was traced to commit f6c9cd6.
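For reference, a bisect session driven by a check script like the one above could look roughly like the sketch below. The function name bisect_with_script and the refs and script path in the usage comment are hypothetical; the exact commands used for this issue were not recorded here. It relies on the documented git bisect run convention that the script exits 0 for a good commit and 1 for a bad one, as the script above does.

```shell
#!/bin/bash
# Sketch only: bisect_with_script is a hypothetical helper name.
# Usage: bisect_with_script <good-ref> <bad-ref> <path-to-check-script>
bisect_with_script() {
    local good_ref="$1" bad_ref="$2" check_script="$3" rc

    # Mark the known-bad and known-good ends of the range
    git bisect start "$bad_ref" "$good_ref" || return 1

    # git runs the script on each candidate commit:
    # exit 0 = good, 1-127 (except 125) = bad, 125 = skip this commit
    git bisect run "$check_script"
    rc=$?

    # Return the working tree to where it was before bisecting
    git bisect reset
    return "$rc"
}

# Example (hypothetical refs and script name):
# bisect_with_script <last-known-good-commit> HEAD ./check-instances.sh
```

When the run finishes, git prints a line of the form "<hash> is the first bad commit". The check script should live outside the tracked tree (or be untracked), so that checking out older commits does not change or remove it.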

This may be linked to issue #76.

evelinag commented 5 years ago

I edited the sector file (commit https://github.com/alan-turing-institute/bluebird/commit/6ef9c7421264c0baa2c0070650fd3ebbd92cd57c), duplicating the first sector, which effectively gets ignored because the corresponding bluebird instance crashes.

Hopefully this workaround will be good enough for the away day.

thobson88 commented 5 years ago

I've added this workaround in the scenario generation package.

tallamjr commented 5 years ago

Just a comment for my own reference:

Regardless of the port it is mapped to, the first instance always seems to die.
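To confirm which instance is missing rather than only counting containers, a small helper along these lines could be used. This is a sketch: missing_bluebirds is a hypothetical name, and it assumes the containers are named bluebird-0 to bluebird-(N-1), as in the docker ps output earlier in this thread.

```shell
#!/bin/bash
# Sketch only: missing_bluebirds is a hypothetical helper name.
# Reads running container names on stdin (one per line) and prints every
# expected name bluebird-0 .. bluebird-(N-1) that is not among them.
missing_bluebirds() {
    local expected="$1" running i
    running="$(cat)"
    for (( i = 0; i < expected; i++ )); do
        # -x: match the whole line, -F: treat the name as a fixed string
        grep -qxF "bluebird-$i" <<< "$running" || echo "bluebird-$i"
    done
}

# Example with a live Docker daemon (docker ps lists running containers only):
# docker ps --filter "name=bluebird-" --format '{{.Names}}' | missing_bluebirds 3
```

In the "bad" output above, bluebird-0 is absent even from docker ps -a, which suggests the container is removed as soon as it exits. If so, its logs would have to be captured while it is still running, for example with docker logs -f bluebird-0 immediately after start-up.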

rkm commented 4 years ago

Closing this. We can investigate if we plan to run that demo again.