juju-solutions / matrix

Automatic testing of big software deployments under various failure conditions

matrix never finishes (sometimes) #81

Closed kwmonroe closed 7 years ago

kwmonroe commented 7 years ago

I ran cwr on a bundle last night on my local provider as well as a gce cloud. Local finished quickly, but today, there's still a model running on my gce controller. The matrix.log on the unit where matrix is running reports the following over and over again:

health:56:health: Health check: busy

All my units are settled with workload status active and agent status idle:

$ juju status -m matrix-saving-osprey
Model                 Controller  Cloud/Region        Version
matrix-saving-osprey  gce-c       google/us-central1  2.0.3

App      Version  Status  Scale  Charm          Store       Rev  OS      Notes
devenv            active      1  ubuntu-devenv  jujucharms    4  ubuntu
openjdk           active      1  openjdk        jujucharms    5  ubuntu

Unit          Workload  Agent  Machine  Public address   Ports  Message
devenv/0*     active    idle   0        104.197.176.159         devenv ready with: java
  openjdk/0*  active    idle            104.197.176.159         OpenJDK 8 (jre) installed

Machine  State    DNS              Inst id        Series  AZ
0        started  104.197.176.159  juju-cc93dd-0  xenial  us-central1-a

Relation  Provides  Consumes  Type
java      devenv    openjdk   subordinate

The since timestamps are all more than 30s older than the current time, which should be enough for the health check to report everything as healthy:

$ date
Thu Feb 16 16:53:23 UTC 2017

$ juju status -m matrix-saving-osprey --format=yaml | grep -i since
      since: 15 Feb 2017 23:44:10Z
      since: 15 Feb 2017 23:42:08Z
      since: 16 Feb 2017 16:50:37Z
          since: 16 Feb 2017 16:50:37Z
          since: 16 Feb 2017 16:50:37Z
              since: 15 Feb 2017 23:45:32Z
              since: 16 Feb 2017 16:49:59Z
      since: 15 Feb 2017 23:45:32Z
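
For anyone who wants to double check the arithmetic, here is a rough sketch that compares the since values above against the date output. The helper names (parse_since, all_settled) are made up for illustration, and the 30s threshold is just the figure mentioned above; none of this is taken from matrix's actual health check.

```python
from datetime import datetime, timedelta, timezone

# Format of the "since" values printed by `juju status --format=yaml` above.
SINCE_FORMAT = "%d %b %Y %H:%M:%SZ"


def parse_since(value):
    """Parse a value such as '15 Feb 2017 23:44:10Z' into an aware UTC datetime."""
    return datetime.strptime(value, SINCE_FORMAT).replace(tzinfo=timezone.utc)


def all_settled(since_values, now, threshold=timedelta(seconds=30)):
    """Return True if every timestamp is older than `threshold` relative to `now`."""
    return all(now - parse_since(v) > threshold for v in since_values)


# `date` output from above: Thu Feb 16 16:53:23 UTC 2017
now = datetime(2017, 2, 16, 16, 53, 23, tzinfo=timezone.utc)

since_values = [
    "15 Feb 2017 23:44:10Z",
    "15 Feb 2017 23:42:08Z",
    "16 Feb 2017 16:50:37Z",
    "16 Feb 2017 16:50:37Z",
    "16 Feb 2017 16:50:37Z",
    "15 Feb 2017 23:45:32Z",
    "16 Feb 2017 16:49:59Z",
    "15 Feb 2017 23:45:32Z",
]

newest = max(parse_since(v) for v in since_values)
print("all settled:", all_settled(since_values, now))
print("age of newest status change:", now - newest)
```

The newest status change is a little under three minutes old, so a 30s settle window is comfortably exceeded.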

@johnsca thinks this might be caused by the connection being lost, leaving matrix working from stale data. If so, the health check would never see the current workload/agent status of active/idle.
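
To illustrate the stale-data theory, here is a small sketch of a status cache that refuses to hand out data it has not refreshed recently. CachedStatus, StaleStatusError and max_age are hypothetical names for this example only; I have not checked how matrix actually stores model state.

```python
import time


class StaleStatusError(Exception):
    """Raised when the cached status has not been refreshed recently enough."""


class CachedStatus:
    """Hypothetical cache that records when it last received an update."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = None
        self._updated_at = None

    def update(self, data):
        # Called whenever fresh status arrives over the connection.
        self._data = data
        self._updated_at = self._clock()

    def read(self, max_age):
        # Refuse to return data older than max_age seconds.
        if self._updated_at is None or self._clock() - self._updated_at > max_age:
            raise StaleStatusError("status has not been refreshed recently")
        return self._data


# Simulate a dropped connection with a fake clock: no update arrives for 120s.
fake_now = [0.0]
cache = CachedStatus(clock=lambda: fake_now[0])
cache.update({"devenv/0": ("busy", "executing")})
fresh = cache.read(max_age=30)

fake_now[0] = 120.0
try:
    cache.read(max_age=30)
    went_stale = False
except StaleStatusError:
    went_stale = True

print("fresh read:", fresh)
print("detected stale data:", went_stale)
```

With something like this, a health check could fail loudly or reconnect instead of reporting busy forever from an old snapshot.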

pengale commented 7 years ago

I think that matrix needs a generalized timeout for potentially long-running tests. Will work on figuring it out.
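
A minimal sketch of what a generalized timeout could look like, assuming the tests run as asyncio coroutines. run_with_timeout and stuck_health_check are made-up names for illustration, and this is not necessarily how the eventual fix works.

```python
import asyncio


async def run_with_timeout(test_coro, timeout_seconds):
    """Await a test coroutine, giving up after timeout_seconds."""
    try:
        await asyncio.wait_for(test_coro, timeout=timeout_seconds)
        return "complete"
    except asyncio.TimeoutError:
        # wait_for cancels the wrapped coroutine before raising.
        return "timeout"


async def stuck_health_check():
    """Stand-in for a health check that keeps reporting busy forever."""
    while True:
        await asyncio.sleep(0.01)


async def quick_task():
    """Stand-in for a test that finishes immediately."""
    await asyncio.sleep(0)


result_stuck = asyncio.run(run_with_timeout(stuck_health_check(), 0.05))
result_quick = asyncio.run(run_with_timeout(quick_task(), 1))
print("stuck check:", result_stuck)
print("quick task:", result_quick)
```

The point of wrapping at this level is that any task can be bounded the same way, regardless of why it hangs.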

pengale commented 7 years ago

PR: https://github.com/juju-solutions/matrix/pull/83

kwmonroe commented 7 years ago

Closing out since #83 was merged.