juju-solutions / matrix

Automatic testing of big software deployments under various failure conditions

matrix not honoring bundle constraints #113

Closed kwmonroe closed 7 years ago

kwmonroe commented 7 years ago

Hi friends! I'm running the spark-processing bundle.yaml through cwr, which invokes bundletester/matrix. Here's my yaml:

https://api.jujucharms.com/charmstore/v5/spark-processing/archive/bundle.yaml

Note the constraints, for example "mem=7G root-disk=32G" on the spark application. When matrix spins up my bundle for the first time (not chaotically), it seems to lose those constraints. I know this because a 7G machine on aws should have 2 cores, while a 7G machine on gce has 8 cores. Here's an example of the matrix models that were created on both aws and gce. Note the Cores column:

ubuntu@juju-b47b48-ci-shared-0:~$ for i in aws-w gce-c; do juju models -c $i; done
Controller: aws-w

Model                          Cloud/Region   Status     Machines  Cores  Access  Last connection
ci-70/job-22-matrix-set-tiger  aws/us-west-1  available         6      6          never connected
ci-70/job-22-steady-mutt       aws/us-west-1  available         7      9  admin   27 minutes ago

Controller: gce-c

Model                           Cloud/Region        Status     Machines  Cores  Access  Last connection
ci-70/job-22-matrix-sought-asp  google/us-central1  available         6      6          never connected
ci-70/job-22-steady-mutt        google/us-central1  available         7     21  admin   5 minutes ago

The ci-70/job-22-steady-mutt models are correct (verified by ssh'ing to the spark/0 unit and seeing 8 cores on gce, for example). The *-matrix-* models are incorrect (verified by ssh'ing to the spark/0 unit and seeing only 1 core on gce).

Why you lose my constraints?

kwmonroe commented 7 years ago

I should note why this is a big deal to me. Big data bundles have their constraints because apps may not even start on cloud default instance sizes (1 CPU, 1.7G RAM for the big 3). Sometimes a cloud will surprise me and get itself up before the timeout, but in general, matrix is not useful for big data on clouds -- without handling constraints, it's only reliable on lxd (where constraints are moot as long as the host machine is big enough).

@petevg mentioned this may be libjuju that loses the constraint somewhere, so if there's a better place to open this issue, please lmk.

kwmonroe commented 7 years ago

This may not be the matrix... Hang tight while I get some feedback on:

https://bugs.launchpad.net/juju/+bug/1676986

pengale commented 7 years ago

@kwmonroe Interesting. I was going to say that this is a python-libjuju bug, because matrix basically just calls out to python-libjuju and asks it to deploy stuff. But if you were able to replicate with the vanilla client, that might mean that it's a more interesting bug ...

pengale commented 7 years ago

Pulling into the Beta milestone just to remind me to look at it. May not be a matrix bug, per above discussion.

pengale commented 7 years ago

Set as "blocked", as this is either an issue in Juju or an issue in python-libjuju -- needs to be fixed there, and then will automatically be fixed here ...

kwmonroe commented 7 years ago

Confirmed this is not a bug in matrix. It's all about juju failing to handle constraints on bundle "services". If I move the constraints to bundle "machines", all is well:

Controller: gce-w

Model                        Cloud/Region     Status     Machines  Cores  Access  Last connection
job-11-exact-cattle*         google/us-west1  available         0      -  admin   26 minutes ago
job-11-matrix-viable-goblin  google/us-west1  available         5     36  admin   10 minutes ago
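
For readers who want to picture the workaround, here is a rough sketch of a bundle fragment with the constraints moved from the application entry to a machine entry. Only the constraint string "mem=7G root-disk=32G" comes from this thread; the charm URL, series, unit count, and machine number are assumptions for illustration and are not taken from the actual spark-processing bundle.

```yaml
# Hypothetical fragment, not the real spark-processing bundle.
services:
  spark:
    charm: cs:xenial/spark        # assumed charm URL
    num_units: 1                  # assumed unit count
    # constraints: "mem=7G root-disk=32G"   # application-level form that was being ignored
    to:
      - "0"                       # place the unit on the machine declared below
machines:
  "0":
    series: xenial                # assumed series
    constraints: "mem=7G root-disk=32G"     # same constraints, now on the machine
```

The trade-off is that machine-level constraints tie each unit to an explicitly declared machine, so every unit that needs the larger instance needs its own placement entry.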

Feel free to close this unless you want to keep it around for tracking purposes.

pengale commented 7 years ago

@kwmonroe Phew! I was worried that I had missed something about the constraints (there's a fair amount of code in python-libjuju that's just me translating that darn plan that the api generates into something that the api will actually accept on deploy). Glad to hear that it wasn't me :-)