juju-solutions / layer-cwr

Layer for building the Juju Jenkins CI env

jobs failing when run against multiple controllers #114

Open kwmonroe opened 7 years ago

kwmonroe commented 7 years ago

I am not able to get clean cwr runs when I have multiple controllers registered. Even though separate models are created on each controller, it seems as though cwr is running "juju deploy foo" against whatever model is active at the time. Here's console output from a job running on both lxd and gce controllers:

2017-03-16 21:54:57 DEBUG Bootstrap environment: lxd:job-3-star-koi
2017-03-16 21:54:57 DEBUG Connecting to lxd:job-3-star-koi...
2017-03-16 21:54:58 DEBUG Connected.
2017-03-16 21:54:58 DEBUG deploy juju deploy /tmp/cwr-tmp-cyW2lZ/bundletester-EDSoUP/cs__kwmonroe_java_devenv/bundle-cwr.yaml

2017-03-16 21:55:01 DEBUG Bootstrap environment: gce-w:job-3-star-koi
2017-03-16 21:55:03 DEBUG Connecting to gce-w:job-3-star-koi...
2017-03-16 21:55:05 DEBUG deploy juju deploy /tmp/cwr-tmp-cyW2lZ/bundletester-yZNCNJ/cs__kwmonroe_java_devenv/bundle-cwr.yaml

Notice the juju deploy command is not using the -m <controller:model> syntax to place the deployment on a specific model.
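To make the fix concrete, here is a hedged sketch of what model-scoped deploys would look like. The jdeploy wrapper name is made up (a real fix would live in bundletester/cloud-weather-report); the controller and model names are taken from the log above, and the command is echoed rather than executed so it is visible:

```shell
# Hypothetical helper: build a model-scoped deploy command instead of
# relying on whatever model happens to be active. "jdeploy" is a made-up
# name, not existing cwr code.
jdeploy() {
    target="$1"; shift
    # Echo rather than exec so the constructed command is visible here.
    echo juju deploy -m "$target" "$@"
}

jdeploy lxd:job-3-star-koi /tmp/bundle-cwr.yaml
# -> juju deploy -m lxd:job-3-star-koi /tmp/bundle-cwr.yaml
jdeploy gce-w:job-3-star-koi /tmp/bundle-cwr.yaml
# -> juju deploy -m gce-w:job-3-star-koi /tmp/bundle-cwr.yaml
```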

This causes failures because, for example, when the lxd cwr run completes, it tears down its model. Since the gce cwr run apparently deploys to the same model, the lxd teardown causes gce to fail. In this example, we see the gce deploy attempting to add relations in (what I think is) the now-destroyed model that lxd just used:

2017-03-16 21:27:57 DEBUG 2017-03-16 21:27:57 Adding relations...
2017-03-16 21:27:57 DEBUG 2017-03-16 21:27:57  Adding relation ubuntu-devenv:java <-> openjdk:java
2017-03-16 21:27:57 DEBUG Traceback (most recent call last):
2017-03-16 21:27:57 DEBUG   File "/usr/local/bin/juju-deployer", line 11, in <module>
2017-03-16 21:27:57 DEBUG     sys.exit(main())
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/deployer/cli.py", line 140, in main
2017-03-16 21:27:57 DEBUG     run()
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/deployer/cli.py", line 250, in run
2017-03-16 21:27:57 DEBUG     importer.Importer(env, deployment, options).run()
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/deployer/action/importer.py", line 316, in run
2017-03-16 21:27:57 DEBUG     rels_created = self.add_relations()
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/deployer/action/importer.py", line 229, in add_relations
2017-03-16 21:27:57 DEBUG     self.env.add_relation(end_a, end_b)
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/deployer/env/go.py", line 70, in add_relation
2017-03-16 21:27:57 DEBUG     return self.client.add_relation(endpoint_a, endpoint_b)
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/jujuclient/environment.py", line 377, in add_relation
2017-03-16 21:27:57 DEBUG     return self.service.add_relation(*args, **kws)
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/jujuclient/juju2/facades.py", line 1053, in add_relation
2017-03-16 21:27:57 DEBUG     'endpoints': [endpoint_a, endpoint_b]
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/jujuclient/facades.py", line 72, in rpc
2017-03-16 21:27:57 DEBUG     return self.env._rpc(self.check_op(op))
2017-03-16 21:27:57 DEBUG   File "/usr/local/lib/python2.7/dist-packages/jujuclient/rpc.py", line 42, in _rpc
2017-03-16 21:27:57 DEBUG     raise EnvError(result)
2017-03-16 21:27:57 DEBUG jujuclient.exc.EnvError: <Env Error - Details:
2017-03-16 21:27:57 DEBUG  {   u'error': u'application "openjdk" not found',
2017-03-16 21:27:57 DEBUG     u'error-code': u'not found',
2017-03-16 21:27:57 DEBUG     u'request-id': 3,
2017-03-16 21:27:57 DEBUG     u'response': {   }}
2017-03-16 21:27:57 DEBUG  >

I think the solution may lie in bundletester or cloud-weather-report: ensure those applications use the -m <controller:model> syntax any time they issue a juju command.

A quick fix might be to isolate each run_in_container call so that a cwr process runs in its own cwrbox for each controller. The downside is that controller tests would happen sequentially -- the entire lxd cwr run would finish before the entire gce cwr run begins.
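A minimal sketch of that quick fix, with run_in_container stubbed out so the control flow is visible (the real helper lives in scripts/cwr-helpers.sh; the stub only echoes):

```shell
# Stub standing in for the real helper from scripts/cwr-helpers.sh.
run_in_container() { echo "container run: $*"; }

for controller in lxd gce-w; do
    # One cwrbox per controller; each iteration blocks until the previous
    # cwr run finishes, so clouds are tested strictly sequentially.
    run_in_container cwr-helpers.sh run_cwr "$controller"
done
```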

arosales commented 7 years ago

I am reopening this bug: the container fix is a workaround that lets us run multiple clouds, but the low throughput it causes is unsustainable in a CI system.

A fix should get us back to where we were with juju-deployer, when we could target multiple clouds and the bottleneck was the tests, not the infrastructure.

I consider this a critical bug: without a fix, the CI system is severely limited for anyone wanting to run more than 3 bundle tests per day.

-thanks, Antonio

kwmonroe commented 7 years ago

@arosales good points. Thanks for reopening. To put numbers behind your statements, consider this:

http://bigtop.charm.qa/cwr_bundle_hadoop_processing/3/report.html

You can see that our core bundle (the smallest of our big data bundles) takes 3+ hours since we run the clouds sequentially... And that's not even including gce :/

seman commented 7 years ago

@kwmonroe why not run multiple cloud tests in multiple containers (container per cloud) at the same time?

kwmonroe commented 7 years ago

why not run multiple cloud tests in multiple containers (container per cloud) at the same time?

@seman because it would interleave the jenkins job console output. It would be almost impossible to debug jenkins job failures via the console output with multiple backgrounded cwr jobs running.

That said, the output would probably be interleaved just the same if cloud-weather-report and/or bundletester supported multiple clouds. It's worth looking into, but for now, I've settled on waiting a long time for individual clouds to run.

johnsca commented 7 years ago

@kwmonroe Yeah, I think the output is going to be interleaved no matter how we parallelize the tests. Perhaps we can address that by changing the logging config in cwr to prefix the log lines with the controller?
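One low-tech way to get that prefixing without touching cwr's logging config (a sketch only; run_tagged is a made-up helper, not part of cwr) is to pipe each controller's output through sed:

```shell
# Tag every line of a command's output with the controller it belongs to,
# so interleaved parallel logs stay attributable. "run_tagged" is a
# hypothetical helper, not existing cwr code.
run_tagged() {
    controller="$1"; shift
    "$@" 2>&1 | sed "s/^/[$controller] /"
}

run_tagged lxd echo "Adding relations..."
# -> [lxd] Adding relations...
```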

We're already passing the environment (which includes the controller) to bundletester, and it's using a with utils.juju_env(options.environment): context manager to switch to that env. The issue here is that running the clouds in parallel causes the different threads running BT to fight over that active env.

We could change BT to explicitly pass the env to each of the tests it runs; Matrix already has support for that, but Amulet tests don't have any convention to accept or honor it. We could update Amulet's default_environment to honor, e.g., an env var, but we'd still have a corner case of any executable-style test that doesn't use Amulet. I'm not sure that's a case we care too much about supporting, but it's there.
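On the env-var idea: the juju 2.x CLI honors JUJU_MODEL, so a hedged sketch is to run each suite in its own subshell with its own value (echo stands in for the real bundletester invocation; whether Amulet would read the variable is the open question):

```shell
# Each suite runs in a subshell with its own JUJU_MODEL, so parallel
# suites never fight over a shared "active model". Plain "juju ..."
# calls inside the subshell would target that model.
run_suite() {
    (
        export JUJU_MODEL="$1"
        # Real code would invoke bundletester/amulet here.
        echo "testing against $JUJU_MODEL"
    )
}

run_suite lxd:job-3-star-koi &
run_suite gce-w:job-3-star-koi &
wait
```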

All in all, it does seem like @seman's suggestion of handling the parallelization at a higher level would be easier. Though, then we'd be letting a bug lie in the cwr CLI experience by working around it at a higher layer.

kwmonroe commented 7 years ago

@johnsca @seman: I tried running with the following in ./scripts/cwr-helpers.sh:

@@ -407,9 +423,10 @@ function run_cwr_in_container() {
                exit 1
            fi
       else
-          run_in_container cwr-helpers.sh run_cwr "$MODELS_TO_TEST" "$@"
+          run_in_container cwr-helpers.sh run_cwr "$MODELS_TO_TEST" "$@" &
       fi
     done
+    wait

All hell broke loose. On a job set to run on 4 clouds, the gce job didn't run at all:

http://bigtop.charm.qa/cwr_bundle_spark_processing/3/report.html

And aws had double the expected machines (I think juju deployed the gce stuff onto the aws model). Perhaps it's timing; I'll try sleeping a bit before firing off another run_in_container.

seman commented 7 years ago

Another option would be to create a Jenkins job for each cloud, or each registered controller. So when someone runs cwr-charm-commit, we would create a Jenkins job per controller. These jobs would be exactly the same except for the controller name. This would keep the console logs clean.
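The per-controller job generation could be sketched like this (the template, the @CONTROLLER@ placeholder, and the jenkins-cli step are all assumptions, not existing cwr code):

```shell
# Render one Jenkins job config per controller from a shared template.
# A real version would pipe each result into "jenkins-cli create-job";
# here only the rendering step is shown.
make_job_config() {
    sed "s/@CONTROLLER@/$1/"
}

for controller in lxd gce-w aws; do
    echo '<controller>@CONTROLLER@</controller>' | make_job_config "$controller"
done
# -> <controller>lxd</controller>  (and likewise for gce-w and aws)
```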

johnsca commented 7 years ago

@kwmonroe I don't think it's a timing issue. You should verify what the value of $MODELS_TO_TEST is for each run_in_container call. Since it's a global, I suspect some overlap there, though I'm not sure how. Might also be related to the state of ~/.local/share/juju when it gets copied in to each container.
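A hedged sketch of that check (debug_models and the model names are made up; the point is just to log the value each iteration is about to hand to run_in_container):

```shell
# Hypothetical debug aid, not existing cwr code: dump the value each
# run_in_container invocation would receive, so any overlap between
# iterations shows up in the Jenkins console.
debug_models() {
    echo "DEBUG MODELS_TO_TEST=$1"
}

for controller in lxd gce-w aws azure; do
    MODELS_TO_TEST="$controller:job-x"   # stand-in for the real assignment
    debug_models "$MODELS_TO_TEST"
    # run_in_container cwr-helpers.sh run_cwr "$MODELS_TO_TEST" ...
done
```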

kwmonroe commented 7 years ago

I verified MODELS_TO_TEST looks good, and I tried sleeping 5 minutes between loop iterations, but still didn't get consistent results:

http://bigtop.charm.qa/cwr_bundle_hadoop_processing/7/report.html

LXD shows odd stuff like "Unable to connect to: <controller>:8443" and getresponse() got an unexpected keyword argument 'buffering'. Maybe unrelated, but for now I'm resigned to simply waiting for sequential runs.

Might also be related to the state of ~/.local/share/juju when it gets copied in to each container.

Good point. I know there used to be stuff like "current-controller" in there, and though I don't see that file now, there could be other state that's changing between runs that I'm not considering.

johnsca commented 7 years ago

controllers.yaml contains a current-controller field. Still, that value would have been set to the last controller in the list previously, and it shouldn't matter anyway because BT does an explicit juju switch with the fully-qualified model.

johnsca commented 7 years ago

This comment is relevant to the getresponse() error, and basically says that the getresponse() portion of the stack trace is superfluous and the important bit is the connection error to the charm store. In fact, all of the failures in that recent run look like connection or timeout issues.