juju-solutions / layer-cwr

Layer for building the Juju Jenkins CI env

Containerize build using LXD #92

Closed johnsca closed 7 years ago

johnsca commented 7 years ago

This is a pretty significant refactor, obviously. I'd really like to see all of the logic not directly related to managing the LXD image, Jenkins jobs, and Juju config (and possibly the release logic) moved into the underlying tooling (cwr, bundletester, matrix). Specifically, I think we need a well-defined way of providing general override information for bundles for the purposes of testing; a rough sketch of what I mean follows. This would need to cover not just overriding specific charms with other revs or builds from repos, but also things like adding a testing-specific charm, overriding the default number of units, etc. Having all of that in the tooling would make the charm much simpler.
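Something along these lines, purely as an illustration (the file format and the --overrides flag are hypothetical; nothing in cwr parses this today):

```bash
# Hypothetical override file; neither this format nor the --overrides flag
# exists in cwr today. It only illustrates the kinds of knobs the tooling
# would need to expose for testing.
cat > overrides.yaml <<'EOF'
bundle: cs:~kwmonroe/bundle/java-devenv
overrides:
  openjdk:
    charm: ./builds/openjdk    # test a local build instead of the store rev
    num_units: 2               # override the bundle's default unit count
additions:
  test-helper:                 # add a testing-specific charm
    charm: cs:~someuser/test-helper
EOF

cwr --overrides overrides.yaml ...   # hypothetical flag
```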

In the meantime, we might consider moving much of the logic into the cwrbox image. It would allow us to push out updates to the logic in the container that would be picked up on the next build (unless a given deployment was using a locally attached resource version of the cwrbox image, in which case updating would be manual for that deployment).
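For reference, pinning a local copy looks something like this (assuming the charm's resource is named cwrbox and the application is named cwr; both names are assumptions here):

```bash
# Attach a locally built image as the charm resource; this deployment then
# stays on that copy until a new file is attached. Resource and application
# names are assumptions.
juju attach cwr cwrbox=./cwrbox.tar.gz
```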

On the point of the image source, manually hosting the tarball in S3 was the quickest way to have it work out of the box, but it's less than ideal. Ideally, we could run a public LXD remote server, but that would require more resources and a domain, and I'm not sure how (or even whether) you can lock down all operations other than copying images from it. I also looked into running a simplestreams host for the images, which would be read-only out of the box, but that requires repackaging the exported image (because simplestreams doesn't support unified images and only supports xz compression), and we'd still need to host that.
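For context, the tarball approach only needs stock lxc commands on both ends; roughly (container and alias names are illustrative):

```bash
# Publishing side: snapshot a prepared container as an image and export it
# as a unified tarball, which then gets uploaded to S3.
# (The container must be stopped, or pass --force.)
lxc publish my-build-container --alias cwrbox
lxc image export cwrbox ./cwrbox    # writes cwrbox.tar.gz

# Consuming side (what the charm does on the Jenkins unit after download):
lxc image import ./cwrbox.tar.gz --alias cwrbox
```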

ktsakalozos commented 7 years ago

Nice work! It's a lot of change, though; I wish we could have split it into smaller steps so it would be easier to review.

In any case, I have taken it for a spin on AWS and LXD; here are the errors I got: http://pastebin.ubuntu.com/24034264/ and http://pastebin.ubuntu.com/24033990/

Is noble-spider your pet? :)

johnsca commented 7 years ago

@ktsakalozos Ah, I missed that lxd init would need to be run when deploying on a fresh machine / VM. I also improved the job console output by turning off script debugging, adding some additional informational echoes, and ensuring that set -e is always on.
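For a fresh unit, the non-interactive form should be enough; something like this (flags per LXD 2.x, and the dir backend avoids needing a ZFS pool):

```bash
# Minimal unattended LXD setup on a fresh machine/VM. Flags are from the
# LXD 2.x era and may differ on newer releases.
sudo lxd init --auto --storage-backend dir
```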

ktsakalozos commented 7 years ago

I removed the old cwr subordinate and added the new one and got the following error: http://pastebin.ubuntu.com/24039183/

Then I logged in to jenkins and ran lxc image remove cwrbox. After resolving the above error, I got this one: http://pastebin.ubuntu.com/24039207/

On a clean install of jenkins+cwr on lxd: http://pastebin.ubuntu.com/24039408/

johnsca commented 7 years ago

I rebased against master and fixed the NoneType exception (run_as doesn't pass through kwargs like I thought it did).

The second failure is somewhat expected; if you delete the image, you'll also need to remove the signature file at /var/lib/jenkins/cwrbox.tar.gz.sig or the hash value from unitdata to get it to re-import the image (see the snippet below). However, it looks like set -e is not taking effect for some reason; that's a significant issue, but I can't see an obvious cause.
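On the jenkins unit, forcing a re-import looks like this (signature path as of this PR):

```bash
# Delete the image and the stored signature so the charm re-imports the
# tarball on the next job run.
lxc image delete cwrbox
sudo rm -f /var/lib/jenkins/cwrbox.tar.gz.sig
```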

The last error I can't replicate, likely because I'm using ZFS for my LXD storage. I'll try to replicate it by bootstrapping Juju with LXD on an Amazon instance, but any debugging you can do on your end would be appreciated.
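A quick way to check which backend a unit is on (the environment section of lxc info reports the storage driver):

```bash
# Look for a line like "storage: zfs" or "storage: dir".
lxc info | grep -i storage
```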

johnsca commented 7 years ago

This seems to be the issue with set -e: https://stackoverflow.com/questions/4072984/set-e-in-a-function
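The short version: errexit is suppressed inside any function whose exit status the caller is testing. A minimal repro:

```bash
#!/bin/bash
set -e

step() {
    false                  # under plain "set -e" this would abort the script...
    echo "still running"   # ...but it executes when the caller tests the function
}

if step; then              # the if-context disables errexit inside step
    echo "step reported success"
fi
```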

johnsca commented 7 years ago

All of the issues that @ktsakalozos hit are resolved now.

kwmonroe commented 7 years ago

This is working great for me. I tested with cwr-52 and ran a charm job and a bundle job concurrently. Watching ps on the jenkins unit, I saw multiple cwr processes with multiple containers active. This is a huge improvement -- previously, two simultaneous cwr processes had a high likelihood of stomping on each other's system-level deps.

I really want to push the merge button because I'm that excited about this. However, I'll let @ktsakalozos do it so he can verify that his earlier comments have been addressed in cwr-52.

+1, lgtm.

kwmonroe commented 7 years ago

Nooooo! I spoke too soon. The bundle job finished clean, but the charm job hit a connection timeout :(

http://juju.does-it.net:8081/job/charm_openjdk_in_cs__kwmonroe_bundle_java_devenv/6/consoleFull

Edit: it seems to have been a transient issue; re-running both jobs succeeded. I retract my "Noooooo", but I would like to see the connection-timeout issue handled better.

johnsca commented 7 years ago

@kwmonroe The timeout seems to be from deployer connecting to the API in the middle of a test run (during "reset"), so it doesn't seem related to this PR. It also appears to have cleared up on a subsequent run.

lazypower commented 7 years ago

This looks super cool, but Travis seems to hate it :(

johnsca commented 7 years ago

@chuckbutler The Travis failures are due to an upstream packaging issue with libcharmstore when installing charm-tools on trusty; we're waiting on @marcoceppi to resolve that. I tried to use the snap instead, but that failed due to this issue. I'd like to find a way to use the snap in Travis, but I have no idea how to proceed there.

pengale commented 7 years ago

Overall, I am +1 on this. Nothing major jumped out at me in a read-through of the code, and I'm able to deploy to AWS without errors and to set up and run the tests.

pengale commented 7 years ago

@kwmonroe The timeout that you ran into is more likely a problem with the charm in general, rather than a problem with containerizing, correct?

If so, I think that we should merge this ...

ktsakalozos commented 7 years ago

LGTM2! Merging it!