Re-architect builder system to address existing constraints

tdcox commented 5 years ago

I propose that we look at re-architecting the builder system to increase the performance of the main build, add flexibility for third party extensions and to address the following issues:

We aren’t updating our builders enough, so they all generate hundreds of CVE errors due to out of date dependencies.
We don’t have a good mechanism for versioning builders so we end up with production applications that we can’t re-release without being forced to deal with breaking changes derived from updated dependencies in builders. It will be necessary to be able to pin builder versions for extended support periods for corporate customers and to have multiple versions of builders available in parallel.
Theoretically, some dependencies may get removed from the web in future and we would not be able to enable users to continue to run code based on the last known good release of those dependencies under the current system.
There are scenarios where we will need to support multiple flavours of a particular type of builder due to conflicting combinations of shared dependencies. The python builders are already running up against this.
Some builder families like machine-learning are large and really require multiple flavours to address commonly used groups of shared dependencies. For example, a given version of pytorch or tensorflow requires a specific version of python, CUDA, nVidia driver libraries etc., and there are breaking changes between these versions.
It is currently hard to facilitate third parties maintaining builder extensions within the platform as updates to these are tied to the platform release process.

tdcox commented 5 years ago

@abayer

abayer commented 5 years ago

So I've come to believe that machine learning is a canary in the coal mine - the ML/ML GPU builders eat up close to half of image build time, and even the changes I've been making to use base images isn't really helping. Kaniko seems to cause node instability issues at the scale of the ML builder images as well, which is...fun. But it's not just ML, it's the whole architecture, as @tdcox says here. For example, right now, we've been stuck in a weeklong cascade of release process issues that require fixing in jx, changes propagating through jenkins-x-builders, updates to jenkins-x.ymls, then jenkins-x-platform, etc - the turnaround time for a one-line change in jx is ages. This just doesn't work.

We need to be able to have builders updated independently of the jx binary in some way. @tdcox has an idea around a volume containing jx that gets mounted in via the pod template (or some alternative approach), but I honestly don't have an opinion on how we fix this, just that we absolutely must fix this.

cc @pmuir @garethjevans @jstrachan @rawlingsj

abayer commented 5 years ago

I'm not sure about the volume approach - for builders, I suppose we could add something to jenkins-x-versions that defined the jx binary version, and then at the beginning of each pipeline execution, create a volume for that pipeline, downloading the appropriate jx binary to it, and then mounting that volume into the pod and containers, but I'm not sure if that works well. We definitely can't do a shared volume across the cluster, so it would have to be a per-pipeline volume. And even if that worked, we'd have to do something else for controllers.

My gut instinct is for adding the desired jx binary version to jenkins-x-versions somehow, adding a flag to jx upgrade cli like --from-version-stream that decides the version to upgrade to by looking in the version stream, baking a jx binary in (one that we'd change very rarely), and then running jx upgrade cli -b --from-version-stream at container init time. Thoughts?

tdcox commented 5 years ago

Architecturally, you are dealing with a caching problem against multiple dependencies. At the moment, you are doing all the assembly at source then moving the cached copy out to where it is consumed.

You could move the full assembly to the extreme opposite end of the operation and do it as a part of each customer build operation but then you are taking the delay that you are trying to avoid and giving it to every customer multiplied by the number of builds they run. Not good.

The best avenue is probably to look at creating intermediate assemblies at source and then shipping these to the customer for one-off assembly and local caching at platform upgrade time.

This is only going to work if you can separate rapidly changing elements (jx) from slowly changing elements (python 2.7) and cache the intermediate stages. Otherwise you are still going to have to build all the images to run your end-to-end testing.

On top of this, we need a mechanism that supports many more combinations of dependencies and allows for more frequent updates of base container images (to cope with CVE patching on OS and tooling) and also permits more variations of dependencies across time (python 3.6, pytorch 1.0.1, CUDA 9 vs python 3.7, pytorch 1.1.0, CUDA 10.1 for example).