@abayer
So I've come to believe that machine learning is a canary in the coal mine - the ML/ML GPU builders eat up close to half of image build time, and even the changes I've been making to use base images aren't really helping. Kaniko seems to cause node instability issues at the scale of the ML builder images as well, which is...fun. But it's not just ML, it's the whole architecture, as @tdcox says here. For example, right now we've been stuck in a weeklong cascade of release process issues that require fixing in jx, changes propagating through jenkins-x-builders, updates to jenkins-x.ymls, then jenkins-x-platform, etc - the turnaround time for a one-line change in jx is ages. This just doesn't work.
We need to be able to have builders updated independently of the jx binary in some way. @tdcox has an idea around a volume containing jx that gets mounted in via the pod template (or some alternative approach), but I honestly don't have an opinion on how we fix this, just that we absolutely must fix this.
cc @pmuir @garethjevans @jstrachan @rawlingsj
I'm not sure about the volume approach - for builders, I suppose we could add something to jenkins-x-versions that defines the jx binary version, then at the beginning of each pipeline execution create a volume for that pipeline, download the appropriate jx binary to it, and mount that volume into the pod and containers, but I'm not sure that works well. We definitely can't do a shared volume across the cluster, so it would have to be a per-pipeline volume. And even if that worked, we'd have to do something else for controllers.
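Roughly, I'm imagining something along these lines - purely a sketch, where the pinned version, release URL pattern, and image names are illustrative rather than an existing Jenkins X API:

```yaml
# Hypothetical per-pipeline volume sketch: an init container downloads the jx
# version pinned in jenkins-x-versions into an emptyDir volume, which is then
# mounted into the build containers. Version, URL pattern, and image names are
# illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-pod
spec:
  volumes:
    - name: jx-binary
      emptyDir: {}                # per-pipeline volume, never shared across the cluster
  initContainers:
    - name: fetch-jx
      image: alpine:3.10
      command:
        - sh
        - -c
        - |
          # JX_VERSION would be resolved from jenkins-x-versions when the pipeline starts
          JX_VERSION=2.0.1180
          wget -qO- "https://github.com/jenkins-x/jx/releases/download/v${JX_VERSION}/jx-linux-amd64.tar.gz" \
            | tar xz -C /jxbin
      volumeMounts:
        - name: jx-binary
          mountPath: /jxbin
  containers:
    - name: builder
      image: gcr.io/jenkinsxio/builder-go   # builder image no longer needs jx baked in
      command: ["cat"]
      tty: true
      volumeMounts:
        - name: jx-binary
          mountPath: /jxbin                 # steps would need /jxbin on their PATH
```

The obvious costs are the extra download per pipeline and, as noted above, having to solve the same problem separately for the controllers.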
My gut instinct is for adding the desired jx binary version to jenkins-x-versions somehow, adding a flag to jx upgrade cli like --from-version-stream that decides the version to upgrade to by looking in the version stream, baking a jx binary in (one that we'd change very rarely), and then running jx upgrade cli -b --from-version-stream at container init time. Thoughts?
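To make that concrete, here's a rough sketch - both the file layout in jenkins-x-versions and the --from-version-stream flag are hypothetical at this point:

```yaml
# Hypothetical entry in jenkins-x-versions pinning the jx binary
# (path and fields are illustrative, e.g. packages/jx.yml):
version: 2.0.1180
---
# Pod template fragment: the builder image ships a rarely-changing baked-in jx
# and upgrades itself to the pinned version before running the pipeline step.
containers:
  - name: builder-go
    image: gcr.io/jenkinsxio/builder-go
    command:
      - sh
      - -c
      - |
        # -b = batch mode; --from-version-stream (proposed, does not exist yet)
        # would resolve the target version from the version stream instead of "latest"
        jx upgrade cli -b --from-version-stream
        exec build-step.sh   # placeholder for the actual pipeline step command
```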
Architecturally, you are dealing with a caching problem against multiple dependencies. At the moment, you are doing all the assembly at source then moving the cached copy out to where it is consumed.
You could move the full assembly to the extreme opposite end of the operation and do it as part of each customer build operation, but then you are taking the delay you are trying to avoid and giving it to every customer, multiplied by the number of builds they run. Not good.
The best avenue is probably to look at creating intermediate assemblies at source and then shipping these to the customer for one-off assembly and local caching at platform upgrade time.
This is only going to work if you can separate rapidly changing elements (jx) from slowly changing elements (python 2.7) and cache the intermediate stages. Otherwise you are still going to have to build all the images to run your end-to-end testing.
On top of this, we need a mechanism that supports many more combinations of dependencies and allows for more frequent updates of base container images (to cope with CVE patching on OS and tooling) and also permits more variations of dependencies across time (python 3.6, pytorch 1.0.1, CUDA 9 vs python 3.7, pytorch 1.1.0, CUDA 10.1 for example).
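As a purely illustrative sketch of separating those layers - none of these file names or fields exist today - the intermediate assemblies could be described as a matrix of slowly-changing bases and rapidly-changing overlays:

```yaml
# Hypothetical builder matrix: slowly-changing base layers are built and cached
# once per dependency combination; rapidly-changing overlays (jx, pipeline
# tooling) are assembled on top at platform upgrade time.
baseLayers:
  ml-cuda9:
    python: "3.6"
    pytorch: "1.0.1"
    cuda: "9.0"
  ml-cuda10:
    python: "3.7"
    pytorch: "1.1.0"
    cuda: "10.1"
overlays:
  jx: 2.0.1180              # resolved from the version stream at assembly time
builders:
  machinelearning-gpu-cuda9:
    base: ml-cuda9
    overlays: [jx]
  machinelearning-gpu-cuda10:
    base: ml-cuda10
    overlays: [jx]
```

Only the overlay layer would need rebuilding for a jx release; the heavy ML bases would be rebuilt on their own, slower cadence for CVE patching.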
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://jenkins-x.io/community.
/lifecycle stale
/remove-lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://jenkins-x.io/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Provide feedback via https://jenkins-x.io/community.
/close
@jenkins-x-bot: Closing this issue.
I propose that we look at re-architecting the builder system to increase the performance of the main build, add flexibility for third-party extensions, and address the following issues:
We aren't updating our builders often enough, so they all generate hundreds of CVE errors due to out-of-date dependencies.
We don't have a good mechanism for versioning builders, so we end up with production applications that we can't re-release without being forced to deal with breaking changes introduced by updated dependencies in the builders. We need to be able to pin builder versions for extended support periods for corporate customers, and to keep multiple versions of builders available in parallel (a sketch of what this could look like follows this list).
Some dependencies may disappear from the web in future, and under the current system we would have no way to let users keep running code based on the last known good release of those dependencies.
There are scenarios where we will need to support multiple flavours of a particular type of builder due to conflicting combinations of shared dependencies. The Python builders are already running up against this.
Some builder families, like machine learning, are large and really require multiple flavours to cover commonly used groups of shared dependencies. For example, a given version of PyTorch or TensorFlow requires a specific version of Python, CUDA, NVIDIA driver libraries, etc., and there are breaking changes between these versions.
It is currently hard for third parties to maintain builder extensions within the platform, as updates to these are tied to the platform release process.
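As a purely illustrative example of the kind of pinning and parallel availability described above (not an existing format):

```yaml
# Hypothetical pinned-builder manifest: each builder flavour is versioned
# independently, older pins stay available for extended support, and frozen
# flavours only receive CVE rebuilds.
builders:
  go:
    current: 0.1.432
    supported: [0.1.389]       # kept in parallel for corporate extended support
  python37:
    current: 0.1.432
  python27:
    current: 0.1.398           # frozen: CVE rebuilds only, no dependency bumps
  machinelearning-gpu:
    flavours:
      cuda10-py37-pytorch110: 0.1.432
      cuda9-py36-pytorch101: 0.1.401
```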