mattmoor opened 3 years ago
Goal: eliminate queue-proxy dependency on k8s.io/client-go/kubernetes

Two offenders:

- k8s.io/client-go/informers is pulled in via the informed watcher in knative.dev/pkg/configmap (unused in queue-proxy): https://github.com/knative/pkg/pull/1851
- k8s.io/client-go/kubernetes is pulled in directly via knative.dev/pkg/metrics for the stackdriver exporter to fetch secrets.

Pulling in the above PR and commenting out the kubeClient references in knative.dev/pkg/metrics yields a binary size of 42,193,344 bytes (a 32% reduction in size).
Goal: eliminate queue-proxy dependency on contrib.go.opencensus.io/exporter/stackdriver

Two offenders:

- knative.dev/pkg/metrics
- knative.dev/pkg/tracing

Commenting out the stackdriver logic in these packages further reduces the queue-proxy binary size to 35,560,653 bytes (another 18% reduction).
cc @evankanderson
In my build, opencensus also seems to pull in 793066 T github.com/aws/aws-sdk-go/aws/endpoints.init, which with the other aws stuff is another megabyte.
FWIW, that's likely included in my figures, unless it is transitively pulled in through other things. I'm measuring the overall binary size once a particular cut point in the dependency graph has been snipped.
This shouldn't affect the code pages, but @tcnghia also found that if we use:

GOFLAGS=-ldflags="-s -w"

it drops things another ~10MB (on top of what I measured above). Seems like the lowest-hanging fruit yet.
https://blog.filippo.io/shrink-your-go-binaries-with-this-one-weird-trick/
@evankanderson do you want to link to or summarize your thoughts on timeline to cut this dependency?
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale
This reminded me of reading https://www.cockroachlabs.com/blog/go-file-size/
I don't know how up to date it is, but maybe it would be reasonable to consider another language if we wanted to minimize the binary?
For the stackdriver dependency, we're in the (somewhat slow) process of rolling out a queue-proxy that uses Otel to our fleet. We should be OK to delete the support at least in queue-proxy. I don't think we've migrated knative-gcp over to OTel yet, though, so maybe it could be done using build tags?
I believe we also need to publish some docs on how to configure OTel in place of the old stackdriver support, I can try to get that done soon.
/lifecycle frozen
Yeah, I poked through that when I first opened this.
maybe it would be reasonable to consider another language if we wanted to minimize the binary?
While I think minimizing the binary is a nice goal, I also think there are practical considerations. I'm reminded of Kubernetes debating rewriting the pause container in assembly, where the practical implications of maintaining that vs. something slightly larger in a higher-level language weren't worth it.
Likewise here, I don't think the goal is to have an O(KB) QP, but it'd be nice if we could keep it away from O(100MB) 😉
maybe it could be done using build tags
I already added nostackdriver
to enable us to check that we don't introduce these deps in new ways (other than metrics/tracing), but nothing builds with that currently. I'd love to just rip that out when y'all are ready 🤩
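For readers unfamiliar with the pattern, here's a minimal sketch of how a tag like nostackdriver can compile an exporter out; the names and layout here are hypothetical, not the actual knative.dev/pkg code. In a real layout the stub default below would live in a file guarded by `//go:build nostackdriver`, with a sibling file guarded by `//go:build !nostackdriver` replacing it in an init() that imports the stackdriver exporter, so building with `-tags nostackdriver` proves the heavy dependency is gone.

```go
package main

import "fmt"

// newExporter is the compiled-out default. In the two-file layout described
// above (hypothetical), a "//go:build !nostackdriver" file would overwrite
// this variable in init() with a constructor that imports the real
// stackdriver exporter package.
var newExporter = func(name string) (string, error) {
	return "", fmt.Errorf("exporter %q compiled out", name)
}

func main() {
	if _, err := newExporter("stackdriver"); err != nil {
		fmt.Println(err)
	}
}
```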
Gotcha 😃 That all makes sense to me!
While I think minimizing the binary is a nice goal, I also think there are practical considerations. I'm reminded of Kubernetes debating rewriting the pause container in assembly, where the practical implications of maintaining that vs. something slightly larger in a higher-level language weren't worth it. Likewise here, I don't think the goal is to have an O(KB) QP, but it'd be nice if we could keep it away from O(100MB) 😉
Totally agree, after I wrote that I immediately thought: well, what language, then? 🚲 Maaaybe something like Rust, but that's a stretch for maintenance, though it does sound kind of fun as a proof of concept.
I already added nostackdriver to enable us to check that we don't introduce these deps in new ways (other than metrics/tracing), but nothing builds with that currently. I'd love to just rip that out when y'all are ready 🤩
Will try to keep this updated with our progress, we're close enough that I wouldn't mind seeing the stackdriver deletion in 0.21 or 0.22, wdyt? At this point I think it's fair for *us* to take on the pain of porting that patch forward if we absolutely must 🤕
Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.
Have you met @julz 😉
wdyt?
Defer to @evankanderson who's been tracking the broader effort most closely.
Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.
I mean envoy is written in c++ and has survived ok so.. rust may not be so bad (and it has worked well for linkerd)! I started a repo at http://github.com/julz/roo-proxy a while ago to experiment a bit (without much progress yet, tho!). One thing I'd really like to try is to see how much of QP we could reimplement as a wasm plug-in for envoy-- that way we could reuse all of the great work and optimisations and features in envoy and (for people who already have mesh) avoid needing a second sidecar at all.
Anyway, totally ack that there's a maintainability trade-off here, just this may be a case where it's warranted (empirically QP is very expensive for what it does), and rust is pretty ok.
FWIW, just to throw my hat in here too (not the red one necessarily): I've been tinkering on a rust-based queue-proxy in my friday learning time as well, without a ton of progress either. Currently it serves mostly as a segue for me to actually learn the language, but the queue-proxy in itself is a rather complex piece of software by now, I must say :joy:. It ain't trivially rewritten in a day or two.
It'd definitely be interesting to see the diff in resource consumption and performance
Maintainability trade-off... definitely!
/assign
/triage accepted
It sounds like Google may be looking to sustain Stackdriver support for longer than expected... I'm going to try to switch pkg/metrics to a registration pattern, and make stackdriver, prometheus and opencensus be sub-modules which can register themselves, which should allow Google to maintain the stackdriver component outside of pkg but still link it against all the binaries.
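A minimal sketch of that registration pattern (names hypothetical; this is not the actual pkg/metrics API): each exporter sub-module registers a factory from its init(), and a binary opts in with a blank import, so a binary that never imports the stackdriver sub-module never links it.

```go
package main

import (
	"fmt"
	"sort"
)

// ExporterFactory builds a metrics exporter; concrete types are elided.
type ExporterFactory func() (string, error)

var registry = map[string]ExporterFactory{}

// Register is called from each exporter sub-module's init(). In the
// proposed layout (hypothetical names), the prometheus sub-module would
// contain:
//
//	func init() { metrics.Register("prometheus", newPrometheusExporter) }
//
// and a binary opts in with a blank import:
//
//	import _ "knative.dev/pkg/metrics/prometheus"
func Register(name string, f ExporterFactory) { registry[name] = f }

// New looks up a registered factory by name.
func New(name string) (string, error) {
	f, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("unknown exporter %q", name)
	}
	return f()
}

func main() {
	// Simulate what an init() in a sub-module would do.
	Register("prometheus", func() (string, error) { return "prometheus exporter", nil })

	e, _ := New("prometheus")
	fmt.Println(e)

	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```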
We've dropped the stackdriver exporters here: https://github.com/knative/pkg/issues/2173
This has dropped the queue proxy size (on my Mac) from 50MB to 30MB - so a diff of 20MB (~40%).
There might be some additional gains to be made and I'll poke at this a bit more
So there are still gains to be made by dropping `k8s.io/api/core/v1` and `k8s.io/api/apps/v1`, but it requires a lot of reshuffling of packages and types.
Specifically the list is:

1. https://github.com/knative/networking/pull/670
2. Move autoscaler/metric.Stat to its own package so it doesn't pull in any autoscaling APIs
3. Don't consume corev1.Probe in the queue proxy - create a copy of it instead
4. Move knative.dev/configmap.Parse methods to their own package - no corev1 dep
5. Drop corev1 usage in knative.dev/pkg/metrics
   - drop use of corev1.Secret
   - drop use of corev1.ConfigMap - move ConfigMapWatcher to its own package
6. Drop corev1 usage in knative.dev/pkg/tracing
   - drop use of corev1.ConfigMap
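As a sketch of the Probe step, a local trimmed-down copy of corev1.Probe might keep only the fields the queue-proxy actually uses, with JSON tags matching corev1 so serialized probes stay wire-compatible. The field selection here is illustrative, not the actual change:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Probe is a hypothetical trimmed-down copy of corev1.Probe, so the
// queue-proxy no longer needs to link k8s.io/api. The JSON tags match the
// corev1 struct tags, so a probe serialized by the controller with corev1
// types decodes into this local copy unchanged.
type Probe struct {
	InitialDelaySeconds int32 `json:"initialDelaySeconds,omitempty"`
	TimeoutSeconds      int32 `json:"timeoutSeconds,omitempty"`
	PeriodSeconds       int32 `json:"periodSeconds,omitempty"`
	SuccessThreshold    int32 `json:"successThreshold,omitempty"`
	FailureThreshold    int32 `json:"failureThreshold,omitempty"`
}

func main() {
	// JSON produced from a corev1.Probe decodes into the local type.
	raw := `{"timeoutSeconds":1,"periodSeconds":10}`
	var p Probe
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		panic(err)
	}
	fmt.Println(p.TimeoutSeconds, p.PeriodSeconds)
}
```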
I tested changes 1-3 and removed tracing and metrics in order to see what the floor of completing 4-6 would be - that resulted in a queue-proxy binary size of 15MB.
There's probably a gain to be had when we remove its copy of the prometheus libs - i.e. https://github.com/knative/serving/issues/11126
/unassign @mattmoor
/unassign @evankanderson
Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.
I mean envoy is written in c++ and has survived ok so.. rust may not be so bad (and it has worked well for linkerd)! I started a repo at http://github.com/julz/roo-proxy a while ago to experiment a bit (without much progress yet, tho!). One thing I'd really like to try is to see how much of QP we could reimplement as a wasm plug-in for envoy-- that way we could reuse all of the great work and optimisations and features in envoy and (for people who already have mesh) avoid needing a second sidecar at all.
Anyway, totally ack that there's a maintainability trade-off here, just this may be a case where it's warranted (empirically QP is very expensive for what it does), and rust is pretty ok.
@julz The topic is a little old, but I am wondering are there any findings/update for reimplementing QP as a wasm plug-in in envoy? The idea is promising even now.
/area API
/area autoscale
/area networking
What version of Knative?
HEAD
Expected Behavior
queue-proxy is trivially small.
Actual Behavior
As of this morning, queue-proxy is ~55.7MB.
Steps to Reproduce the Problem
Going to self-assign, but feel free to jump in and party on this with me. /assign