knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

queue-proxy is huge #9957

Open mattmoor opened 3 years ago

mattmoor commented 3 years ago

/area API /area autoscale /area networking

What version of Knative?

HEAD

Expected Behavior

queue-proxy is trivially small.

Actual Behavior

As of this morning, queue-proxy is ~55.7MB.

Steps to Reproduce the Problem

GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build ./cmd/queue
ls -l queue
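
To see which symbols dominate the result (standard Go tooling, nothing repo-specific; exact output varies by commit):

go tool nm -size -sort size queue | head

With -size, each line shows the symbol's address, size in bytes, type, and name, so the heaviest contributors float to the top.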

Going to self-assign, but feel free to jump in and party on this with me. /assign

mattmoor commented 3 years ago

Goal: eliminate queue-proxy dependency on k8s.io/client-go/kubernetes

Two offenders:

  1. k8s.io/client-go/informers is pulled in via the informed watcher in knative.dev/pkg/configmap (unused in queue-proxy): https://github.com/knative/pkg/pull/1851
  2. k8s.io/client-go/kubernetes is pulled in directly via knative.dev/pkg/metrics for the stackdriver exporter to fetch secrets.

Pulling in the above PR and commenting out the kubeClient references in knative.dev/pkg/metrics yields a binary size of 42,193,344 bytes (a 32% reduction in size).
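
For anyone following along, import chains like these can be confirmed with stock Go tooling (run from the serving repo root):

go mod why k8s.io/client-go/kubernetes
go list -deps ./cmd/queue | grep -c k8s.io/client-go

go mod why prints the shortest import path from the main module to the package, while go list -deps scopes the question to just the queue binary's dependency graph.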

mattmoor commented 3 years ago

Goal: eliminate queue-proxy dependency on contrib.go.opencensus.io/exporter/stackdriver

Two offenders:

  1. knative.dev/pkg/metrics
  2. knative.dev/pkg/tracing

Commenting out the stackdriver logic in these packages further reduces the queue-proxy binary to 35,560,653 bytes (another ~16% reduction).
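
The same kind of check verifies this cut point, scoped to the queue binary's actual import graph:

go list -deps ./cmd/queue | grep stackdriver

An empty result (grep exiting 1) would confirm the exporter and its transitive dependencies are fully snipped.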

cc @evankanderson

vagababov commented 3 years ago

In my build, opencensus also seems to pull in github.com/aws/aws-sdk-go/aws/endpoints.init (793066 bytes in the symbol table), which with the other aws stuff is another meg.

mattmoor commented 3 years ago

FWIW, that's likely included in my figures, unless it is transitively pulled in through other things. I'm measuring the overall binary size once a particular cut point in the dependency graph has been snipped.

mattmoor commented 3 years ago

Shouldn't affect the code pages, but @tcnghia also found that if we use:

GOFLAGS=-ldflags="-s -w"

It drops things another ~10MB (on top of what I measured above). Seems like the lowest hanging fruit yet.

https://blog.filippo.io/shrink-your-go-binaries-with-this-one-weird-trick/
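
Spelled out as a one-off invocation (no GOFLAGS): per the linked post, -s drops the symbol table and -w drops the DWARF debug info, neither of which affects the executable code.

GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -ldflags="-s -w" ./cmd/queue
ls -l queue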

mattmoor commented 3 years ago

@evankanderson do you want to link to or summarize your thoughts on timeline to cut this dependency?

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

Cynocracy commented 3 years ago

/remove-lifecycle stale

Cynocracy commented 3 years ago

This reminded me of reading https://www.cockroachlabs.com/blog/go-file-size/

I don't know how up to date it is, but maybe it would be reasonable to consider another language if we wanted to minimize the binary?

For the stackdriver dependency, we're in the (somewhat slow) process of rolling out a queue-proxy that uses OTel to our fleet. We should be OK to delete the support, at least in queue-proxy. I don't think we've migrated knative-gcp over to OTel yet, though, so maybe it could be done using build tags?

I believe we also need to publish some docs on how to configure OTel in place of the old stackdriver support; I can try to get that done soon.

mattmoor commented 3 years ago

/lifecycle frozen

Yeah, I poked through that when I first opened this.

maybe it would be reasonable to consider another language if we wanted to minimize the binary?

While I think minimizing the binary is a nice goal, I also think there are practical considerations. I'm reminded of Kubernetes debating rewriting the pause container in assembly: the practical implications of maintaining that vs. something slightly larger in a higher-level language weren't worth it.

Likewise here, I don't think the goal is to have an O(KB) QP, but it'd be nice if we could keep it away from O(100MB) 😉

maybe it could be done using build tags

I already added nostackdriver to enable us to check that we don't introduce these deps in new ways (other than metrics/tracing), but nothing builds with that currently. I'd love to just rip that out when y'all are ready 🤩
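
For anyone unfamiliar with the mechanism, the guard looks roughly like this (file names and the flusher interface are illustrative, not the actual knative.dev/pkg layout):

// exporter.go (always compiled): callers depend only on this small surface.
package metrics

type flusher interface {
	Flush()
}

// stackdriver.go (compiled by default; note the build constraint):
// +build !nostackdriver

package metrics

import sd "contrib.go.opencensus.io/exporter/stackdriver"

// The heavy import is referenced only behind the tag.
func newStackdriverExporter(projectID string) (flusher, error) {
	e, err := sd.NewExporter(sd.Options{ProjectID: projectID})
	if err != nil {
		return nil, err
	}
	return e, nil
}

// stackdriver_disabled.go (selected by: go build -tags nostackdriver ./cmd/queue):
// +build nostackdriver

package metrics

import "errors"

func newStackdriverExporter(projectID string) (flusher, error) {
	return nil, errors.New("stackdriver support compiled out")
}

Building with -tags nostackdriver then leaves the exporter and its transitive AWS/GCP dependencies out of the binary.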

Cynocracy commented 3 years ago

Gotcha 😃 That all makes sense to me!

While I think minimizing the binary is a nice goal, I also think there are practical considerations. I'm reminded of Kubernetes debating rewriting the pause container in assembly: the practical implications of maintaining that vs. something slightly larger in a higher-level language weren't worth it. Likewise here, I don't think the goal is to have an O(KB) QP, but it'd be nice if we could keep it away from O(100MB) 😉

Totally agree; after I wrote that I immediately thought: well, what language, then? 🚲 Maaaybe something like Rust, but that's a stretch for maintenance, though it does sound kind of fun as a proof of concept.

I already added nostackdriver to enable us to check that we don't introduce these deps in new ways (other than metrics/tracing), but nothing builds with that currently. I'd love to just rip that out when y'all are ready 🤩

Will try to keep this updated with our progress; we're close enough that I wouldn't mind seeing the stackdriver deletion in 0.21 or 0.22, wdyt? At this point I think it's fair for /us/ to take on the pain of porting that patch forward if we absolutely must 🤕

mattmoor commented 3 years ago

Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.

Have you met @julz 😉

wdyt?

Defer to @evankanderson who's been tracking the broader effort most closely.

julz commented 3 years ago

Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.

I mean envoy is written in c++ and has survived ok so.. rust may not be so bad (and it has worked well for linkerd)! I started a repo at http://github.com/julz/roo-proxy a while ago to experiment a bit (without much progress yet, tho!). One thing I'd really like to try is to see how much of QP we could reimplement as a wasm plug-in for envoy-- that way we could reuse all of the great work and optimisations and features in envoy and (for people who already have mesh) avoid needing a second sidecar at all.

Anyway, totally ack that there's a maintainability trade-off here, just this may be a case where it's warranted (empirically QP is very expensive for what it does), and rust is pretty ok.

markusthoemmes commented 3 years ago

FWIW, just to throw my hat in here too (not the red one necessarily): I've been tinkering on a rust-based queue-proxy in my friday learning time as well, without a ton of progress either. Currently it serves mostly as a segue for me to actually learn the language, but the queue-proxy in itself is a rather complex piece of software by now, I must say :joy:. It ain't trivially rewritten in a day or two.

It'd definitely be interesting to see the diff in resource consumption and performance.

Maintainability trade-off... definitely!

evankanderson commented 3 years ago

/assign

/triage accepted

It sounds like Google may be looking to sustain Stackdriver support for longer than expected... I'm going to try to switch pkg/metrics to a registration pattern and make stackdriver, prometheus, and opencensus sub-modules that can register themselves, which should allow Google to maintain the stackdriver component outside of pkg but still link it against all the binaries.
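
A minimal sketch of that registration pattern (hypothetical names, not the eventual knative.dev/pkg API): each exporter sub-module registers a factory from its init(), so a binary only links the backends it blank-imports.

package metrics

import "fmt"

// Exporter is the minimal surface callers need; the real interface would be richer.
type Exporter interface {
	Flush()
}

// Factory builds an exporter from its configuration.
type Factory func(cfg map[string]string) (Exporter, error)

var factories = map[string]Factory{}

// Register is called from each exporter sub-module's init().
func Register(backend string, f Factory) {
	factories[backend] = f
}

// New fails cleanly when the requested backend wasn't compiled in.
func New(backend string, cfg map[string]string) (Exporter, error) {
	f, ok := factories[backend]
	if !ok {
		return nil, fmt.Errorf("metrics backend %q not linked into this binary", backend)
	}
	return f(cfg)
}

queue-proxy would then blank-import only, say, the prometheus sub-module (import _ "knative.dev/pkg/metrics/prometheus", path hypothetical), while Google links the stackdriver one into its own builds.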

dprotaso commented 3 years ago

We've dropped the stackdriver exporters here: https://github.com/knative/pkg/issues/2173

This has dropped the queue-proxy size (on my Mac) from 50MB to 30MB - so a diff of 20MB (~40%).

There might be some additional gains to be made, and I'll poke at this a bit more.

dprotaso commented 3 years ago

So there are still gains to be made by dropping k8s.io/api/core/v1 and k8s.io/api/apps/v1, but it requires a lot of reshuffling of packages and types.

Specifically, the list is:

  1. https://github.com/knative/networking/pull/670
  2. Move autoscaler/metric.Stat to its own package so it doesn't pull in any autoscaling APIs
  3. Don't consume corev1.Probe in the queue proxy, but create a copy of it (see the sketch after this list)
  4. Move knative.dev/configmap.Parse methods to their own package (no corev1 dep)
  5. Drop corev1 usage in knative.dev/pkg/metrics:
     - drop use of corev1.Secret
     - drop use of corev1.ConfigMap (move ConfigMapWatcher to its own package)
  6. Drop corev1 usage in knative.dev/pkg/tracing:
     - drop use of corev1.ConfigMap
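
For item 3, a sketch of what the copy might look like; the field set is a guess at what the queue-proxy actually reads, and Port is flattened to int32 where corev1 uses intstr.IntOrString.

// Package readiness (illustrative): a local stand-in for corev1.Probe so the
// queue-proxy's import graph never touches k8s.io/api/core/v1.
package readiness

// HTTPGetAction mirrors the corev1 type of the same name, trimmed down.
type HTTPGetAction struct {
	Path   string
	Port   int32 // corev1 uses intstr.IntOrString; a resolved port is enough here
	Host   string
	Scheme string // "HTTP" or "HTTPS"
}

// TCPSocketAction mirrors the corev1 type of the same name.
type TCPSocketAction struct {
	Port int32
	Host string
}

// Probe carries only the probe settings the queue-proxy acts on.
type Probe struct {
	HTTPGet             *HTTPGetAction
	TCPSocket           *TCPSocketAction
	InitialDelaySeconds int32
	TimeoutSeconds      int32
	PeriodSeconds       int32
	SuccessThreshold    int32
	FailureThreshold    int32
}

The controller would translate the user's corev1.Probe into this shape before handing it to the sidecar (e.g. serialized into an env var), so corev1 stays on the controller side.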

I tested changes 1-3 and removed tracing and metrics in order to see what the floor of completing 4-6 would be - that resulted in a queue-proxy binary size of 15MB.

There's probably a gain to be had when we remove its copy of the prometheus libs - i.e. https://github.com/knative/serving/issues/11126

dprotaso commented 2 years ago

/unassign @mattmoor /unassign @evankanderson

qiming-007 commented 9 months ago

Maaaybe something like Rust, ... it does sound kind of fun as a proof of concept.

I mean envoy is written in c++ and has survived ok so.. rust may not be so bad (and it has worked well for linkerd)! I started a repo at http://github.com/julz/roo-proxy a while ago to experiment a bit (without much progress yet, tho!). One thing I'd really like to try is to see how much of QP we could reimplement as a wasm plug-in for envoy-- that way we could reuse all of the great work and optimisations and features in envoy and (for people who already have mesh) avoid needing a second sidecar at all.

Anyway, totally ack that there's a maintainability trade-off here, just this may be a case where it's warranted (empirically QP is very expensive for what it does), and rust is pretty ok.

@julz The topic is a little old, but I am wondering whether there are any findings/updates on reimplementing QP as a wasm plug-in for envoy? The idea still seems promising.