knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Bound Concurrency of Pod Scraping (Probably with a Work Pool) #8377

Open · julz opened this issue 4 years ago

julz commented 4 years ago

Describe the feature

We've discussed this a few times in Slack, but it seems worth having an issue for it and I didn't spot one, so here one is :).

As of right now we spawn pod scrapes in parallel for ~every revision with no real bound on the number we'll attempt at once across revisions. This means we can potentially have an extremely large number of active outgoing scrapes, causing more contention than needed, exceeding keep-alive connection pool sizes and buffer pools, etc. In our larger environments, we see large numbers of connections waiting on DNS and GC when revisions*pods gets large.

We should probably introduce some sort of work pool for the scrapers.
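
For illustration, here is a minimal sketch of what such a bounded scrape pool could look like in plain Go; `scrapePool`, `scrapeTask`, the worker count, and the queue depth are made up for the example and are not existing Knative identifiers:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// scrapeTask is a stand-in for one pod/revision scrape.
type scrapeTask struct {
	revision string
	pod      string
}

// scrapePool runs scrapes through a fixed number of workers, bounding the
// total number of in-flight scrapes across all revisions.
type scrapePool struct {
	tasks chan scrapeTask
	wg    sync.WaitGroup
}

func newScrapePool(workers, queueDepth int, scrape func(context.Context, scrapeTask)) *scrapePool {
	p := &scrapePool{tasks: make(chan scrapeTask, queueDepth)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for t := range p.tasks {
				ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
				scrape(ctx, t)
				cancel()
			}
		}()
	}
	return p
}

// enqueue blocks when the queue is full, which is what provides back-pressure
// instead of spawning an unbounded number of goroutines.
func (p *scrapePool) enqueue(t scrapeTask) { p.tasks <- t }

func (p *scrapePool) shutdown() {
	close(p.tasks)
	p.wg.Wait()
}

func main() {
	pool := newScrapePool(64, 1024, func(ctx context.Context, t scrapeTask) {
		// Real code would issue the HTTP metrics scrape here.
		fmt.Println("scraping", t.revision, t.pod)
	})
	for i := 0; i < 5; i++ {
		pool.enqueue(scrapeTask{revision: "rev-a", pod: fmt.Sprintf("pod-%d", i)})
	}
	pool.shutdown()
}
```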

markusthoemmes commented 4 years ago

In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.

vagababov commented 4 years ago

For color: revisions*pods is effectively 16*revisions as the pod count goes towards infinity. That is, it's bounded in the number of pods per revision, but not in the number of revisions.

vagababov commented 4 years ago

> In the same vein, we should likely switch the probes to a workqueue-based solution (like the net-XXX probers) to be able to more easily control the lifecycle of the respective probes.

Here I am a bit on the fence, given that we probably have requests queueing up in the activator that we want to start forwarding as fast as possible, and until the pod is probed it receives no requests... so 🤷

markusthoemmes commented 4 years ago

Not sure we talk about the same thing @vagababov? What does the activator have to do with pod probing in the autoscaler? :thinking:

vagababov commented 4 years ago

We only probe during scale to 0. If we consider 500 revisions all going to 0 at the same time, it's unfortunate, but much more of a corner case than simply having 500 revisions that we need to scrape :)

vagababov commented 4 years ago

But we do a lot of probing in activator :)

markusthoemmes commented 4 years ago

Okay, we clarified offline: In my comment about switching to workqueue based stuff I meant metric scraping, not probing. Sorry for the confusion.

skonto commented 4 years ago

@julz @vagababov @markusthoemmes I could help here if possible.

markusthoemmes commented 4 years ago

Yep, just know that this might be a somewhat bigger refactor of the current scraper code.

I think we first need to agree that we actually want work-pooling here as it comes with the risk of us setting the concurrency too low.

julz commented 4 years ago

so how do we want to proceed on this? Would a feature proposal doc to hash out the details and any foreseeable edge cases be the best next step?

markusthoemmes commented 4 years ago

I believe so @julz

julz commented 4 years ago

cool - @skonto happy to work together on that if you like?

skonto commented 4 years ago

@julz sure let's do it :) I will start a doc and we can iterate on it. I will ping you.

vagababov commented 3 years ago

Note that this has to be a configurable feature, since a user might wish to throw CPU at the AS to ensure enough resources to support more revisions. Finally, I'd like us to think about this in the scope of the shared autoscaler that @yanweiguo is working on. As in: will creating more autoscalers solve this problem?

skonto commented 3 years ago

@vagababov, as you probably noticed, we (cc @julz @markusthoemmes) had a long discussion on Slack (autoscaling channel). That is absolutely one of the options we discussed, but even within the context of a single instance there should be room for improvement. I will try to summarize here; what can actually be improved needs verification in any case.

The total max cost is max_scraped_pods_per_revision*number_of_revisions. The first factor is 16, so 500 revisions, for example, already allow up to 8,000 concurrent scrapes.

We could either lower the 16, or limit the two factors at any given time, e.g. control how much scraping runs or how many revisions are served.

Using a workqueue to limit scraping has other benefits besides bounding concurrency; for example, as @markusthoemmes mentioned:

> That'd enable us to easily enable/disable scraping for a specific revision, which in turn enables us to dynamically disable scraping if the activator is put into the routing path.

But there are also challenges, as mentioned in #8610. Note, though, that we already use a workqueue for probing.
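
For illustration, a rough sketch of the workqueue-based direction using the standard client-go workqueue package; the revision keys, worker count, and the scrape placeholder are illustrative and not the actual autoscaler code:

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// One queue of revision keys; a small fixed set of workers drains it, so the
	// number of concurrent scrapes is bounded regardless of how many revisions exist.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	const workers = 16
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				key, shutdown := q.Get()
				if shutdown {
					return
				}
				// A real scraper would fetch metrics for the revision behind this
				// key and call q.AddRateLimited(key) on transient failures.
				fmt.Println("scraping revision", key)
				q.Done(key)
			}
		}()
	}

	// Disabling scraping for a revision then simply means no longer enqueueing its key.
	for _, rev := range []string{"default/hello-00001", "default/hello-00002"} {
		q.Add(rev)
	}

	q.ShutDown()
	wg.Wait()
}
```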

In addition, the current maximum of 16 pods scraped per revision could be lowered if we had a smarter algorithm than one relying on the normal distribution. Maybe a time-series algorithm could do better here with fewer pods used in scraping.

The key question, as discussed, is whether we should limit the number of concurrently served revisions or how much scraping we do per revision at any time. For the latter, a related issue is this one. In that issue the max keep-alive was increased, but this could lead to running out of file descriptors if we keep increasing it.

IMHO, from a UX and operations perspective, we could make both the number of revisions served and the workqueue concurrency configurable, or expose other options (if we go down that path). In analogy to how HTTP servers limit requests, we can in any case set a global maximum per AS instance with a good default. We could also think of more advanced concepts like QoS and priorities per group of revisions (not sure that concept exists at the moment).
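
To make this concrete, a sketch of what such a configuration surface might look like as a Go struct; every field name and default below is hypothetical, and none of these knobs exist in Knative today:

```go
package main

import "fmt"

// scraperConfig is a hypothetical configuration surface for bounding scrape
// work per autoscaler instance; these are illustrative knobs, not real options.
type scraperConfig struct {
	// MaxConcurrentScrapes caps in-flight scrapes across all revisions,
	// analogous to how an HTTP server caps concurrent requests.
	MaxConcurrentScrapes int
	// MaxScrapedPodsPerRevision corresponds to the current hard-coded 16.
	MaxScrapedPodsPerRevision int
	// MaxRevisionsServed optionally limits how many revisions a single
	// autoscaler instance scrapes at once (0 = unlimited).
	MaxRevisionsServed int
}

func defaultScraperConfig() scraperConfig {
	return scraperConfig{
		MaxConcurrentScrapes:      256,
		MaxScrapedPodsPerRevision: 16,
		MaxRevisionsServed:        0,
	}
}

func main() {
	fmt.Printf("%+v\n", defaultScraperConfig())
}
```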

skonto commented 3 years ago

@vagababov @julz should we move on with a workqueue approach, have a first version and take it from there?

markusthoemmes commented 3 years ago

We might first want to think about the guarding measures we want to take. That would make accepting/denying this a no-brainer imo.

How do we want to measure success/failure or rather better/worse once we do this?

skonto commented 3 years ago

@markusthoemmes from my POV, the definition of success would be to add configuration capabilities while keeping the same performance as in the unbounded case and avoiding corner cases like the one described in #8610.

markusthoemmes commented 3 years ago

Hm, but wouldn't we want to show that implementing this provides an advantage over the existing way of doing things (which doesn't require hand-tuning either)?

We might want to produce a micro-benchmark that gives us an idea of the scale needed to reap benefits, and of the scale at which things get pushed so far that they start to degrade.

skonto commented 3 years ago

The way I see this is that the workqueue imposes a rate-limiting strategy in order to optimize revision-scraping throughput and latency. From an operations perspective, the current approach implicitly requires hand-tuning if you have to increase your connection pool size in large environments, and btw there is no control over bursty traffic. Setting a configurable limit brings its own value, in my POV, if it protects you from failures, keeps the system stable, and helps make progress on pending work, e.g. revision processing. For the latter we had the question of what matters first: revision progress or the scraping process in general.
So yes, we also need to show how the workqueue will help us without increasing connection pools in large environments, because if at the end of the day the unbounded approach makes progress and just queues up work, then who cares. However, I suspect there are issues already, e.g. GC, as stated in the description.

vagababov commented 3 years ago

I would also want to see how the StatefulSet AS scale-out works, since I don't want us to complicate the system if there's no necessity to.

skonto commented 3 years ago

@vagababov reading these benchmarks from the ants lib suggests that unbounded goroutines (independently of scraping) are most likely not the best choice beyond a certain concurrency, e.g. 100K tasks in that benchmark. So I think we need a better approach anyway, because I suspect that although scaling out would solve the problem, an optimized instance could lead to fewer replicas being needed.

vagababov commented 3 years ago

100K goroutines is probably well above what we support per task in any case...

skonto commented 3 years ago

@julz describes an issue that occurs in large environments, so we need to clarify how many pods are being scraped when the issues occur. Btw, in theory, if we follow the guidelines for building large clusters, we could have thousands of pods; the limit is 150K. However, the benchmarks I mentioned depend on the processing being done, which there is just a dummy function; I don't think the threshold would be the same with a realistic workload. In the latter case the cost per scraping request could be much higher (in ms) and could use more resources because of HTTP connections when keep-alive pools are exceeded, etc. I think a benchmark could shed more light on this, as also suggested by @markusthoemmes (assuming it's realistic enough).

markusthoemmes commented 3 years ago

Yeah let's write a Golang benchmark (I don't think an E2E one is necessary) and see if we can find a break even point between just spawning new goroutines and pooling them for our workloads.
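
A self-contained micro-benchmark along those lines might look roughly like this; the simulated scrape is just a `time.Sleep` placeholder, so real conclusions would need the actual scrape path:

```go
package scrapebench

import (
	"sync"
	"testing"
	"time"
)

// simulatedScrape stands in for one metrics scrape; real measurements would
// need to exercise the actual HTTP scrape path instead.
func simulatedScrape() {
	time.Sleep(100 * time.Microsecond)
}

const tasks = 10000

// BenchmarkUnboundedGoroutines spawns one goroutine per scrape, mirroring the
// current behaviour of having no global bound.
func BenchmarkUnboundedGoroutines(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for t := 0; t < tasks; t++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				simulatedScrape()
			}()
		}
		wg.Wait()
	}
}

// BenchmarkBoundedPool pushes the same work through a fixed number of workers.
func BenchmarkBoundedPool(b *testing.B) {
	const workers = 128
	for i := 0; i < b.N; i++ {
		ch := make(chan struct{}, tasks)
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for range ch {
					simulatedScrape()
				}
			}()
		}
		for t := 0; t < tasks; t++ {
			ch <- struct{}{}
		}
		close(ch)
		wg.Wait()
	}
}
```

Running it with `go test -bench=. -benchmem` would give a first idea of where the break-even point sits between spawning and pooling for a given task cost.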

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

nader-ziada commented 2 years ago

/reopen

/lifecycle frozen

knative-prow[bot] commented 2 years ago

@nader-ziada: Reopened this issue.

In response to [this](https://github.com/knative/serving/issues/8377#issuecomment-1129187597):

> /reopen
>
> /lifecycle frozen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dprotaso commented 1 year ago

/triage accepted

/milestone Icebox