moonbeam-nyc / snorlax

wake and sleep Kubernetes deployments on a schedule
https://moonbeam.nyc
Apache License 2.0
134 stars 9 forks

Feature Request: Sleep when no requests are received for a certain period of time #4

Open kehralexander opened 1 month ago

kehralexander commented 1 month ago

I'd like to open this feature request to track and discuss a potential feature that has already come up: sending the deployment(s) back to sleep.

Why?

Currently, once a deployment has been woken, it stays awake until the next "bedtime". That behaviour is fine if the user is simply very early, but not desirable if the user only works late hours.

Considerations

I guess snorlax needs a way to sit between the ingress and the deployment? A pod that simply proxies all the traffic comes to mind for that.

Other

This feature might also help with not going to sleep while the deployment is still in use (receiving some amount of traffic), even if it is bedtime.

kehralexander commented 1 month ago

For our environment and use case (which is why this is a comment and not part of the issue description above :smile:), an additional pod is not very desirable, as it would be running all the time and would effectively cost us money. I think this could be mitigated by making the proxy pod extensible in some way: we're already running an nginx deployment per app instance, so combining that with the snorlax proxy, however that might look, would be handy for us, and maybe for others too.

But perhaps there's a better solution to monitor ingress traffic which does not involve some sort of proxy pod at all.

Akenatwn commented 1 month ago

I see two parts here, one being a kind of undesired behaviour (1) and one being a feature request (2):

  1. The deployment gets put to sleep even though there is (ingress) traffic ongoing
  2. Within the bedtime the deployment gets put to sleep after some predefined time of no (ingress) traffic

I can see how the implementation could cover both at once, but in case a choice has to be made about the order, I think the above would make sense.

azlyth commented 1 month ago

Yeah, this is something I've been thinking about. A sidecar proxy sounds like it could work.

And being able to sense inactivity would allow for more configuration, which would be useful.

Another possibility would be to parse default ingress controller logs to try and support the general case at first. Though one concern would be an ingress controller that's spitting out a lot of logs.

Example ingress-nginx log line below; it contains [default-dummy-frontend-80], which is [namespace]-[service]-[port]:

ingress-nginx-controller-768f948f8f-rz9hq controller 10.244.0.1 - - [07/Jun/2024:05:42:47 +0000] "GET / HTTP/1.1" 499 0 "http://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36" 1127 0.000 [default-dummy-frontend-80] [] 10.244.0.13:80 0 0.000 - c992c571eeae156e8391645049260ed3
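To sketch what parsing that format could look like, here is a small Go example that pulls the bracketed upstream key out of such a line. The regex assumes the default log format and lowercase DNS-style names, and would need to be configurable in practice:

```go
package main

import (
	"fmt"
	"regexp"
)

// upstreamRe matches the [namespace]-[service]-[port] field. The timestamp
// bracket contains slashes and colons, and "[]" is empty, so a narrow
// character class ending in a port number is enough to tell them apart.
var upstreamRe = regexp.MustCompile(`\[([a-z0-9-]+-\d+)\]`)

// upstreamKey returns the upstream key from one access-log line, if present.
func upstreamKey(line string) (string, bool) {
	m := upstreamRe.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	line := `ingress-nginx-controller-768f948f8f-rz9hq controller 10.244.0.1 - - [07/Jun/2024:05:42:47 +0000] "GET / HTTP/1.1" 499 0 "http://localhost/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36" 1127 0.000 [default-dummy-frontend-80] [] 10.244.0.13:80 0 0.000 - c992c571eeae156e8391645049260ed3`
	key, ok := upstreamKey(line)
	fmt.Println(key, ok) // default-dummy-frontend-80 true
}
```

A watcher could tail the controller pod's logs and bump a per-key last-seen timestamp for every match, which is where the volume concern above kicks in.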

kehralexander commented 1 month ago

Another possibility would be to parse default ingress controller logs to try and support the general case at first. Though one concern would be an ingress controller that's spitting out a lot of logs.

While this could indeed provide a plug-and-play solution, I think it would probably scale horribly. Something like that could work if the logs were queried from something like Loki or CloudWatch, but then we'd be depending on tooling that might not be available in the cluster, so it wouldn't really be generic.

azlyth commented 1 month ago

I tend to agree re: scale if an ingress controller handles a lot of traffic, but I'm not quite sure where that threshold is. Datadog agents are able to collect and send all logs that a busy ingress controller spits out, so maybe it's not too bad.

Re: the sidecar proxy, two things to think about are:

🤔

azlyth commented 1 month ago

Another way I'm considering is integrating with some service mesh (probably linkerd to start with). That way Snorlax would be able to query Prometheus with something like:

# Check HTTP requests for a specific deployment
sum(rate(linkerd_request_total{deployment="your-deployment"}[1m]))

So the changes would be:

One part that I like about this design is that users only need to add the linkerd annotation on their workloads, and we let linkerd handle the sidecar / network proxying.

Update: One complication with ^ is that there are health check requests (e.g. ELB healthcheck) which would be counted.

azlyth commented 1 month ago

https://github.com/elazarl/goproxy 👀

kehralexander commented 1 month ago

[...] Snorlax would be able to query Prometheus [...]

This sounds like an excellent idea. How about, for a first implementation, making the Prometheus query configurable? If applications already export request-count metrics, those could simply be used, with no need for any proxy in that case.
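A configurable query could be a single field on the schedule resource. Purely a hypothetical sketch: the apiVersion and field names below are invented for illustration, not snorlax's actual CRD:

```yaml
apiVersion: snorlax.example.com/v1
kind: SleepSchedule
metadata:
  name: dummy-frontend
spec:
  # hypothetical: sleep once this query stays at zero for idleTimeout
  # during the sleep window
  activityQuery: 'sum(rate(http_requests_total{deployment="dummy-frontend"}[1m]))'
  idleTimeout: 15m
```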

azlyth commented 1 month ago

Yeah, it'd be a nice integration, but we wouldn't be able to filter health check requests out of the request counts. In the environment I have snorlax deployed to, the deployments receive HTTP checks from the AWS ALB load balancer.

I've actually started working on the sidecar proxy, since I feel more comfortable with the idea if a library is doing the heavy lifting; there are rarely-used HTTP edge cases I'd rather not have to handle myself.

I'm aiming to get it done sometime within the next week.