sinkingpoint / prometheus-gravel-gateway

A Prometheus Aggregation Gateway for FAAS applications
GNU Lesser General Public License v3.0

Persistent state #4

Open kradalby opened 2 years ago

kradalby commented 2 years ago

Hi

I saw your talk at KubeCon EU and found this project quite interesting. One thing I was wondering is how you tackle persistent state in your use case (if it is needed at all).

As far as I can see from the readme and the code, nothing is written to disk and the gateway is "more or less" stateless.

I assume this means that if the gateway is restarted, it loses the metrics that were already in it, and you "risk" a scrape with absent metrics?

How do you tackle this issue in your setup, and is persisting the state something you would consider useful? The state could, for example, be written periodically to blob/S3-like storage, allowing the gateway to start up again with the data from the last session.
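A minimal sketch of that idea in Rust (the gateway's language): periodically snapshot the aggregated state to durable storage and restore it on startup. Everything here is hypothetical and simplified, not the gateway's actual data model: metrics are reduced to a name-to-value map, and "storage" is a local file written via temp-file-plus-rename; a real implementation would serialize the gateway's real aggregation state and could target an S3 bucket instead of the filesystem.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

// Serialize the aggregated metric state as simple "name value" lines.
fn encode(state: &HashMap<String, f64>) -> String {
    let mut lines: Vec<String> =
        state.iter().map(|(k, v)| format!("{} {}", k, v)).collect();
    lines.sort();
    lines.join("\n")
}

// Parse "name value" lines back into a metric map, skipping malformed lines.
fn decode(data: &str) -> HashMap<String, f64> {
    data.lines()
        .filter_map(|line| {
            let mut parts = line.splitn(2, ' ');
            let name = parts.next()?.to_string();
            let value = parts.next()?.parse().ok()?;
            Some((name, value))
        })
        .collect()
}

// Write the snapshot to a temp file, then rename it into place, so a crash
// mid-write never leaves a truncated snapshot behind.
fn save_snapshot(state: &HashMap<String, f64>, path: &Path) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, encode(state))?;
    fs::rename(&tmp, path)
}

// On startup, restore the previous session's state if a snapshot exists;
// otherwise start empty, exactly as the gateway does today.
fn load_snapshot(path: &Path) -> HashMap<String, f64> {
    fs::read_to_string(path)
        .map(|s| decode(&s))
        .unwrap_or_default()
}
```

Calling `save_snapshot` on a timer (or on shutdown) and `load_snapshot` once at boot would be enough to survive a restart with the last session's data.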

sinkingpoint commented 2 years ago

Hi there! Similar to other Push Gateways, we don't persist state here. The logic is that this behaves as sort of a proxy for your metrics - if you restart a standard service, then you lose state so we figured that it wasn't important here.

That being said, I think the idea of persisting state, at least for an opt-in subset of metrics, is interesting. Did you have a use case in mind?

kradalby commented 2 years ago

> Similar to other Push Gateways, we don't persist state here. The logic is that this behaves as sort of a proxy for your metrics - if you restart a standard service, then you lose state so we figured that it wasn't important here.

I understand. We have been using the "original" Pushgateway, which does allow you to define a persistence file. It is written to the filesystem, which we have had varied success with when running on Kubernetes with network storage.
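For reference, the upstream Pushgateway exposes this through its persistence flags; something analogous could be the opt-in here. A minimal invocation (the file path is a placeholder):

```shell
# Periodically rewrite the state file so metrics survive a restart.
pushgateway \
  --persistence.file=/data/pushgateway.state \
  --persistence.interval=5m
```

On a Kubernetes persistent volume this works until the network filesystem misbehaves, which is what motivates the blob/S3 suggestion below.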

> That being said, I think the idea of persisting state, at least for an opt-in subset of metrics is interesting. Did you have a use case in mind?

Our use case is batch jobs and other types of long-running jobs that have a "looser" coupling to the push gateway: the push gateway acts as a continuous proxy, but the jobs come and go throughout the day.

The reason to have the push gateway "recover" after a restart is very long-running jobs that might not emit metrics continuously. In that scenario, if the push gateway is restarted, we might end up in a situation where the last metric emitted by a job is "lost" and marked absent by Prometheus, since it is no longer in the push gateway.

I agree that this isn't necessary for every use case, so an opt-in sounds like a reasonable approach.

Blob/S3 is suggested to make the application a bit more resilient than storage that attempts to mimic POSIX semantics (like a persistent Kubernetes volume) but can fail because of networking issues etc.