thoth-station / metrics-exporter

This is a Prometheus exporter for Thoth.
GNU General Public License v3.0

Estimate wait time for advise requests #727

Open fridex opened 3 years ago

fridex commented 3 years ago

Is your feature request related to a problem? Please describe.

As a Thoth user/operator, I would like to know how much time I need to wait to have a resolved software stack available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and information about the number of queued/pending/running advise requests, we can provide an estimation of the time needed to retrieve adviser results from the system.

Describe the solution you'd like

Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be provided on user-api and shown to users (e.g. in the thamos CLI).

The metric can be generalized for other jobs we run - package-extract, provenance-check, ...
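
A minimal sketch of how such a metric could be exposed with prometheus_client, using a label for the job type so the same gauge generalizes to package-extract and provenance-check (the metric name, label, and value here are hypothetical, not something the exporter currently provides):

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical metric name and label - the real exporter may choose different ones.
estimated_wait_seconds = Gauge(
    "thoth_estimated_request_wait_seconds",
    "Estimated wait time until a newly submitted request of the given type is served.",
    ["request_type"],  # adviser, package-extract, provenance-check, ...
)

def update_estimates() -> None:
    # The value would come from the estimation discussed below; hard-coded here for illustration.
    estimated_wait_seconds.labels(request_type="adviser").set(45 * 60)

if __name__ == "__main__":
    start_http_server(8080)  # expose /metrics for Prometheus to scrape
    update_estimates()
```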

pacospace commented 3 years ago

Is your feature request related to a problem? Please describe.

As a Thoth user/operator, I would like to know how much time I need to wait to have a resolved software stack available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and information about the number of queued/pending/running advise requests, we can provide an estimation of the time needed to retrieve adviser results from the system.

Describe the solution you'd like

Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be provided on user-api and shown to users (e.g. in the thamos CLI).

The metric can be generalized for other jobs we run - package-extract, provenance-check, ...

Isn't workflow task latency something that gives an estimation already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s). Could this be made more detailed depending on the recommendation, the number of packages, etc.? wdyt?

fridex commented 3 years ago

Isn't workflow task latency something that gives an estimation already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).

If I understand this metric correctly, it is more about putting the tasks in a workflow into buckets, so we have information about tasks and their duration.

Could this be made more detailed depending on the recommendation, the number of packages, etc.? wdyt?

It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we assign a maximum amount of CPU time that is allocated per request in the cluster - this is an upper boundary applied to all the recommendation types; only the latest recommendation type can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests, based on the quota and resource allocation in the adviser workflow.

Example:

We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.

If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (it can be satisfied sooner, but also later - see below) - this is the $SUBJ metric.
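
A minimal sketch of this back-of-the-envelope estimate, assuming the only inputs are the number of advise requests already in the system, the number of parallel slots in the backend namespace, and the per-request time allocation (all names here are hypothetical):

```python
import math

def estimated_wait_minutes(requests_ahead: int, parallel_slots: int, minutes_per_request: int) -> int:
    """Upper-bound wait time for a newly submitted advise request.

    requests_ahead: advise requests already queued/pending/running.
    parallel_slots: advisers that can run in parallel in the backend namespace.
    minutes_per_request: maximum time allocated per advise request.
    """
    # The new request lands in batch number ceil((requests_ahead + 1) / parallel_slots),
    # and each batch takes at most minutes_per_request to drain.
    batches = math.ceil((requests_ahead + 1) / parallel_slots)
    return batches * minutes_per_request

# The example above: 10 advisers already scheduled, 5 parallel slots, 15 minutes each -> 45.
print(estimated_wait_minutes(10, 5, 15))
```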

As we also run kebechet in the backend namespace, things might get complicated if the namespace is polluted with kebechet pods. But having that estimation (and possibly improving it) can still be valuable, so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating the SLA).

goern commented 3 years ago

@pacospace is this good to go or do we still need information? If you are happy feel free to change prio etc...

/sig observability

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/metrics-exporter/issues/727#issuecomment-894289498):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

fridex commented 3 years ago

/reopen
/remove-lifecycle rotten

sesheta commented 3 years ago

@fridex: Reopened this issue.

In response to [this](https://github.com/thoth-station/metrics-exporter/issues/727#issuecomment-895010351):

> /reopen
> /remove-lifecycle rotten

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

sesheta commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 3 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/thoth-station/metrics-exporter/issues/727#issuecomment-922458199):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

goern commented 3 years ago

@pacospace could this be another data-driven development topic?

pacospace commented 3 years ago

@pacospace could this be another data-driven development topic?

Sure sounds good!

goern commented 2 years ago

/project observability

goern commented 2 years ago

/sig observability

pacospace commented 2 years ago

Isn't workflow task latency something that gives an estimation already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).

If I understand this metric correctly, it is more about putting the tasks in a workflow into buckets, so we have information about tasks and their duration.

Could this be made more detailed depending on the recommendation, the number of packages, etc.? wdyt?

It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we assign a maximum amount of CPU time that is allocated per request in the cluster - this is an upper boundary applied to all the recommendation types; only the latest recommendation type can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests, based on the quota and resource allocation in the adviser workflow.

Example:

We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.

If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (it can be satisfied sooner, but also later - see below) - this is the $SUBJ metric.

As we also run kebechet in the backend namespace, things might get complicated if the namespace is polluted with kebechet pods. But having that estimation (and possibly improving it) can still be valuable, so we see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating the SLA).

What about the following?

n_p = number of parallel workflows that can run in the namespace (backend)

Workflows running in the backend namespace:

n_a = number of adviser workflows running
n_k = number of kebechet workflows running
n_pc = number of provenance-checker workflows running

n_p = n_a + n_k + n_pc

tav_a = average time an adviser workflow runs
tav_k = average time a kebechet workflow runs
tav_pc = average time a provenance-checker workflow runs

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can compute this estimate.
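
A minimal sketch of the formula above, assuming the per-type workflow counts and average durations have already been read from Prometheus (the function and parameter names are hypothetical):

```python
def estimate_advise_wait_time(
    n_a: int, n_k: int, n_pc: int,              # running adviser / kebechet / provenance-checker workflows
    tav_a: float, tav_k: float, tav_pc: float,  # average run time per workflow type, in seconds
) -> float:
    """Estimated wait time for a new advise request, following the formula above."""
    return tav_a * n_a + tav_k * n_k + tav_pc * n_pc
```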

fridex commented 2 years ago

Sounds good. What about also counting requests that are queued?

pacospace commented 2 years ago

Sounds good. What about also counting requests that are queued?

I have to check how to get that number from Kafka, but in theory we can do that, yes! And do we want to provide this information at the user-API level?

pacospace commented 2 years ago

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can compute this estimate.

Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.
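
A minimal extension of the earlier sketch that includes the queued requests, again with hypothetical names:

```python
def estimate_advise_wait_time_with_queue(
    kafka_adviser_requests_queued: int,         # adviser message requests still queued in Kafka
    n_a: int, n_k: int, n_pc: int,              # running adviser / kebechet / provenance-checker workflows
    tav_a: float, tav_k: float, tav_pc: float,  # average run time per workflow type, in seconds
) -> float:
    """Estimated wait time including adviser requests still queued in Kafka."""
    currently_running = tav_a * n_a + tav_k * n_k + tav_pc * n_pc
    return tav_a * kafka_adviser_requests_queued + currently_running
```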

pacospace commented 2 years ago

t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

All those metrics are already available in Prometheus, so we can compute this estimate.

Based on @fridex's suggestion:

t_wait_time_advise = tav_a x kafka_adviser_requests_queued + tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)

kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.

Based on a conversation with @KPostOffice, we should consider: kafka_adviser_requests_queued = _get_current_offset_from_strimzi_metrics() - _get_investigator_consumer_offset(), using metrics from Strimzi. Moreover, @KPostOffice pointed out that it is important to take partitions into account:

(current_offset_p1 - consumer_offset_p1) + (current_offset_p2 - consumer_offset_p2) + ... + (current_offset_pN - consumer_offset_pN)
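
A minimal sketch of the per-partition lag sum, assuming the latest (end) offsets and the investigator consumer offsets per partition have already been fetched from the Strimzi/Kafka exporter metrics (the function and argument names are hypothetical):

```python
from typing import Dict

def kafka_adviser_requests_queued(
    current_offsets: Dict[int, int],   # partition -> latest offset on the adviser request topic
    consumer_offsets: Dict[int, int],  # partition -> offset committed by the investigator consumer
) -> int:
    """Number of adviser messages not yet consumed, summed over all partitions."""
    return sum(
        current_offsets[partition] - consumer_offsets.get(partition, 0)
        for partition in current_offsets
    )
```
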
fridex commented 2 years ago

Sounds interesting 👍🏻 It might be a good idea to discuss this at the tech talk.

pacospace commented 2 years ago

@harshad16, are Strimzi metrics collected by Prometheus in smaug and AWS?

harshad16 commented 2 years ago

@pacospace sorry for missing your question here. I would have to check on this; maybe we need to create a service monitor for it.

harshad16 commented 2 years ago

One method to solve this is to calculate the estimate from the Kafka queue length and the number of adviser workflows scheduled per hour.
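
A rough sketch of that rate-based estimate, assuming the Kafka queue length and the hourly adviser scheduling rate are read from existing Prometheus metrics (the names below are hypothetical):

```python
def estimated_wait_hours(queued_requests: int, advisers_scheduled_per_hour: float) -> float:
    """How long the current queue takes to drain at the observed hourly scheduling rate."""
    if advisers_scheduled_per_hour <= 0:
        return float("inf")  # nothing is being scheduled, so the queue does not drain
    return queued_requests / advisers_scheduled_per_hour
```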

Acceptance criteria

Reference:

harshad16 commented 2 years ago

/triage accepted