Open fridex opened 3 years ago
Is your feature request related to a problem? Please describe.
As a Thoth user/operator, I would like to know how much time I need to wait to have a resolved software stack available from the recommender system. To support this, we could expose an estimated time for an advise request to finish. As we have information about the maximum time allocated for advisers and about the number of queued/pending/running advise requests, we can provide an estimate of the time needed to retrieve adviser results from the system.
Describe the solution you'd like
Provide a metric that shows the estimated wait time for the adviser to provide results. This can later be exposed on user-api and shown to users (e.g. in the thamos CLI).
The metric can be generalized for other jobs we run - package-extract, provenance-check, ...
Isn't workflow task latency something that gives an estimation already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s). This issue can be more detailed depending on the recommendation, number of packages, etc.? wdyt?
Isn't workflow task latency something that gives an estimation already? We know on average the percentage of workflows successful in a certain duration bucket, from the best case (< 5s) to the worst case (> 900s).
If I understand this metric correctly, it is more about putting tasks in a workflow into buckets so we have information about tasks and their duration.
This issue can be more detailed depending on the recommendation, number of packages, etc.? wdyt?
It might be worth keeping this simple - even a request with one direct dependency can result in a huge resolved software stack. For all the recommendation types we assign a maximum amount of CPU time allocated per request in the cluster - this is an upper boundary applied to all the recommendation types; only `latest` can (but does not necessarily have to) finish sooner. This upper boundary can be used to estimate how much time will be required to serve user requests based on the quota and resource allocation in the adviser workflow.
Example:
We know we can serve 5 requests in parallel in the backend namespace. Users scheduled 10 advisers.
If 15 minutes are allocated per advise request in the cluster, the first 5 advisers finish in 15 minutes and the other 5 advisers finish in 30 minutes (15+15). With the system in this state, a possible 11th request coming to the system will be satisfied in 45 minutes (it can be satisfied sooner, but also later - see below) - this is the $SUBJ metric.
As we also run Kebechet in the backend namespace, things might get complicated if the namespace is polluted with Kebechet pods. But having that estimation (and possibly improving it) can still be valuable so we can see how the system behaves and what resource allocation we need to sanely satisfy the userbase we have (estimating SLA).
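A minimal sketch of that upper-bound estimate, assuming we only know the backend parallelism, the number of advise requests already queued or running, and the per-request time allocation (the names here are illustrative, not an existing Thoth API):

```python
import math


def estimate_wait_upper_bound(requests_ahead: int, parallelism: int, minutes_per_request: float) -> float:
    """Upper bound (in minutes) on the wait time for a newly submitted advise request.

    Requests ahead of the new one drain through `parallelism` slots in "waves",
    each wave taking at most `minutes_per_request`; the new request then runs
    in the following wave.
    """
    waves_ahead = math.ceil(requests_ahead / parallelism)
    return (waves_ahead + 1) * minutes_per_request


# Example from above: 10 advisers in the system, 5 parallel slots, 15 minutes
# allocated per request -> an 11th request is satisfied within 45 minutes.
print(estimate_wait_upper_bound(10, 5, 15))  # 45.0
```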
@pacospace is this good to go or do we still need information? If you are happy feel free to change prio etc...
/sig observability
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@sesheta: Closing this issue.
/reopen /remove-lifecycle rotten
@fridex: Reopened this issue.
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@sesheta: Closing this issue.
@pacospace could this be another data driven development topic?
@pacospace could this be another data driven development topic?
Sure sounds good!
/project observability
/sig observability
What about:
n_p = number of parallel workflows that can run in the namespace (backend)
Workflows running in backend:
n_a = number of adviser workflows running
n_k = number of kebechet workflows running
n_pc = number of provenance-checker workflows running
n_p = n_a + n_k + n_pc
tav_a = average time an adviser workflow runs
tav_k = average time a kebechet workflow runs
tav_pc = average time a provenance-checker workflow runs
t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)
All those metrics are already available in Prometheus, so we can estimate that.
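A sketch of how this could be computed once the counts and average run times are queried from Prometheus; the function below just implements the formula above, and the argument names are illustrative:

```python
def estimate_wait_time_advise(
    n_a: int,  # adviser workflows currently running
    n_k: int,  # kebechet workflows currently running
    n_pc: int,  # provenance-checker workflows currently running
    tav_a: float,  # average adviser run time (seconds)
    tav_k: float,  # average kebechet run time (seconds)
    tav_pc: float,  # average provenance-checker run time (seconds)
) -> float:
    """t_wait_time_advise = tav_a * n_a + tav_k * n_k + tav_pc * n_pc."""
    return tav_a * n_a + tav_k * n_k + tav_pc * n_pc


# Example: 3 advisers (~600s each), 1 kebechet (~120s), 1 provenance check (~60s).
print(estimate_wait_time_advise(3, 1, 1, 600.0, 120.0, 60.0))  # 1980.0
```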
Sounds good. What about counting also requests that are queued?
Sounds good. What about counting also requests that are queued?
I have to check how to get that number from Kafka, but in theory we can do that, yes! And do we want to provide this information at the user-API level?
t_wait_time_advise = tav_a x n_a + tav_k x n_k + tav_pc x n_pc = tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a)
All those metrics are already available in Prometheus, so we can estimate that.
based on @fridex suggestion:
t_wait_time_advise = tav_a x kafka_adviser_requests_queued + (tav_a x (n_p - n_k - n_pc) + tav_k x (n_p - n_a - n_pc) + tav_pc x (n_p - n_k - n_a))
kafka_adviser_requests_queued = number of adviser message requests queued in Kafka.
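A sketch of the extended formula, reusing the running-workflows estimate above and adding the queued adviser messages; how `kafka_adviser_requests_queued` is obtained is discussed below:

```python
def estimate_wait_time_with_queue(
    kafka_adviser_requests_queued: int,
    tav_a: float,
    running_workflows_estimate: float,
) -> float:
    """t_wait_time_advise = tav_a * queued + estimate for workflows already running."""
    return tav_a * kafka_adviser_requests_queued + running_workflows_estimate


# Example: 4 queued adviser messages on top of the 1980s running-workflows estimate above.
print(estimate_wait_time_with_queue(4, 600.0, 1980.0))  # 4380.0
```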
Based on a conversation with @KPostOffice, we should consider:
kafka_adviser_requests_queued = _get_current_offset_from_strimzi_metrics() - _get_investigator_consumer_offset()
using metrics from Strimzi; moreover, @KPostOffice pointed out that it is important to take partitions into account:
(current_offset_p1 - consumer_offset_p1) + (current_offset_p2 - consumer_offset_p2) + ... + (current_offset_pN - consumer_offset_pN)
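A minimal sketch of that per-partition calculation, assuming the current (log-end) and consumer offsets per partition have already been fetched from the Strimzi / consumer-group metrics (the dicts map partition id to offset; no real metric-fetching code is shown):

```python
from typing import Dict


def kafka_adviser_requests_queued(
    current_offsets: Dict[int, int],
    consumer_offsets: Dict[int, int],
) -> int:
    """Sum of (current_offset - consumer_offset) over all partitions of the topic."""
    return sum(
        current_offsets[p] - consumer_offsets.get(p, 0)
        for p in current_offsets
    )


# Example with three partitions: lags of 5, 0 and 5 messages -> 10 queued.
print(kafka_adviser_requests_queued({0: 120, 1: 98, 2: 45}, {0: 115, 1: 98, 2: 40}))  # 10
```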
Sounds interesting 👍🏻 It might be a good idea to discuss this at the tech talk.
@harshad16, are Strimzi metrics collected by Prometheus in smaug and aws?
@pacospace sorry for missing your question here. I would have to check on this; maybe we need to create the ServiceMonitor for this.
One method to solve this is to calculate from the Kafka queue and the number of adviser workflows scheduled per hour (see the sketch after the reference below).
Acceptance criteria
[thoth user metric](https://github.com/thoth-station/thoth-application/blob/master/grafana-dashboard/base/thoth-service-metrics.json)
Reference:
kafka_log_log_logendoffset{topic_name="aws-prod.thoth.kebechet-trigger"} - current_partition_offsets{topic_name="aws-prod.thoth.kebechet-trigger"}
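A sketch of that throughput-based method, assuming the queued-message count (e.g. from the offset difference referenced above) and the number of adviser workflows scheduled per hour are both available; the names below are illustrative:

```python
def estimate_wait_hours(messages_queued: int, advisers_scheduled_per_hour: float) -> float:
    """Rough wait-time estimate in hours: queue length divided by hourly throughput."""
    if advisers_scheduled_per_hour <= 0:
        return float("inf")  # nothing is currently draining the queue
    return messages_queued / advisers_scheduled_per_hour


# Example: 30 messages queued, ~20 adviser workflows scheduled per hour -> ~1.5 hours.
print(estimate_wait_hours(30, 20))  # 1.5
```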
/triage accepted