kubecost / cost-analyzer-helm-chart

Kubecost helm chart
http://kubecost.com/install
Apache License 2.0
466 stars 413 forks source link

Failed ETL Notifications #1152

Closed kirbsauce closed 7 months ago

kirbsauce commented 2 years ago

What problem are you trying to solve? I would like to be more proactive when ETL fails, as opposed to having to monitor diagnostics.html

Describe the solution you'd like A notification/alert when an ETL build fails

Describe alternatives you've considered Have considered polling /model/etl/status, but I haven't been able to find a success status, only "isRepairing". I've also considered looking at the "size", but it could result in misses or false positives.

How would users interact with this feature? Not terribly concerned with how to enable it, but I would expect the normal email and/or slack delivery method would be offered to maintain some consistency.

┆Issue is synchronized with this Jira Task by Unito

mbolt35 commented 2 years ago

Just want to chime in here and provide some further details as to why this particular issue is a bit more complicated that it may seem.

With respect to /model/etl/status, there are 2 very specific fields on each entry's status that would indicate a failure. If there exist a warning or error during the creation of the ETL entry, those fields would exist. This is an indicator of incomplete data, or a complete failure. To indicate empty data (an ETL day that was missing data due to metrics for that time range not existing - ie: I've been running prometheus for 14 days, I query for 21 days, 7 of those days will [legitimately] be empty), we check file size for < 200 bytes. While it does seem possible that this arbitrary byte count could yield a false result, it's not very probable.

Where this gets tricky is there are some cases where querying prometheus or thanos yields empty results where we would expect to receive results -- there are cases like this where a warning is returned No StoreAPI available but we handle this warning and propagate it. We've noticed this behavior more often when there are many store endpoints involved in data aggregation with thanos (larger scale). These cases are very difficult to identify without manual intervention and very hard to reproduce, which is why active monitoring for such a condition is a challenge.

kirbsauce commented 2 years ago

Thanks @mbolt35 , agreed. It certainly isn't as easy as one would think. I assumed the presence of "isRepairing": false meant that was the only status/state. Now that I think of it, there must be more information, at times, in /etl/status results as I expect the frontend leverages the same endpoint.

Currently, users are navigating to the diagnostics.html page to see if anything is in the RED, at which point they will attempt to repair it. Maybe the same Red/Yellow/Green logic could be used to trigger a notification?

dahamr commented 2 years ago

My company would also like some sort of alerting for etl build failures - especially since we currently have to intervene to get the etl rebuilt. Our primary alerting environment is prometheus so our preference is alerts based on promql queries.

MrColeC commented 2 years ago

We are also interested in this - which appears related to what is in https://github.com/kubecost/cost-model/issues/264

Adam-Stack-PM commented 2 years ago

@nikovacevic Did the work in v1.94 set us up to add this notification?

nikovacevic commented 2 years ago

@AdamStack18 it really depends on what we mean by "when ETL fails." We're probably in a marginally better place now. Mostly, this appears to be a matter of just deciding what constitutes a failure, then putting in the work to create new alert(s) of whatever kind people would like to see (e.g. email, Slack, etc.)

Adam-Stack-PM commented 2 years ago

Action Item: Create a list of events/situations that should trigger an ETL failure notification.

@kbrwn Would you be willing to help create that list?

CC @dwbrown2

Adam-Stack-PM commented 2 years ago

@kaelanspatel and @mbolt35, If we don't have it already during test and doc week could you help put together a consolidated list of the events and alerts we added?

keithhand commented 1 year ago

I just wanted to bump this as we have some additional interest in a way to report on potential ETL failures. Perhaps we could send an alert now with a message like "Potential issue; please investigate as it could be a false alarm" to try and prevent cases where we are picking up failure anomalies.

AjayTripathy commented 1 year ago

Let's group together these monitoring/alerting tasks and look to tackle in v1.99

kirbsauce commented 1 year ago

More customers are asking about this. @Adam-Stack-PM any chance this will be in 1.99?

Adam-Stack-PM commented 1 year ago

@kirbsauce Not going to be in v1.99 but I added it to the proposal list for v1.100.

teevans commented 1 year ago

@nikovacevic - Can I bother you to slot this for extra work at the end of this release? Would love to get it into 1.100 if possible

nikovacevic commented 1 year ago

Very low chance of this making it into 1.100, but I'll add it to the list. We need product vision here. cc @dwbrown2 and Kai (does Kai have a GH?)

nikovacevic commented 1 year ago

We also don't yet have a good way of determining whether not ETL has "failed" per se. See Bolt's comment above.

nikovacevic commented 1 year ago

In v1.100, I'll plan to move the needle here by at least (1) taking a best-effort guess at "when ETL failed" and (2) emitting a new event -- something like ETLAllocationComputeWarning -- but we'll need participation from others to take it from here to "notifications" in a user-facing sense.

The workaround, then, would be to poll /model/etl/logs and look for those warnings.

cc @kwombach12 @teevans

AjayTripathy commented 1 year ago

That seems ok if we can produce some documentation and send it to customers for a first pass. Can we add that to the ticket here when done?

dwbrown2 commented 1 year ago

@nikovacevic thoughts on making this more generic if we create a Prom metric? e.g. ETLComputeFailure? Then the event could have label(s) that provide more details, e.g. assets vs allocation, etc. Generally thinking is that as a consumer of these metrics, I'd like to keep my alerts as simple as possible.

nikovacevic commented 1 year ago

@dwbrown2 for a Prom metric, totally agree that simpler would be better. We can make the domain and status a label. Biggest thing is still just how to detect in the first place and differentiate from a rightly-empty data set... e.g. at start up, or rebuilding back past the Prom retention or something.

What do you think about making it less opinionated? Like kubecost_etl_event with statuses like error or empty or success? Then users can interpret empty themselves?

AjayTripathy commented 1 year ago

I think that's a really good idea as well.

dwbrown2 commented 1 year ago

I like this idea! Seems practically an alert would look/feel something like this?

Webb Kubecost, Co-founder

On Wed, Jan 25, 2023 at 2:47 PM Niko Kovacevic @.***> wrote:

@dwbrown2 https://github.com/dwbrown2 for a Prom metric, totally agree that simpler would be better. We can make the domain and status a label. Biggest thing is still just how to detect in the first place and differentiate from a rightly-empty data set... e.g. at start up, or rebuilding back past the Prom retention or something.

What do you think about making it less opinionated? Like kubecost_etl_event with statuses like error or empty or success? Then users can interpret empty themselves?

— Reply to this email directly, view it on GitHub https://github.com/kubecost/cost-analyzer-helm-chart/issues/1152#issuecomment-1404325442, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACI257ZXYOINRU43PPL3DDWUGUR3ANCNFSM5JCU52TQ . You are receiving this because you were mentioned.Message ID: @.***>

nikovacevic commented 1 year ago

@dwbrown2 it'd be a bit more complicated, but that's the gist yeah. Seems like a nice solution to me! (We already try to detect errors, and it doesn't work... more often Thanos is down and returns empty data w/o an error. So we'd be counting "unexpected increases in count of empty etl data" events.)

Spoke with @kwombach12 this morning about the product side.

dwbrown2 commented 1 year ago

Awesome, let me know if I can help in other areas. Still aiming for 1.100?

kwombach12 commented 1 year ago

@dwbrown2 it'd be a bit more complicated, but that's the gist yeah. Seems like a nice solution to me! (We already try to detect errors, and it doesn't work... more often Thanos is down and returns empty data w/o an error. So we'd be counting "unexpected increases in count of empty etl data" events.)

Spoke with @kwombach12 this morning about the product side.

My proposal would be to add a Warning to the Diagnostic Check notifications. The user could then click 'learn more' and be shown maybe the last 5 days of the ETL Status metric so they can see how the number of successes/empties/errors have changed. What do you think @nikovacevic? Is this doable in 1.100?

cc: @dwbrown2

nikovacevic commented 1 year ago

Going to start this afternoon, after reviews are done. Will get as far as I can in ~1h of work, and we can continue the work next week after code freeze as time permits. I'd expect a full feature (not just a new metric emission) to be v1.101 thing.

nikovacevic commented 1 year ago

Update: too much to review on code freeze day. This will have to wait until next week, or perhaps next release.

nikovacevic commented 1 year ago

No IC time. Planning for v1.101.

github-actions[bot] commented 7 months ago

This issue has been marked as stale because it has been open for 360 days with no activity. Please remove the stale label or comment or this issue will be closed in 5 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 365 days with no activity.