Scheduled query failures should be easier to find and understand

fbertsch commented 6 years ago

It's hard to track scheduled queries failing, because not only do they not alert, but the errors never show!

I tested this with a failing query here: https://sql.telemetry.mozilla.org/queries/50316/source

To fix this, we should offer:

Error messages for failed scheduled queries
Email alerts on scheduled query failure (should be able to turn these off)
- Failure emails should probably be configured differently for different query types, e.g. ones that run every 5 mins. vs every week
Scheduled query failures could alert in re:dash (e.g. a flag notification at the top of re:dash), which when clicked takes the user to a dashboard of failed scheduled queries

Bug 1432317 was a duplicate of this.

jezdez commented 6 years ago

@fbertsch This is a non-trivial set of features that we should not add to our fork. Let's see if upstream is interested before we start implementing.

@alison985 Please file the issues upstream and get feedback on whether this is in scope for upstream or whether we should start looking to implement it as an extension in redash-stmo.

alison985 commented 6 years ago

Started an issue at https://github.com/getredash/redash/issues/2593

alison985 commented 6 years ago

Arik is for this but wants to discuss implementation first.

@jezdez, for review before I post in the upstream issue, I would propose:

Priority 1:

A new page for a user to see scheduled query runs of queries they own or have permissions to modify. Let's call this /scheduled-query-list. This would be a version of http://0.0.0.0:5000/admin/queries/tasks#done filtered to queries the user owns or can modify.
- I think this should be linked to in the main menu as well as the alert discussed below because it adds the nice feature of being able to check on your scheduled queries in general. I would propose to put it under the "Edit Profile" link in the menu under the user's icon and name.

Priority 2:

Email alerts on query failure. In the Refresh Schedule dialog we would add a field for how often you want error emails. The options should be, at least for this first implementation, "every run", "once an hour", "once a week", "once a month". The email would link back to /scheduled-query-list and include the name of the query.
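
As a rough illustration of the frequency options above, here is a minimal sketch of how a throttle for error emails might work. The function name, the `MIN_GAP` table, and the 30-day approximation of "once a month" are all assumptions made for this example, not anything that exists in Redash.

```python
from datetime import datetime, timedelta
from typing import Optional

# Minimum gap between error emails for each proposed option.
# "once a month" is approximated as 30 days (an assumption).
MIN_GAP = {
    "every run": timedelta(0),
    "once an hour": timedelta(hours=1),
    "once a week": timedelta(weeks=1),
    "once a month": timedelta(days=30),
}


def should_send_error_email(
    frequency: str, last_sent: Optional[datetime], now: datetime
) -> bool:
    """Return True if an error email is due for a run that just failed."""
    if last_sent is None:
        # No email has been sent for this query yet.
        return True
    return now - last_sent >= MIN_GAP[frequency]
```

With "every run" the gap is zero, so every failed run would trigger an email; the other options suppress emails until the gap has elapsed.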

Priority 3:

An alert at the top of pages linking to the /scheduled-query-list page when there is a "failed" State for a query in the last month(?).
- I thought about using "since they last logged in" as the window, but since we don't know which page they will be seeing and it takes a database query to get that piece of information, I don't want to add that overhead to every page load.

Priority 4:

Add a red warning icon to the view source view of the query page if the query schedule has failed.
- I think the easiest thing to implement would be to show the red icon and a link to the /scheduled-query-list page if the query has ever failed; however, I can imagine that would start getting ignored. I think, ideally, it would make sense for it to be for "this version" of the query, but the query version feature doesn't exist upstream yet. I think failure in X amount of time is reasonable, but the X amount of time would have to be based on the refresh schedule of the query. Perhaps "if failure within the last 2 cycles + 1 second" then show the red icon.

jezdez commented 6 years ago

@alison985 Thank you for creating the list of priorities for this new feature, and I agree this is a good idea to solve in a discussion. A bit of feedback:

Prio 1: I'm concerned that the proposed list of scheduled query runs is way too technical for the most common usage pattern and creates more noise than signal for the users, especially if it's offered right next to a high level link like the profile link.

Since users don't have a personalized "homepage" or "dashboard" page it's hard to imagine where the page would be linked from instead though. Instead I'd like to propose to simply move the list of query runs to the individual query detail page (e.g. under /queries//runs) and link to that from the (Prio 2) emails. That should cut down the number of items in the table dramatically and remove it from attention unless a user looks for the specific runs.

To close the attention gap in case a user's query runs fail, I'd like Prio 3 to be shown on any page, basically introducing a way to track important notifications like failure states with a dropdown or similar (not a full list, to simplify query time). I'd be interested to hear what upstream thinks about adding such a section or if it'd be overkill.

Prio 3: Good point on the query time for alerts, we can think about async ways to load that list of alerts though, and optimize on the backend by using Redis' Sorted Set data type for caching (for example).

Prio 4: Nice, if the query runs move next to the view source view of a query, then it's even easier to link from one to the other to indicate that something may be wrong with the query.

Thanks again, this is nicely done, let's see what upstream thinks!

jezdez commented 6 years ago

Oh I forgot two things:

Prio 2: Let's implement "every run" only for now, since I'm not sure if the other patterns of sending error mails would also skew the noise to signal ratio towards noise. The worst error mails are the ones that get ignored by their recipient in my experience: 😬

Prio 3: I'm not sure about limiting the alerts to 1 month either. Let's raise that when you propose the details upstream, ok?

alison985 commented 6 years ago

@jezdez questions:

A) When you say "the list of query runs to the individual query detail page (e.g. under /queries//runs)" there is not currently a /queries/#/runs page. Are you saying to create it and use that instead of the /scheduled-query-list page I propose above? Or are you talking about the created and updated timestamp section of the query page itself? Or something else?

B) When you say "introducing a way to track important notifications like failure states with a dropdown or similar (not a full list to simplify query time)." I don't know if I follow what you mean by "dropdown". Are you trying to say that we should show that you have notifications but only make them show up in a dropdown of issues when the dropdown is selected? Or something else?

jezdez commented 6 years ago

@alison985

A) yep, a new page for the runs

B) "dropdown" is probably the wrong term, I mean a notification section that can inform users of current events for their user accounts, similar to how GitHub for example shows a small bell icon in the top header area. I guess additionally having some of the notifications show up as a dismissible pop-up, depending on the severity of the event, would make sense too.

alison985 commented 5 years ago

Final Plan with all Feedback/edits merged

Priority 1 (THIS ticket):

Store explicit metadata on failed scheduled queries: last successful execution time, last error time, number of errors since last successful execution, last error message (or maybe all the errors?). This will be cleared out when the query executes normally again. Once we have this extra information we can use it in the email.
Email alerts on query failure. In the Refresh Schedule dialog we would add a field for how often you want error emails. The option should be, at least for this first implementation, "every run". In the future, other options could be: "once an hour", "once a week", "once a month". The email would include the name of the query.
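
To make the Priority 1 metadata concrete, here is a minimal sketch of the bookkeeping described above. The class and field names are hypothetical and only illustrate the idea; they are not Redash's actual schema, and how upstream chooses to store this may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ScheduleFailureInfo:
    """Hypothetical per-query record of scheduled execution failures."""

    last_success_at: Optional[datetime] = None
    last_error_at: Optional[datetime] = None
    errors_since_success: int = 0
    last_error_message: Optional[str] = None

    def record_failure(self, when: datetime, message: str) -> None:
        # Keep only the most recent error message and count the failures.
        self.last_error_at = when
        self.errors_since_success += 1
        self.last_error_message = message

    def record_success(self, when: datetime) -> None:
        # A normal execution clears the error details again.
        self.last_success_at = when
        self.last_error_at = None
        self.errors_since_success = 0
        self.last_error_message = None
```

This sketch stores only the last error message; keeping all errors, as the plan also considers, would mean replacing `last_error_message` with a list.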

Priority/Ticket 2:

A new page for a user to see scheduled query runs of that query. Let's call this /queries/#/runs. This would be a version of http://0.0.0.0:5000/admin/queries/tasks#done filtered to that specific query. It would only be visible to users with access to that query.
Update email to have a link back to the /queries/#/runs page.

Priority/Ticket 3:

A notification alert at the top of pages to identify there is an issue to look at. Clicking it would show a list of links to the /queries/#/runs page(s) of queries that have had a "failed" state in the last month(?). (Is the last 30 days an appropriate window?)
- I thought about using "since they last logged in" as the window, but since we don't know which page they will be seeing and it takes a database query to get that piece of information, I don't want to add that overhead to every page load.
- From @jezdez: "we can think about async ways to load that list of alerts though, and optimize on the backend by using Redis' Sorted Set data type for caching (for example)."
- From Arik: "Also once we have this failure metadata we can implement the notification(s) in the UI. Although this requires some extra thought on how/when exactly we will show this, but I think it's better to work out the details when we get closer to implementing this."
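
One way the Redis Sorted Set suggestion could look is sketched below, assuming a redis-py style client is passed in. The key name `failed_queries:<user_id>`, both function names, and the 30-day default window are assumptions for illustration only. Failures are scored by their timestamp, so a time-window lookup is a single range query.

```python
import time


def record_failure(client, user_id, query_id, when=None):
    """Add a failed query to the user's sorted set, scored by failure time."""
    when = time.time() if when is None else when
    # zadd with the same member simply updates its score (latest failure).
    client.zadd(f"failed_queries:{user_id}", {str(query_id): when})


def recent_failures(client, user_id, window_seconds=30 * 24 * 3600, now=None):
    """Return ids of queries that failed within the window, oldest first."""
    now = time.time() if now is None else now
    key = f"failed_queries:{user_id}"
    cutoff = now - window_seconds
    # Trim entries at or before the cutoff so the set stays small.
    client.zremrangebyscore(key, "-inf", cutoff)
    # Note: a real redis-py client returns bytes unless it was created
    # with decode_responses=True.
    return client.zrangebyscore(key, cutoff, "+inf")
```

Because the lookup is cheap, it could be called from an asynchronous endpoint rather than on every page render.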

Priority/Ticket 4:

Add a red warning icon to the view source view of the query page if the query schedule has failed.
- I think the easiest thing to implement would be to show the red icon and a link to the /queries/#/runs page if the query has ever failed; however, I can imagine that would start getting ignored. I think, ideally, it would make sense for it to be for "this version" of the query, but the query version feature doesn't exist upstream yet. I think failure in X amount of time is reasonable, but the X amount of time would have to be based on the refresh schedule of the query. Perhaps "if failure within the last 2 cycles + 1 second" then show the red icon.
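
The "2 cycles + 1 second" rule could be expressed roughly as follows. This is only a sketch: the function name and parameters are hypothetical, and it assumes the refresh schedule is available as an interval in seconds.

```python
from datetime import datetime, timedelta
from typing import Optional


def show_warning_icon(
    last_failure: Optional[datetime], interval_seconds: int, now: datetime
) -> bool:
    """Show the red icon if the last failure is within 2 cycles + 1 second."""
    if last_failure is None:
        # The query schedule has never failed.
        return False
    window = timedelta(seconds=2 * interval_seconds + 1)
    return now - last_failure <= window
```

For a query that refreshes every 5 minutes, the icon would disappear about 10 minutes after the last failure; for a weekly query it would stay for two weeks.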

alison985 commented 5 years ago

First PR upstream: https://github.com/getredash/redash/pull/3065

mozilla / redash

Scheduled query failures should be easier to find and understand #309
