grafana / sqlds

A package that assists writing SQL-driven datasources
Apache License 2.0
17 stars 12 forks source link

Don't retry queries that have exceeded context deadline #63

Closed codeincarnate closed 2 years ago

codeincarnate commented 2 years ago

This PR fixes an error in which retried queries which themselves return an error are not processed correctly. This can happen on SQL data sources with long running queries that time out (e.g. Snowflake). Resulting response will cause the dashboard to think that frames have been return (even though there aren't any).

The included changes cause the error to be returned sans data frames so that the front-end error is correctly triggered.

codeincarnate commented 2 years ago

My apologies, I should have been more clear. I do think this replicates the behavior on non-retried queries, and currently if the query doesn't get retried we return nil, err in handleQuery

https://github.com/grafana/sqlds/blob/bda53fbeb95f56e03616649b8cf423a072ec26c3/datasource.go#L224

However currently on 222 query is called directly. As far as I can tell, this behavior differs slightly from the non-retried case and will return res, err (where res is an empty data frame) instead of simply nil, err.

In the cases I'm testing it does result in different results. This one will cause the front-end to hang (original):

sf-without-change

With the change, I get the following and the front-end will correctly show error:

sf-with-change

Is this some area front-end wise where the first result might be having trouble? I'm not sure if that should be considered a valid result in this case. Happy to provide more details as well.

andresmgot commented 2 years ago

Ah, I didn't notice that at the end it was not returning the res but I believe that's intentional. Note that we are checking here:

https://github.com/grafana/sqlds/blob/bda53fbeb95f56e03616649b8cf423a072ec26c3/datasource.go#L214

if the error has been produced by executing the query (which is the normal case) and in that case, we retry, returning the error and the single frame with the metadata you are seeing.

In other case, like an internal error while trying to transform rows to frames, we return an empty response (no frames).

Checking this with Athena, the frontend should be able to handle the case in which there is an error and a frame:

Screenshot from 2022-03-16 17-54-02

so you probably need to check why the plugin frontend is hanging on that situation

codeincarnate commented 2 years ago

@andresmgot ah that makes sense. Thanks for the explanation! It does appear to be something on the front-end. Talked with @scottlepp and this is stumbling on another problem (queries are retried even if context deadline wash reached) so going to adjust this one.

codeincarnate commented 2 years ago

Ok, updated to simply check that the error isn't a context deadline error in addition to seeing if it is a query error. In my testing appears to be working.

codeincarnate commented 2 years ago

The change makes sense now. E.g. do not retry if the deadline has been exceeded already but I think that your original issue may not have been solved properly.

If there is a query error (like a table not found), you are still going to get an error and a single frame with metadata (and not sure if that will cause your plugin to hang).

It can cause a hang with certain transformations, but I think the errors are there and not here.