r-dbi / odbc

Connect to ODBC databases (using the DBI interface)
https://odbc.r-dbi.org/

Document `EnableQueryResultDownload` for `databricks()` #856

Open MCMaurer opened 1 month ago

MCMaurer commented 1 month ago

Without setting `EnableQueryResultDownload='0'` in the Databricks connection string, queries returning more than ~40k rows will return no data and a relatively cryptic error message:

[Simba][Hardy] (35) Error from server: error code: '0' error message: '[Simba][Hardy] (134) File fe48a1be-0a4e-482e-aa09-3ec21215756d: A retriable error occurred while attempting to download a result file from the cloud store but the retry limit had been exceeded. Error Detail: File fe48a1be-0a4e-482e-aa09-3ec21215756d: An error had occurred while attempting to download the result file.The error is: Couldn't resolve host name. Since the connection has been configured to consider all result file download as retriable errors, we will attempt to retry the download.'.

However, all it takes to solve this is adding `EnableQueryResultDownload='0'` to the connection string or the `dbConnect()` call. Then all rows will be returned.
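For anyone hitting this, here's a minimal sketch of the workaround (the `httpPath` value is a placeholder for your own SQL warehouse path; extra named arguments to `dbConnect()` are passed through to the ODBC connection string):

```r
library(DBI)

# Placeholder httpPath; substitute your own SQL warehouse path.
# The extra EnableQueryResultDownload argument is forwarded to the
# Simba ODBC driver via the connection string, disabling Cloud Fetch.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/...",
  EnableQueryResultDownload = "0"
)
```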

This isn't found anywhere within Databricks' documentation, and figuring it out involves a fair bit of Stack Overflow sleuthing. It's also very different behavior compared to many other DBI connections, which could lead to extra confusion. I think it would be helpful to include this tip in the documentation for odbc::databricks().

atheriel commented 1 month ago

Are there any potential downsides to setting this by default?

MCMaurer commented 1 month ago

I couldn't find much about the setting, given that it's not even in Databricks' own documentation. I would be hesitant to override Databricks' default behavior though, since it also applies to pyodbc and other ODBC connections. I think keeping the default behavior but adding it to the list of documented arguments would be a good balance.

simonpcouch commented 1 month ago

It looks like EnableQueryResultDownload toggles "Cloud Fetch", which is some sort of query optimization technique for BI tools. I think I agree that it's probably not optimal for us to set this by default.

I'd be tempted to try and catch + rethrow this error with a more informative message, telling users to set that parameter. There's not much in this error message that feels specific to this issue, but it does seem like "[Simba][Hardy] (134)" and "cloud store but the retry limit" are enough to narrow down the search results to only those with a solution pointing to EnableQueryResultDownload.
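Something like this rough sketch, i.e. not a proposed final implementation; `with_cloud_fetch_hint()` is just an invented wrapper name:

```r
# Rough sketch only: wrap a query and rethrow the Cloud Fetch failure
# with a hint pointing at EnableQueryResultDownload.
with_cloud_fetch_hint <- function(expr) {
  tryCatch(
    expr,
    error = function(cnd) {
      msg <- conditionMessage(cnd)
      if (grepl("[Simba][Hardy] (134)", msg, fixed = TRUE) &&
          grepl("cloud store but the retry limit", msg, fixed = TRUE)) {
        stop(
          paste0(
            msg,
            "\nThis looks like a Cloud Fetch failure. Try adding ",
            "EnableQueryResultDownload = \"0\" to your dbConnect() call."
          ),
          call. = FALSE
        )
      }
      stop(cnd)
    }
  )
}

# Usage:
# res <- with_cloud_fetch_hint(DBI::dbGetQuery(con, "SELECT * FROM big_table"))
```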

MCMaurer commented 1 month ago

I think throwing a more informative error and suggesting this setting would be great. You're right that by copying the right bits of the error message into a search engine, a relatively savvy user could figure it out, but it wouldn't hurt to save folks that step.

Just from a user's perspective, the expectation is that the success of a query made via DBI/odbc won't depend on the number of rows returned, other than the possibility of running out of memory in R. The default for databricks() doesn't meet this expectation, which isn't the worst thing, as long as users can get to the expected behavior fairly quickly.

It's also weird that while Cloud Fetch is mentioned wrt Databricks ODBC in the Azure documentation:

> Cloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Azure Databricks.

there is no mention of the parameter that needs to be set in the connection string.

hadley commented 1 month ago

We've pinged our internal Databricks contacts, so we'll wait to hear back from them before we implement anything.

zacdav-db commented 1 month ago

@MCMaurer we don't recommend setting this by default, as Cloud Fetch is a significant performance improvement for larger result sets. Additionally, it's not something we publicly document, as we consider this setting a last resort when the underlying issue cannot be resolved.

We'd love to help get to the bottom of the underlying cause for the data not appearing, which from the error, is likely networking related.

> ... The error is: Couldn't resolve host name ...
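As a quick sanity check (a sketch only; the storage host below is a placeholder, since the truncated error doesn't show the real result-file host), you can test DNS resolution directly from R:

```r
# Hypothetical host name; the real result-file host comes from your
# workspace's cloud storage and isn't shown in the truncated error.
curl::nslookup("dbstorage.blob.core.windows.net", error = FALSE)
# Returns the resolved IP on success, or NULL if the name can't be
# resolved (e.g. behind a proxy or VPN that blocks cloud storage).
```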

If you have a dedicated Databricks contact already, please reach out and submit a ticket; otherwise we can try to arrange a way to get to the bottom of it.

MCMaurer commented 1 month ago

Thanks all, I just got contacted by our organization's Databricks rep. I'll update here if anything pertinent comes up in our conversation.