Open MCMaurer opened 1 month ago
Are there any potential downsides to setting this by default?
I couldn't find much about the setting, given that it's not even in Databricks' own documentation. I would be hesitant to override Databricks' default behavior though, since it also applies to pyodbc
and other ODBC connections. I think keeping the default behavior but adding it to the list of documented arguments would be a good balance.
It looks like EnableQueryResultDownload
toggles "Cloud Fetch", which is some sort of query optimization technique for BI tools. I think I agree that it's probably not optimal for us to set this by default.
I'd be tempted to try and catch + rethrow this error with a more informative message, telling users to set that parameter. There's not much in this error message that feels specific to this issue, but it does seem like "[Simba][Hardy] (134)"
and "cloud store but the retry limit"
are enough to narrow down the search results to only those with a solution pointing to EnableQueryResultDownload
.
I think throwing a more informative error and suggesting this setting would be great. You're right that with some good error message copying and googling, a relatively savvy user could figure it out, but I figure it wouldn't hurt to save folks that step.
I'm thinking just from a user perspective, the expectation is that the success of query made via DBI/odbc won't depend on the number of rows returned, other than the possibility of running out of memory in R. The default for databricks()
doesn't meet this expectation, which isn't the worst thing, as long as users can pretty quickly get to the expected behavior.
It's also weird that while Cloud Fetch is mentioned wrt Databricks ODBC in the Azure documentation:
Cloud Fetch is only used for query results larger than 1 MB. Smaller results are retrieved directly from Azure Databricks.
There is no mention of the parameter that needs to be set in the connection string.
We've pinged out internal databricks contacts, so we'll wait to hear back from them before we implement anything.
@MCMaurer we don't recommend setting this by default as its a significant performance improvement for larger result sets. Additionally, its not something we publicly document as we consider this setting a last resort if the underlying issue cannot be resolved.
We'd love to help get to the bottom of the underlying cause for the data not appearing, which from the error, is likely networking related.
... The error is: Couldn't resolve host name ...
If you have a dedicated Databricks contact already, please reach out and submit a ticket, otherwise we can try arrange a way to get to bottom of it.
Thanks all- just got contacted by our organization's Databricks rep. I'll update here if there's anything pertinent that comes up in our conversation.
Without setting
EnableQueryResultDownload='0'
in the Databricks connection string, queries of > ~40k rows will return no data and a relatively cryptic error message:However, all it takes to solve this is adding
EnableQueryResultDownload='0'
to the connection string or DBI() call. Then all rows will be returned.This isn't found anywhere within Databricks' documentation, and figuring it out involves a fair bit of Stack Overflow sleuthing. It's also very different behavior compared to many other DBI connections, which could lead to extra confusion. I think it would be helpful to include this tip in the documentation for
odbc::databricks()
.