rilldata / rill

Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.
https://www.rilldata.com
Apache License 2.0
1.63k stars 111 forks source link

Runtime: Don't cache connection errors indefinitely #5076

Open begelundmuller opened 2 months ago

begelundmuller commented 2 months ago

In runtime/pkg/conncache, if opening a connection fails with an error, the error is cached indefinitely and connection re-open is not retried. This is fine for deterministic connectors like DuckDB, but makes it hard to recover from transient network errors for connectors like Druid.

It would be nice if we could configure a TTL for how long we consider connection errors valid before trying to connect again.

egor-ryashin commented 2 months ago

Assuming (simplified), conncache has a map of http.Clients structs, those clients don't have a state - if there's a transient error then all users of http-clients will be retrying/failing, once the error is gone they all get successful results again. Besides we have timeouts for Druid requests here https://github.com/rilldata/rill/blob/34b8e4a960db6a678294ac98ea946c7346521a90/runtime/drivers/druid/druidsqldriver/druid_api_sql_driver.go#L128 Could you elaborate?

begelundmuller commented 1 month ago

If opening the connection fails, the returned error is cached here:

https://github.com/rilldata/rill/blob/c4ce5c35d3c0ac9c5004dc4c2464eba0ec0d1c48/runtime/pkg/conncache/conncache.go#L240

Next time something tries to open the connection, the cached error will be returned immediately:

https://github.com/rilldata/rill/blob/c4ce5c35d3c0ac9c5004dc4c2464eba0ec0d1c48/runtime/pkg/conncache/conncache.go#L166

The goal of this issue is to have some kind of TTL on the error caching. After the TTL, it should try to connect again instead of just returning the old error.

egor-ryashin commented 1 month ago

I wonder if we have a state-machine diagram for conncache logic somewhere?

begelundmuller commented 1 month ago

Unfortunately we don't. Feel free to add one in a docstring or README.md file in the conncache package directory.

egor-ryashin commented 1 month ago

If opening the connection fails, the returned error is cached here:

Druid API cannot return an error during openning (it just creates http client struct), did I miss something? https://github.com/rilldata/rill/blob/d7c61fbb9e3adc0bdad0a064c61db7a592c03e82/runtime/drivers/druid/druidsqldriver/druid_api_sql_driver.go#L22

begelundmuller commented 1 month ago

I guess the previous issue we saw came from when Druid used the Avatica driver.

begelundmuller commented 1 month ago

I still think this PR would be nice though, since it may be a problem for other connectors (like ClickHouse). We might also consider adding a ping when opening the Druid connector to be able to show credentials errors earlier.