man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Retry on transient Mongo errors #1593

Open IvoDD opened 4 months ago

IvoDD commented 4 months ago

Is your feature request related to a problem? Please describe.

Retry behavior currently differs between storages. For example, the S3 SDK retries with backoff on retryable errors, but other storages lack the needed retry logic (e.g. the Mongo storage doesn't retry on connection errors).

Describe the solution you'd like

Ideally we'd have a common exception handling mechanism that is independent of the storage, so that the retry configuration parameters are shared across all storages. One downside of this approach is that generic retry logic might not be optimal for a specific storage. E.g. the default S3 exception handling will probably be better than anything we can write that has to work with all storages.

At the very least we should add some retry logic for the storages which don't have proper retry & backoff logic.
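To make the "common mechanism with common configuration parameters" idea concrete, here is a minimal Python sketch of a storage-independent retry helper with exponential backoff and jitter. Everything in it (`RetryConfig`, `call_with_retry`, the parameter names and defaults) is hypothetical and only for illustration; it is not ArcticDB's API, and the real implementation would likely live in the C++ storage layer.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical retry settings shared by every storage backend."""
    max_attempts: int = 5
    initial_backoff_s: float = 0.1
    backoff_multiplier: float = 2.0
    max_backoff_s: float = 5.0


def call_with_retry(
    op: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    config: RetryConfig = RetryConfig(),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(), retrying with jittered exponential backoff on retryable errors."""
    backoff = config.initial_backoff_s
    for attempt in range(1, config.max_attempts + 1):
        try:
            return op()
        except Exception as exc:
            # Give up on the last attempt or on errors the storage
            # does not consider transient.
            if attempt == config.max_attempts or not is_retryable(exc):
                raise
            # Sleep a random amount up to the current backoff ceiling,
            # then grow the ceiling for the next attempt.
            sleep(random.uniform(0, backoff))
            backoff = min(backoff * config.backoff_multiplier, config.max_backoff_s)
    raise AssertionError("unreachable: the loop always returns or raises")
```

A storage without its own retry logic would wrap its calls in `call_with_retry` and supply an `is_retryable` predicate, while a storage whose SDK already retries well (such as S3) could skip the wrapper or use `max_attempts=1`.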

Additional context

Internal slack thread discussing the issue here. In short: a user was getting E_UNEXPECTED_MONGO_ERROR and asked whether they should retry on it. We decided to let them retry for now, but we should properly fix our retry behavior soonish.

IvoDD commented 4 months ago

From Hamza: We could catch all the retryable errors at the storage level and map them into a new exception category, E_STORAGE_RETRYABLE, which can then be retried outside the storages. This way each storage decides which exceptions are retryable, while the retries themselves happen outside the storage, so all storages behave the same way when handling retryable exceptions, if that is what we want.
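A rough Python sketch of that split, under stated assumptions: `StorageRetryableError` stands in for the proposed E_STORAGE_RETRYABLE category, and `FakeMongoConnectionError` and `FakeMongoStorage` are made-up stand-ins for the Mongo driver and storage, not real ArcticDB or driver classes. The storage only classifies errors; the caller owns the retry loop.

```python
import time


class StorageRetryableError(Exception):
    """Hypothetical stand-in for the proposed E_STORAGE_RETRYABLE category."""


class FakeMongoConnectionError(Exception):
    """Made-up stand-in for a transient error raised by a Mongo driver."""


class FakeMongoStorage:
    """Toy storage that fails a fixed number of times before succeeding."""

    def __init__(self, failures_before_success: int) -> None:
        self._remaining_failures = failures_before_success

    def _driver_read(self, key: str) -> str:
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise FakeMongoConnectionError("connection reset")
        return f"value-for-{key}"

    def read(self, key: str) -> str:
        # Storage level: decide which native errors are retryable and
        # translate them into the common category. No retrying here.
        try:
            return self._driver_read(key)
        except FakeMongoConnectionError as exc:
            raise StorageRetryableError(str(exc)) from exc


def read_with_retry(storage, key, max_attempts=3, backoff_s=0.1, sleep=time.sleep):
    """Outside the storage: retry only the common retryable category."""
    for attempt in range(1, max_attempts + 1):
        try:
            return storage.read(key)
        except StorageRetryableError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: backoff_s, 2 * backoff_s, 4 * backoff_s, ...
            sleep(backoff_s * 2 ** (attempt - 1))
```

With this shape, adding a new storage only requires mapping its transient errors to the common category; the retry policy stays in one place.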