tursodatabase / libsql

libSQL is a fork of SQLite that is both Open Source, and Open Contributions.
https://turso.tech/libsql
MIT License

bottomless: do not skip frames even if there is a temporary issue with S3 #1693

Closed sivukhin closed 1 month ago

sivukhin commented 1 month ago

Context

The current implementation of bottomless skips frames that failed to upload to S3. This creates a significant issue when restoring from such a backup: bottomless can restore a DB only from consecutive frames, so the first gap in the frames simply stops the restore process.

So, in case of a slight fluctuation in S3 (such as a sqld network issue or some degradation of S3), the backup will be "frozen" in time, and all changes made to the DB after that fluctuation will not be restored from the current generation (the next generation will have a full snapshot, so the issue will disappear after a generation bump happens).
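To make the gap problem concrete, here is a minimal sketch of how far a restore can get when it only accepts consecutive frames. The function name `last_restorable_frame` and the use of plain `u32` frame numbers are assumptions for illustration, not the actual bottomless code.

```rust
/// Hypothetical helper: given the frame numbers that actually reached S3,
/// return the last frame a restore can apply when starting at `first`
/// and stopping at the first gap.
fn last_restorable_frame(uploaded: &[u32], first: u32) -> Option<u32> {
    let mut sorted = uploaded.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut expected = first;
    let mut last = None;
    for frame in sorted {
        if frame < expected {
            continue; // older than the starting point, ignore
        }
        if frame != expected {
            break; // gap: everything after this point is unreachable
        }
        last = Some(frame);
        expected += 1;
    }
    last
}
```

With frames 1, 2, 3, 5, 6 uploaded and frame 4 skipped, this returns `Some(3)`: frames 5 and 6 exist in the backup but are never applied.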

This PR addresses this nuance of the bottomless implementation: instead of a bounded queue of frames for upload with fail-fast logic on S3 upload failure, it now has an unbounded channel of frames for S3 upload (in order not to block the local backup process) with retry logic in case of failure. On shutdown, bottomless will try to finish all pending S3 upload attempts but will not perform any retries (so, during shutdown, bottomless works in a "best-effort" mode and can skip some frames in case of S3 issues).
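A rough sketch of the behavior described above, assuming a synchronous `std::sync::mpsc` channel in place of whatever async channel the PR actually uses. The name `drain_uploads`, the `upload` callback, and the `shutting_down` flag are hypothetical, and a real uploader would back off between retries.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Receiver;

/// Hypothetical uploader loop: takes frames from an unbounded channel and
/// retries each upload until it succeeds. Once `shutting_down` is set, a
/// failed upload is not retried (best-effort mode) and the frame is skipped.
/// Returns the frames that were skipped.
fn drain_uploads<F>(rx: Receiver<u32>, shutting_down: &AtomicBool, mut upload: F) -> Vec<u32>
where
    F: FnMut(u32) -> Result<(), String>,
{
    let mut skipped = Vec::new();
    // The loop ends once every sender is dropped and the channel is empty.
    for frame in rx {
        loop {
            match upload(frame) {
                Ok(()) => break,
                Err(_) if shutting_down.load(Ordering::Relaxed) => {
                    skipped.push(frame); // shutdown: one attempt, no retry
                    break;
                }
                Err(_) => {
                    // Normal operation: try the same frame again so that no
                    // gap appears. A real implementation would sleep here.
                }
            }
        }
    }
    skipped
}
```

Because the channel from `std::sync::mpsc::channel()` is unbounded, the producer never blocks on `send`, but frames pile up in memory for as long as uploads keep failing.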

athoscouto commented 1 month ago

This approach could lead to unbounded memory usage, or did I get it wrong? Another way to address the same issue: if we skip any number of frames, we drop the queue and send a new snapshot. It will increase our network usage, but memory-wise we should be safer.

sivukhin commented 1 month ago

@athoscouto, yes - this fix leads to unbounded memory usage in case of S3 issues.

I'm not sure about your snapshot idea because if S3 is not working for frames, why will it work for snapshots?

MarinPostma commented 1 month ago

afaik, bottomless won't perform a new snapshot until the previous one has completed

athoscouto commented 1 month ago

> I'm not sure about your snapshot idea because if S3 is not working for frames - why will it work for snapshots?

The idea is that it will work eventually. And until then we don't need to keep track of frames. When it works we just upload a new snapshot from the database file.
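A minimal sketch of this alternative as a small state machine, under stated assumptions: the `Backup` type, the `Action` enum, and the method names are all hypothetical. It also ignores how frames written while a snapshot upload is in flight would be handled.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
enum Action {
    UploadFrame(u32),
    UploadSnapshot,
    Idle,
}

/// Hypothetical backup state: either a queue of pending frames, or a flag
/// saying "the frame history is broken, upload a fresh snapshot instead".
struct Backup {
    queue: VecDeque<u32>,
    needs_snapshot: bool,
}

impl Backup {
    fn new() -> Self {
        Backup { queue: VecDeque::new(), needs_snapshot: false }
    }

    /// New frames are not tracked while a snapshot is pending, since the
    /// snapshot is taken from the database file and will include them.
    fn push_frame(&mut self, frame: u32) {
        if !self.needs_snapshot {
            self.queue.push_back(frame);
        }
    }

    fn next_action(&self) -> Action {
        if self.needs_snapshot {
            Action::UploadSnapshot
        } else if let Some(&frame) = self.queue.front() {
            Action::UploadFrame(frame)
        } else {
            Action::Idle
        }
    }

    /// Report the outcome of the action returned by `next_action`.
    fn report(&mut self, ok: bool) {
        if self.needs_snapshot {
            // Keep retrying the snapshot until S3 eventually works.
            if ok {
                self.needs_snapshot = false;
            }
        } else if ok {
            self.queue.pop_front();
        } else {
            // A frame failed: drop the queue (freeing memory) and fall back
            // to a full snapshot.
            self.queue.clear();
            self.needs_snapshot = true;
        }
    }
}
```

The trade-off is the one raised earlier in the thread: memory stays bounded during an S3 outage, at the cost of uploading a full snapshot once S3 recovers.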