rocicorp / replicache

Realtime Sync for Any Backend Stack
https://doc.replicache.dev
1.02k stars 37 forks source link

"The database connection is closing." error again (10.x specific?) #973

Open aboodman opened 2 years ago

aboodman commented 2 years ago

Le sigh. See: https://discord.com/channels/830183651022471199/966042424273150033

Tasks:

  1. [ ] When this error happens (and the Replicache instance hasn't been closed) try to reopen the connection to the idb database. If we find the database is still there resume normal operation. If we find the database has been deleted throw a specific Error indicating this. This change will: a) automatically recover when possible and b) enable discerning between connections randomly closing (which Google search results suggests does happen) and connections closing due to idb database being deleted.
  2. [ ] If (1) proves useful, add an explicit error callback to Replicache for notifying the developer when we detect the idb database has been deleted (similar to https://doc.replicache.dev/api/classes/Replicache#onclientstatenotfound).
  3. [ ] Ensure that all api entry points query/mutator/etc check if Replicache is already closed and throw a specific Error in this case.
  4. [ ] Ensure that all background processes check if Replicache is already closed before trying to access idb (we generally do this but our probably missing a case or two).
tmcw commented 2 years ago

The overall dataset is small, so this might change, but so far this error has only occurred on Chrome 100.

arv commented 2 years ago

I'm intermittently see this in tests too:

▶ npm run test

> replicache@10.0.0-alpha.4 test
> web-test-runner

src/replicache-mutation-recovery.test.ts:

 ❌ successfully recovering mutations of client with same schema version and replicache format version (failed on Webkit)
      InvalidStateError: Failed to execute 'transaction' on 'IDBDatabase': The database connection is closing.
        at readImpl (src/kv/idb-store.ts:158:28)
        at src/kv/idb-store.ts:35:26

Firefox:  |██████████████████████████████| 50/50 test files | 338 passed, 0 failed, 8 skipped
Chromium: |██████████████████████████████| 51/51 test files | 338 passed, 0 failed, 9 skipped
Webkit:   |██████████████████████████████| 50/50 test files | 337 passed, 1 failed, 8 skipped

Finished running tests in 17.2s with 1 failed tests.

@grgbkr

One way to make these go away is to prevent concurrency in the test runner -- but we do want to support concurrency so we need to better deal with exceptions during closing of Replicache. One possibility would be to pass in the AbortSignal through many more layers...

koahmad commented 2 years ago

I'm getting this error frequently in local dev. It might be related to hot-reloading. I still haven't deployed / tested a prod version, but it's a bit annoying in dev.

My Chrome version is 102.0.5005.115 (Official Build) (arm64).

grgbkr commented 2 years ago

Plan to investigate this more today.

@koahmad Can you share the details of your local dev setup?

koahmad commented 2 years ago

Yes -- I'm using NextJS (10.2.3) and use next dev to start the app. What else would be helpful to you?

grgbkr commented 2 years ago

@koahmad thanks, that is enough to help me get started on reproing.

grgbkr commented 2 years ago

More information on this. Another of our customers reports they see this error in their error reporting from prod.

The browser breakdown is image image

Errors from one session: image

grgbkr commented 2 years ago

This can happen in production when

  1. the idb is deleted a. via code b. via devtools c. or by the browser due to hitting size quota
  2. close is called on the replicache instance, and then apis like replicache.query, or replicache.mutate.* are used.
  3. the browser closes the idb connection for some other reason (it shouldn't but google results suggests it randomly does).

For 1 we can't recover, but we could detect and notify the developer via an error callback (similar to https://doc.replicache.dev/api/classes/Replicache#onclientstatenotfound).

For 2 we should check if already closed in query/mutate/etc, and throw a specific error.

For 3 we should try to recover by re-openning the idb conneciton.

aboodman commented 2 years ago

Perhaps we could collapse (1) and (3) by trying to reopen and if we find that the db is gone (😬) notifying user.

grgbkr commented 2 years ago

Wanted to share an update on our current understanding of the impact of this bug when it occurs in production. If the client view is < 100MBs in size the impact will be largely invisible to end users.

When the client view is < 100MB the impact for the affected tab is:

  1. pending mutations are not persisted locally, and so if the tab is closed before pending mutations are pushed they will be lost (as mutation recovery can not recover non-persisted mutations)
  2. newly synced snapshots are not persisted, resulting in new tabs having to sync a bit more on start up

Other than the above the client will function normally. Mutators, querys, subscribe and watches all continue to work. Sync will continue to work. This is because all of these work against a memory cache that is lazyily persisted to IndexedDB. If the connection to IndexedDB is lost they all continue to work correctly against the memory cache and things just fail to persist.

If the client view is > 100MBs, there is more likely to be end user visible impact. Currently Replicache's memory cache has a size limit of 100MB (using a LRU eviction strategy). If the IndexedDB connection is lost and there is a memory cache miss during execution of a Replicache operation (mutator, query, subscribe, watch, push or pull), then that operation will fail.

grgbkr commented 2 years ago

Ok I think we now have a solid plan for addressing this:

  1. When this error happens (and the Replicache instance hasn't been closed) try to reopen the connection to the idb database. If we find the database is still there resume normal operation. If we find the database has been deleted throw a specific Error indicating this. This change will: a) automatically recover when possible and b) enable discerning between connections randomly closing (which Google search results suggests does happen) and connections closing due to idb database being deleted.
  2. If (1) proves useful, add an explicit error callback to Replicache for notifying the developer when we detect the idb database has been deleted (similar to https://doc.replicache.dev/api/classes/Replicache#onclientstatenotfound).
  3. Ensure that all api entry points query/mutator/etc check if Replicache is already closed and throw a specific Error in this case.
  4. Ensure that all background processes check if Replicache is already closed before trying to access idb (we generally do this but are probably missing a case or two).

We will do (1) this week, and (2) shortly after.

We should hold off on (3) and (4) until after https://github.com/rocicorp/replicache/issues/1009. The full offline support change involves a lot of significant refactors, and I worry if we do this before offline there is a non-trivial chance of those refactors regressing it. Also (3) and (4) won't improve anything for end users, they just provide nicer error reporting for developers.

aboodman commented 2 years ago

I think eventually there should also be a (5) which is a bigger project:

grgbkr commented 2 years ago

I think eventually there should also be a (5) which is a bigger project:

  • some sort of first-class feature around managing quota and eviction.

Agree, though I'd track that feature as a separate issue.

grgbkr commented 2 years ago

Adding some relevant bug reports about unexpected idb connection closes from other projects: https://bugs.webkit.org/show_bug.cgi?id=197050 https://bugs.chromium.org/p/chromium/issues/detail?id=1085724 https://github.com/jensarps/IDBWrapper/issues/80 https://github.com/firebase/firebase-js-sdk/issues/1533

aboodman commented 2 years ago

Update: There's a new beta release of Replicache: 11.2.1 that contains code to help us understand what's happening here. If you are seeing this error, please install:

npm add replicache@11.2.1

Then deploy to production. You should see one of two things happen:

  1. The error goes away. In that case, the rumors are true: Chrome will sometimes mysteriously close connections to IDB and just reopening it fixes it 🤷‍♂️.
  2. The error changes to Replicache IndexedDB not found: <name>. In that case, the databases are getting deleted -- perhaps due to quota or some other reason.

Please let us know what happens. Thanks!

tmcw commented 1 year ago

So, it went away, it seemed, but it's back - still seeing these issues cropping up in development.

CleanShot 2022-10-07 at 10 23 48@2x

aboodman commented 1 year ago

What version of Replicache?

tmcw commented 1 year ago

At currently:

=> Found "replicache@11.3.1"

I'm still pretty interested in a version of Replicache that doesn't use IDB. I really think the costs outweigh the benefits of the local-storage approach for my usecase, and the browser implementations of IDB are apparently really flawed.

arv commented 1 year ago

@grgbkr said he had some ideas why there is still a possible case of this.

grgbkr commented 1 year ago

Some good new on this. We found a way to easily reproduce this by using replicache in a codesandbox. The super aggressive HMR makes it easy to reproduce. Now that we can repro, we can more reliably debug and fix.